TRAINING RUN

TRSbench Unified Methodology

One leaderboard. Ten pillars. Every public benchmark that matters.

VERSION 1.3  |  MARCH 2026


SECTION 1

What is TRSbench Unified?

TRSbench Unified is the single composite AI benchmark tracker that replaced the five separate Training Run leaderboards (TRSbench, TRUscore, TRAgents, TRFcast, TRScode). It tracks a weighted composite score measuring how well AI models perform across 10 pillars of capability, updated daily from 37 independent public sources.

Every score can be traced back to a public leaderboard. Every source link is listed below so anyone can verify the data independently. Transparency is not optional — it is the product.


SECTION 2

The TRSbench Unified Formula

COMPOSITE SCORE — TR2-UNIFIED-V1.3
TRSbench = Safety (16%) + Truth (14%) + Reasoning (13%) + Human Preference (11%) + Coding (11%) + Agent (10%) + Knowledge (8%) + Forecasting (7%) + Efficiency (5%) + Usage (5%)
PILLAR | WEIGHT | WHAT IT MEASURES | PRIMARY SOURCES
Safety | 16% | Resistance to harmful prompts, safe behavior under adversarial conditions | HELM Safety, AIR-Bench
Truth & Confabulation | 14% | Factual accuracy, hallucination resistance, truthful output | SimpleQA, FACTS, TruthfulQA, HalluHard, Vectara
Reasoning | 13% | Logic, abstraction, multi-step problem solving | ARC-AGI-2, LiveBench, HELM Capabilities, HLE
Human Preference | 11% | How humans rate model output quality in blind comparisons | Chatbot Arena, AlpacaEval
Coding | 11% | Real-world software engineering and code generation | SWE-bench, EvalPlus, LiveCodeBench, BigCodeBench, Terminal-Bench, SciCode + 3 more
Agent Capability | 10% | Autonomous task completion, tool use, multi-model coordination | GAIA, OSWorld, tau-bench, MCP Atlas, Galileo, OpenRouter
Knowledge | 8% | Breadth and depth of factual knowledge | MMLU-Pro, HELM MMLU, SimpleQA
Forecasting & Finance | 7% | Prediction accuracy, financial reasoning | ForecastBench, Rallies.ai, Alpha Arena, FinanceArena
Efficiency | 5% | Speed, cost, throughput per dollar | Artificial Analysis, PricePerToken
Usage & Adoption | 5% | Real-world developer adoption and API traffic | OpenRouter Rankings
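
For concreteness, the weighted sum can be sketched in a few lines of Python. The weights are taken from the table above; the sample scores are hypothetical, and summing only over the pillars a model actually has is an assumption in this sketch, not the production rule (coverage handling is described in Sections 5-7).

```python
# Pillar weights from the TR2-UNIFIED-V1.3 formula (they sum to 1.0).
WEIGHTS = {
    "safety": 0.16, "truth": 0.14, "reasoning": 0.13,
    "human_preference": 0.11, "coding": 0.11, "agent": 0.10,
    "knowledge": 0.08, "forecasting": 0.07,
    "efficiency": 0.05, "usage": 0.05,
}

def composite(pillar_scores: dict[str, float]) -> float:
    """Weighted composite over the pillars a model has scores for.

    Summing only over present pillars is an assumption of this sketch;
    missing-pillar handling is governed by the coverage rules below.
    """
    return sum(WEIGHTS[p] * s for p, s in pillar_scores.items())

# Hypothetical pillar scores on the 0-100 normalized scale.
print(composite({"safety": 92.0, "truth": 88.0, "reasoning": 95.0}))
```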

SECTION 3

Why These Weights?

Safety 16%

The highest weight because a model that is unsafe is unusable regardless of capability. Safety is the floor, not a feature.

Truth & Confabulation 14%

A model that confidently fabricates information is dangerous in professional contexts. Factual reliability is second only to safety.

Reasoning 13%

Abstract reasoning and multi-step logic are the core differentiators between models. This is what makes a model useful for hard problems.

Human Preference 11%

Blind human evaluations capture quality dimensions that benchmarks miss: tone, helpfulness, instruction-following, and overall utility.

Coding 11%

Software engineering is the single highest-value use case for LLMs today. Nine sub-benchmarks ensure coverage across generation, debugging, and real-world repo tasks.

Agent Capability 10%

Autonomous agents are the future of AI deployment. Task completion, tool reliability, and multi-model coordination measure readiness for production agent workflows.

Knowledge 8%

Broad factual knowledge matters but is increasingly commoditized. Weighted lower than reasoning because retrieval augmentation can compensate for knowledge gaps.

Forecasting & Finance 7%

Prediction and financial reasoning are high-value verticals that test a model's ability to reason about uncertainty and real-world outcomes.

Efficiency 5%

Speed and cost matter for production deployment. A model that is 10x more expensive for 5% better results loses in practice.

Usage & Adoption 5%

Developer adoption is a market signal. Models that developers actually choose to use carry a signal that benchmarks alone cannot capture.


SECTION 4

All 37 Sources

Every source below is scraped daily by the TR2 Unified DDP. The per-pillar counts sum to more than 37 because some sources feed multiple pillars (SimpleQA scores both Truth and Knowledge; OpenRouter feeds both Agent Capability and Usage). Click any link to see the original leaderboard and verify the data yourself.

Safety 16% — 2 sources

Truth & Confabulation 14% — 5 sources

Reasoning 13% — 4 sources

Human Preference 11% — 3 sources

Coding 11% — 9 sources

Agent Capability 10% — 7 sources

Knowledge 8% — 3 sources

Forecasting & Finance 7% — 4 sources

Efficiency 5% — 2 sources

Usage & Adoption 5% — 1 source


SECTION 5

Qualification Rules

Models are assigned one of three tiers based on data coverage:

Verified

Scores in 7+ of 10 pillars. Composite score is highly representative.

Estimated

Scores in 4-6 pillars. Composite is directionally useful but gaps exist.

Minimal

Scores in 3 or fewer pillars. Listed for tracking but treat composite with caution.
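
In code, the tier assignment is just a threshold check on pillar coverage. A minimal sketch (the function name is illustrative):

```python
def qualification_tier(pillars_covered: int) -> str:
    """Map pillar coverage (0-10) to a qualification tier."""
    if pillars_covered >= 7:
        return "Verified"   # composite is highly representative
    if pillars_covered >= 4:
        return "Estimated"  # directionally useful, gaps exist
    return "Minimal"        # treat composite with caution
```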


SECTION 6

Data Integrity

The TR2 Unified DDP runs every day at 4:15 AM CST. It scrapes all 37 sources, normalizes scores within each source (top model = 100), averages across sources per pillar, then applies the weighted formula. Changes are committed to GitHub and the leaderboard updates automatically.

If a source is down or returns no data, the scraper logs it and continues. Missing sources do not penalize a model — scores are averaged only across sources that returned data for that model.
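A minimal sketch of the normalize-then-average step, assuming each source exposes higher-is-better scores keyed by model name (sources where lower is better, such as cost, would need inverting first):

```python
def normalize(source_scores: dict[str, float]) -> dict[str, float]:
    """Scale one source so its top model scores exactly 100."""
    top = max(source_scores.values())
    return {m: 100.0 * s / top for m, s in source_scores.items()}

def pillar_score(model: str, sources: list[dict[str, float]]) -> float | None:
    """Average normalized scores across only the sources that report
    this model; missing sources are skipped, never counted as zero."""
    seen = [normalize(src)[model] for src in sources if model in src]
    return sum(seen) / len(seen) if seen else None
```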

Model name matching uses a canonical roster with 150+ aliases to ensure variant names all map to the same model.
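
A minimal sketch of the alias resolution; these alias entries are invented examples, not the actual roster:

```python
# Hypothetical alias entries; the real roster has 150+ of these.
CANONICAL = {
    "gpt-4o-2024-08-06": "gpt-4o",
    "GPT-4o (latest)": "gpt-4o",
    "claude-3-5-sonnet-20241022": "claude-3.5-sonnet",
}

def canonical_name(scraped_name: str) -> str:
    """Resolve a scraped variant name to its canonical model ID,
    falling back to a lowercased, trimmed form if unknown."""
    return CANONICAL.get(scraped_name, scraped_name.strip().lower())
```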


SECTION 7

Coverage Bonus

Models appearing in more pillars receive a coverage bonus: +3% per pillar above 5. A model with scores in all 10 pillars gets a 15% bonus. This rewards models that are broadly evaluated rather than cherry-picked on favorable benchmarks.
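
As arithmetic, the bonus is 3 percentage points for each pillar beyond the fifth: max(0, pillars - 5) × 3%. A sketch, assuming the bonus is applied as a multiplicative uplift to the weighted score (the "flat bonus" wording leaves the exact application open):

```python
def coverage_bonus(pillars_covered: int) -> float:
    """+3% per pillar above 5; zero at 5 or fewer pillars."""
    return 0.03 * max(0, pillars_covered - 5)

def final_score(weighted_score: float, pillars_covered: int) -> float:
    # Multiplicative application is an assumption drawn from the
    # "15% bonus" phrasing; an additive flat bump is also plausible.
    return weighted_score * (1.0 + coverage_bonus(pillars_covered))

# All 10 pillars -> 0.03 * 5 = 15% bonus, matching the text above.
assert abs(coverage_bonus(10) - 0.15) < 1e-12
```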

The old system used a dampener formula that penalized sparse models. The new coverage bonus is simpler and more transparent: you get your weighted score plus a flat bonus for breadth.