TRSbench Unified Methodology
One leaderboard. Ten pillars. Every public benchmark that matters.
VERSION 1.3 | MARCH 2026
What is TRSbench Unified?
TRSbench Unified is the single composite AI benchmark tracker that replaced the five separate Training Run leaderboards (TRSbench, TRUscore, TRAgents, TRFcast, TRScode). It tracks a weighted composite score measuring how well AI models perform across 10 pillars of capability, updated daily from 37 independent public sources.
Every score can be traced back to a public leaderboard. Every source link is listed below so anyone can verify the data independently. Transparency is not optional — it is the product.
The TRSbench Unified Formula
| PILLAR | WEIGHT | WHAT IT MEASURES | PRIMARY SOURCES |
|---|---|---|---|
| Safety | 16% | Resistance to harmful prompts, safe behavior under adversarial conditions | HELM Safety, AIR-Bench |
| Truth & Confabulation | 14% | Factual accuracy, hallucination resistance, truthful output | SimpleQA, FACTS, TruthfulQA, HalluHard, Vectara |
| Reasoning | 13% | Logic, abstraction, multi-step problem solving | ARC-AGI-2, LiveBench, HELM Capabilities, HLE |
| Human Preference | 11% | How humans rate model output quality in blind comparisons | Chatbot Arena, AlpacaEval |
| Coding | 11% | Real-world software engineering and code generation | SWE-bench, EvalPlus, LiveCodeBench, BigCodeBench, Terminal-Bench, SciCode + 3 more |
| Agent Capability | 10% | Autonomous task completion, tool use, multi-model coordination | GAIA, OSWorld, tau-bench, MCP Atlas, Galileo, OpenRouter + 1 more |
| Knowledge | 8% | Breadth and depth of factual knowledge | MMLU-Pro, HELM MMLU, SimpleQA |
| Forecasting & Finance | 7% | Prediction accuracy, financial reasoning | ForecastBench, Rallies.ai, Alpha Arena, FinanceArena |
| Efficiency | 5% | Speed, cost, throughput per dollar | Artificial Analysis, PricePerToken |
| Usage & Adoption | 5% | Real-world developer adoption and API traffic | OpenRouter Rankings |
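To make the formula concrete, here is a minimal Python sketch of the weighted composite. It assumes pillar scores arrive pre-normalized to a 0-100 scale, and it renormalizes weights over the pillars a model actually has data in; that renormalization is an assumption on our part, consistent with the rule under Data Integrity that missing data does not penalize a model, not a confirmed implementation detail.

```python
# Sketch only: pillar keys and the renormalization step are illustrative.
WEIGHTS = {
    "safety": 0.16,
    "truth_confabulation": 0.14,
    "reasoning": 0.13,
    "human_preference": 0.11,
    "coding": 0.11,
    "agent_capability": 0.10,
    "knowledge": 0.08,
    "forecasting_finance": 0.07,
    "efficiency": 0.05,
    "usage_adoption": 0.05,
}

def composite(pillar_scores: dict[str, float]) -> float:
    """Weighted average over the pillars a model actually has scores in."""
    present = {p: w for p, w in WEIGHTS.items() if p in pillar_scores}
    total_weight = sum(present.values())
    return sum(pillar_scores[p] * w for p, w in present.items()) / total_weight
```

For example, a model with only a 92 in Safety and an 88 in Coding would score (92 × 0.16 + 88 × 0.11) / 0.27 ≈ 90.4 before tier labeling and the coverage bonus described below.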
Why These Weights?
Safety 16%
The highest weight because a model that is unsafe is unusable regardless of capability. Safety is the floor, not a feature.
Truth & Confabulation 14%
A model that confidently fabricates information is dangerous in professional contexts. Factual reliability is second only to safety.
Reasoning 13%
Abstract reasoning and multi-step logic are the core differentiators between models. This is what makes a model useful for hard problems.
Human Preference 11%
Blind human evaluations capture quality dimensions that benchmarks miss: tone, helpfulness, instruction-following, and overall utility.
Coding 11%
Software engineering is the single highest-value use case for LLMs today. Nine sub-benchmarks ensure coverage across generation, debugging, and real-world repo tasks.
Agent Capability 10%
Autonomous agents are the future of AI deployment. Task completion, tool reliability, and multi-model coordination measure readiness for production agent workflows.
Knowledge 8%
Broad factual knowledge matters but is increasingly commoditized. Weighted lower than reasoning because retrieval augmentation can compensate for knowledge gaps.
Forecasting & Finance 7%
Prediction and financial reasoning are high-value verticals that test a model's ability to reason about uncertainty and real-world outcomes.
Efficiency 5%
Speed and cost matter for production deployment. A model that is 10x more expensive for 5% better results loses in practice.
Usage & Adoption 5%
Developer adoption is a market signal. The models developers actually choose in production reveal qualities that benchmarks alone cannot capture.
All 37 Sources
Every source below is scraped daily by the TR2 Unified DDP. Click any link to see the original leaderboard and verify the data yourself. Sources that feed more than one pillar are listed under each, so the 40 entries below resolve to 37 unique sources.
Safety 16% — 2 sources
- HELM Safety crfm.stanford.edu →
- AIR-Bench crfm.stanford.edu →
Truth & Confabulation 14% — 5 sources
- SimpleQA llm-stats.com →
- FACTS Benchmark (Google/Kaggle) kaggle.com →
- TruthfulQA llm-stats.com →
- HalluHard halluhard.com →
- Vectara Hallucination Leaderboard huggingface.co →
Reasoning 13% — 4 sources
- ARC-AGI-2 arcprize.org →
- LiveBench Reasoning livebench.ai →
- HELM Capabilities crfm.stanford.edu →
- HLE (Humanity's Last Exam) lastexam.ai →
Human Preference 11% — 3 sources
- Chatbot Arena Overall arena.ai →
- Chatbot Arena Text arena.ai/text →
- AlpacaEval tatsu-lab.github.io →
Coding 11% — 9 sources
- SWE-bench Verified swebench.com →
- EvalPlus evalplus.github.io →
- LiveCodeBench livecodebench.github.io →
- SWE-rebench swe-rebench.com →
- BigCodeBench bigcode-bench.github.io →
- Terminal-Bench Hard tbench.ai →
- SWE-bench Pro (Scale AI) scale.com →
- SciCode scicode-bench.github.io →
- Chatbot Arena Code arena.ai/code →
Agent Capability 10% — 7 sources
- SWE-bench Verified (task completion) swebench.com →
- GAIA hal.cs.princeton.edu →
- OSWorld os-world.github.io →
- tau-bench taubench.com →
- MCP Atlas (Scale AI) scale.com →
- Galileo Agent Leaderboard huggingface.co →
- OpenRouter Rankings (multi-model) openrouter.ai →
Knowledge 8% — 3 sources
- MMLU-Pro huggingface.co →
- HELM MMLU crfm.stanford.edu →
- SimpleQA (Knowledge) llm-stats.com →
Forecasting & Finance 7% — 4 sources
- ForecastBench forecastbench.org →
- Rallies.ai rallies.ai →
- Alpha Arena (Nof1) nof1.ai →
- FinanceArena financearena.ai →
Efficiency 5% — 2 sources
- Artificial Analysis artificialanalysis.ai →
- PricePerToken pricepertoken.com →
Usage & Adoption 5% — 1 source
- OpenRouter Rankings openrouter.ai →
Qualification Rules
Models are assigned one of three tiers based on data coverage:
Verified
Scores in 7+ of 10 pillars. Composite score is highly representative.
Estimated
Scores in 4-6 pillars. Composite is directionally useful but gaps exist.
Minimal
Scores in 3 or fewer pillars. Listed for tracking but treat composite with caution.
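In code, tier assignment reduces to a threshold check on how many pillars returned a score. A minimal sketch (function name is illustrative):

```python
def qualification_tier(pillar_count: int) -> str:
    """Map pillar coverage (0-10) to a qualification tier."""
    if pillar_count >= 7:
        return "Verified"   # composite is highly representative
    if pillar_count >= 4:
        return "Estimated"  # directionally useful, but gaps exist
    return "Minimal"        # treat the composite with caution
```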
Data Integrity
The TR2 Unified DDP runs every day at 4:15 AM CST. It scrapes all 37 sources, normalizes scores within each source (top model = 100), averages across sources per pillar, then applies the weighted formula. Changes are committed to GitHub and the leaderboard updates automatically.
If a source is down or returns no data, the scraper logs it and continues. Missing sources do not penalize a model — scores are averaged only across sources that returned data for that model.
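Putting the two preceding paragraphs together, the per-pillar scoring might look like the sketch below. It assumes every source is higher-is-better; in practice, lower-is-better metrics (hallucination rate, price per token) would need to be inverted before this step.

```python
def normalize_source(raw: dict[str, float]) -> dict[str, float]:
    """Rescale one source's raw scores so its top model sits at 100."""
    if not raw:  # source was down or returned nothing: skip it entirely
        return {}
    top = max(raw.values())
    return {model: 100.0 * score / top for model, score in raw.items()}

def pillar_score(model: str, sources: list[dict[str, float]]) -> float | None:
    """Average a model's normalized scores across only the sources
    that returned data for it; absent sources are simply skipped."""
    scores = [normalized[model]
              for normalized in map(normalize_source, sources)
              if model in normalized]
    return sum(scores) / len(scores) if scores else None
```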
Model name matching uses a canonical roster with 150+ aliases to ensure variant names all map to the same model.
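The matching itself can be as simple as a lookup into that roster. The aliases below are hypothetical placeholders, not entries from the real roster:

```python
# Hypothetical excerpt of the canonical roster (the real one has 150+ aliases).
CANONICAL = {
    "acme-large-2026-01-15": "acme-large",
    "acme-large (preview)": "acme-large",
    "acme large": "acme-large",
}

def canonical_name(raw_name: str) -> str:
    """Resolve a source-specific variant to its canonical model name."""
    key = raw_name.strip().lower()
    return CANONICAL.get(key, key)
```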
Coverage Bonus
Models appearing in more pillars receive a coverage bonus: +3% for each pillar beyond the fifth, so a model with scores in all 10 pillars gets a 15% bonus. This rewards models that are broadly evaluated rather than showcased only on their most favorable benchmarks.
The old system used a dampener formula that penalized sparse models. The new coverage bonus is simpler and more transparent: you get your weighted score plus a fixed percentage bonus per extra pillar of breadth.
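The sketch below assumes a multiplicative reading of the bonus (+3% of the composite per extra pillar), which matches the stated 15% figure for full coverage:

```python
def apply_coverage_bonus(composite: float, pillar_count: int) -> float:
    """Boost the weighted composite by 3% per pillar beyond the fifth.

    Full coverage (10 pillars) yields composite * 1.15; five or
    fewer pillars leave the score unchanged.
    """
    extra_pillars = max(0, pillar_count - 5)
    return composite * (1 + 0.03 * extra_pillars)
```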