TRSbench Unified Methodology
One leaderboard. Ten pillars. Every public benchmark that matters.
VERSION 1.3 | MARCH 2026
What is TRSbench Unified?
TRSbench Unified is the single composite AI benchmark tracker that replaced the five separate Training Run leaderboards (TRSbench, TRUscore, TRAgents, TRFcast, TRScode). It tracks a weighted composite score measuring how well AI models perform across 10 pillars of capability, updated daily from 37 independent public sources.
Every score can be traced back to a public leaderboard. Every source link is listed below so anyone can verify the data independently. Transparency is not optional — it is the product.
The TRSbench Unified Formula
| PILLAR | WEIGHT | WHAT IT MEASURES | PRIMARY SOURCES |
|---|---|---|---|
| Safety | 16% | Resistance to harmful prompts, safe behavior under adversarial conditions | HELM Safety, AIR-Bench |
| Truth & Confabulation | 14% | Factual accuracy, hallucination resistance, truthful output | SimpleQA, FACTS, TruthfulQA, HalluHard, Vectara |
| Reasoning | 13% | Logic, abstraction, multi-step problem solving | ARC-AGI-2, LiveBench, HELM Capabilities, HLE |
| Human Preference | 11% | How humans rate model output quality in blind comparisons | Chatbot Arena, AlpacaEval |
| Coding | 11% | Real-world software engineering and code generation | SWE-bench, EvalPlus, LiveCodeBench, BigCodeBench, Terminal-Bench, SciCode + 3 more |
| Agent Capability | 10% | Autonomous task completion, tool use, multi-model coordination | GAIA, OSWorld, tau-bench, MCP Atlas, Galileo, OpenRouter + 1 more |
| Knowledge | 8% | Breadth and depth of factual knowledge | MMLU-Pro, HELM MMLU, SimpleQA |
| Forecasting & Finance | 7% | Prediction accuracy, financial reasoning | ForecastBench, Rallies.ai, Alpha Arena, FinanceArena |
| Efficiency | 5% | Speed, cost, throughput per dollar | Artificial Analysis, PricePerToken |
| Usage & Adoption | 5% | Real-world developer adoption and API traffic | OpenRouter Rankings |
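To make the formula concrete, here is a minimal Python sketch of the weighted composite. It assumes pillar scores arrive pre-normalized to a 0-100 scale, and it renormalizes weights over the pillars a model actually has data in; that renormalization is an assumption on our part, consistent with the rule under Data Integrity that missing data does not penalize a model, not a confirmed implementation detail.

```python
# Sketch only: pillar keys and the renormalization step are illustrative.
WEIGHTS = {
    "safety": 0.16,
    "truth_confabulation": 0.14,
    "reasoning": 0.13,
    "human_preference": 0.11,
    "coding": 0.11,
    "agent_capability": 0.10,
    "knowledge": 0.08,
    "forecasting_finance": 0.07,
    "efficiency": 0.05,
    "usage_adoption": 0.05,
}

def composite(pillar_scores: dict[str, float]) -> float:
    """Weighted average over the pillars a model actually has scores in."""
    present = {p: w for p, w in WEIGHTS.items() if p in pillar_scores}
    total_weight = sum(present.values())
    return sum(pillar_scores[p] * w for p, w in present.items()) / total_weight
```

For example, a model with only a 92 in Safety and an 88 in Coding would score (92 × 0.16 + 88 × 0.11) / 0.27 ≈ 90.4 before tier labeling and the coverage bonus described below.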
Why These Weights?
Safety 16%
The highest weight because a model that is unsafe is unusable regardless of capability. Safety is the floor, not a feature.
Truth & Confabulation 14%
A model that confidently fabricates information is dangerous in professional contexts. Factual reliability is second only to safety.
Reasoning 13%
Abstract reasoning and multi-step logic are the core differentiators between models. This is what makes a model useful for hard problems.
Human Preference 11%
Blind human evaluations capture quality dimensions that benchmarks miss: tone, helpfulness, instruction-following, and overall utility.
Coding 11%
Software engineering is the single highest-value use case for LLMs today. Nine sub-benchmarks ensure coverage across generation, debugging, and real-world repo tasks.
Agent Capability 10%
Autonomous agents are the future of AI deployment. Task completion, tool reliability, and multi-model coordination measure readiness for production agent workflows.
Knowledge 8%
Broad factual knowledge matters but is increasingly commoditized. Weighted lower than reasoning because retrieval augmentation can compensate for knowledge gaps.
Forecasting & Finance 7%
Prediction and financial reasoning are high-value verticals that test a model's ability to reason about uncertainty and real-world outcomes.
Efficiency 5%
Speed and cost matter for production deployment. A model that is 10x more expensive for 5% better results loses in practice.
Usage & Adoption 5%
Developer adoption is a market signal. The models developers actually choose in production reveal qualities that benchmarks alone cannot capture.
All 37 Sources
Every source below is scraped daily by the TR2 Unified DDP. Click any link to see the original leaderboard and verify the data yourself. Sources that feed more than one pillar are listed under each, so the 40 entries below resolve to 37 unique sources.
Safety 16% — 2 sources
- HELM Safety crfm.stanford.edu →
- AIR-Bench crfm.stanford.edu →
Truth & Confabulation 14% — 5 sources
- SimpleQA llm-stats.com →
- FACTS Benchmark (Google/Kaggle) kaggle.com →
- TruthfulQA llm-stats.com →
- HalluHard halluhard.com →
- Vectara Hallucination Leaderboard huggingface.co →
Reasoning 13% — 4 sources
- ARC-AGI-2 arcprize.org →
- LiveBench Reasoning livebench.ai →
- HELM Capabilities crfm.stanford.edu →
- HLE (Humanity's Last Exam) lastexam.ai →
Human Preference 11% — 3 sources
- Chatbot Arena Overall arena.ai →
- Chatbot Arena Text arena.ai/text →
- AlpacaEval tatsu-lab.github.io →
Coding 11% — 9 sources
- SWE-bench Verified swebench.com →
- EvalPlus evalplus.github.io →
- LiveCodeBench livecodebench.github.io →
- SWE-rebench swe-rebench.com →
- BigCodeBench bigcode-bench.github.io →
- Terminal-Bench Hard tbench.ai →
- SWE-bench Pro (Scale AI) scale.com →
- SciCode scicode-bench.github.io →
- Chatbot Arena Code arena.ai/code →
Agent Capability 10% — 7 sources
- SWE-bench Verified (task completion) swebench.com →
- GAIA hal.cs.princeton.edu →
- OSWorld os-world.github.io →
- tau-bench taubench.com →
- MCP Atlas (Scale AI) scale.com →
- Galileo Agent Leaderboard huggingface.co →
- OpenRouter Rankings (multi-model) openrouter.ai →
Knowledge 8% — 3 sources
- MMLU-Pro huggingface.co →
- HELM MMLU crfm.stanford.edu →
- SimpleQA (Knowledge) llm-stats.com →
Forecasting & Finance 7% — 4 sources
- ForecastBench forecastbench.org →
- Rallies.ai rallies.ai →
- Alpha Arena (Nof1) nof1.ai →
- FinanceArena financearena.ai →
Efficiency 5% — 2 sources
- Artificial Analysis artificialanalysis.ai →
- PricePerToken pricepertoken.com →
Usage & Adoption 5% — 1 source
- OpenRouter Rankings openrouter.ai →
Qualification Rules
Models are assigned one of three tiers based on data coverage:
Verified
Scores in 7+ of 10 pillars. Composite score is highly representative.
Estimated
Scores in 4-6 pillars. Composite is directionally useful but gaps exist.
Minimal
Scores in 3 or fewer pillars. Listed for tracking but treat composite with caution.
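In code, tier assignment reduces to a threshold check on how many pillars returned a score. A minimal sketch (function name is illustrative):

```python
def qualification_tier(pillar_count: int) -> str:
    """Map pillar coverage (0-10) to a qualification tier."""
    if pillar_count >= 7:
        return "Verified"   # composite is highly representative
    if pillar_count >= 4:
        return "Estimated"  # directionally useful, but gaps exist
    return "Minimal"        # treat the composite with caution
```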
Data Integrity
The TR2 Unified DDP runs every day at 4:15 AM CST. It scrapes all 37 sources, normalizes scores within each source (top model = 100), averages across sources per pillar, then applies the weighted formula. Changes are committed to GitHub and the leaderboard updates automatically.
If a source is down or returns no data, the scraper logs it and continues. Missing sources do not penalize a model — scores are averaged only across sources that returned data for that model.
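Putting the two preceding paragraphs together, the per-pillar scoring might look like the sketch below. It assumes every source is higher-is-better; in practice, lower-is-better metrics (hallucination rate, price per token) would need to be inverted before this step.

```python
def normalize_source(raw: dict[str, float]) -> dict[str, float]:
    """Rescale one source's raw scores so its top model sits at 100."""
    if not raw:  # source was down or returned nothing: skip it entirely
        return {}
    top = max(raw.values())
    return {model: 100.0 * score / top for model, score in raw.items()}

def pillar_score(model: str, sources: list[dict[str, float]]) -> float | None:
    """Average a model's normalized scores across only the sources
    that returned data for it; absent sources are simply skipped."""
    scores = [normalized[model]
              for normalized in map(normalize_source, sources)
              if model in normalized]
    return sum(scores) / len(scores) if scores else None
```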
Model name matching uses a canonical roster with 150+ aliases to ensure variant names all map to the same model.
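The matching itself can be as simple as a lookup into that roster. The aliases below are hypothetical placeholders, not entries from the real roster:

```python
# Hypothetical excerpt of the canonical roster (the real one has 150+ aliases).
CANONICAL = {
    "acme-large-2026-01-15": "acme-large",
    "acme-large (preview)": "acme-large",
    "acme large": "acme-large",
}

def canonical_name(raw_name: str) -> str:
    """Resolve a source-specific variant to its canonical model name."""
    key = raw_name.strip().lower()
    return CANONICAL.get(key, key)
```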
Coverage Bonus
Models appearing in more pillars receive a coverage bonus: +3% for each pillar beyond the fifth, so a model with scores in all 10 pillars gets a 15% bonus. This rewards models that are broadly evaluated rather than showcased only on their most favorable benchmarks.
The old system used a dampener formula that penalized sparse models. The new coverage bonus is simpler and more transparent: you get your weighted score plus a fixed percentage bonus per extra pillar of breadth.
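The sketch below assumes a multiplicative reading of the bonus (+3% of the composite per extra pillar), which matches the stated 15% figure for full coverage:

```python
def apply_coverage_bonus(composite: float, pillar_count: int) -> float:
    """Boost the weighted composite by 3% per pillar beyond the fifth.

    Full coverage (10 pillars) yields composite * 1.15; five or
    fewer pillars leave the score unchanged.
    """
    extra_pillars = max(0, pillar_count - 5)
    return composite * (1 + 0.03 * extra_pillars)
```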