TRS Methodology
The Training Run Score (TRS) is a composite metric designed to provide a clear, comparable measure of AI model capabilities. This page documents our complete methodology with full source citations.
Last Updated: January 24, 2026
Executive Summary
The TRS aggregates performance data from established benchmarks into a single 0-100 score. We weight six dimensions based on their relevance to real-world AI utility:
TRS = (R × 0.25) + (C × 0.25) + (H × 0.20) + (K × 0.15) + (E × 0.10) + (S × 0.05)
Where R = Reasoning, C = Coding, H = Human Preference, K = Knowledge, E = Efficiency, S = Safety
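A minimal sketch of the composite, assuming each of the six dimension scores has already been computed on the 0-100 scale described in the sections below. The dimension scores in the example are made up for illustration.

```python
# Minimal sketch of the TRS composite. Assumes each dimension score has
# already been normalized to 0-100 as described in the sections below.

TRS_WEIGHTS = {
    "reasoning": 0.25,         # R
    "coding": 0.25,            # C
    "human_preference": 0.20,  # H
    "knowledge": 0.15,         # K
    "efficiency": 0.10,        # E
    "safety": 0.05,            # S
}

def trs(scores: dict[str, float]) -> float:
    """Weighted sum of the six dimension scores (each on a 0-100 scale)."""
    return sum(weight * scores[dim] for dim, weight in TRS_WEIGHTS.items())

# Hypothetical dimension scores, for illustration only:
example = {
    "reasoning": 82.0,
    "coding": 75.0,
    "human_preference": 90.0,
    "knowledge": 88.0,
    "efficiency": 60.0,
    "safety": 70.0,
}
print(round(trs(example), 2))  # roughly 79.95 for these made-up inputs
```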
1. Reasoning & Logic (25%)
What We Measure
The ability to solve novel problems requiring multi-step reasoning, logical deduction, and genuine understanding—not pattern matching from training data.
Primary Data Sources
ARC-AGI-2 (ARC Prize Foundation)
The Abstraction and Reasoning Corpus tests novel reasoning on tasks specifically designed to require thinking through new problems. Unlike traditional benchmarks, ARC tasks cannot be solved through memorization.
- Current best baseline: 31% accuracy
- With refinement loops: 54% accuracy
- Human average: 60% accuracy
GPQA Diamond (NYU, Anthropic, et al.)
Graduate-level science questions written by PhD experts, validated to be answerable by domain experts but challenging for non-experts.
Source: arXiv:2311.12022
MATH (Hendrycks et al.)
Competition-level mathematics problems from AMC, AIME, and Olympiad competitions requiring multi-step mathematical reasoning.
Source: arXiv:2103.03874
Score Calculation
We normalize each benchmark to a 0-100 scale where 100 represents the current state-of-the-art performance. The Reasoning score is a weighted average:
R = (ARC × 0.40) + (GPQA × 0.35) + (MATH × 0.25)
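The sketch below illustrates this two-step calculation (normalize against state of the art, then take the weighted average), assuming raw benchmark accuracies and current state-of-the-art accuracies are available as inputs. All figures in the example are placeholders, not measurements. The Coding (C) and Knowledge (K) sub-scores follow the same normalize-then-weight pattern with the weights given in their own sections.

```python
# Sketch of the Reasoning sub-score. Assumes raw benchmark accuracies and the
# current state-of-the-art (SOTA) accuracy on each benchmark are known.
# All numbers in the example are placeholders, not real results.

REASONING_WEIGHTS = {"arc": 0.40, "gpqa": 0.35, "math": 0.25}

def normalize_to_sota(raw: float, sota: float) -> float:
    """Rescale a raw accuracy so that the state of the art maps to 100."""
    return min(100.0, 100.0 * raw / sota)

def reasoning_score(raw: dict[str, float], sota: dict[str, float]) -> float:
    """Weighted average of SOTA-normalized benchmark scores."""
    return sum(
        weight * normalize_to_sota(raw[bench], sota[bench])
        for bench, weight in REASONING_WEIGHTS.items()
    )

raw_accuracy = {"arc": 40.0, "gpqa": 65.0, "math": 80.0}   # hypothetical model
sota_accuracy = {"arc": 54.0, "gpqa": 70.0, "math": 90.0}  # hypothetical SOTA
print(round(reasoning_score(raw_accuracy, sota_accuracy), 1))
```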
2. Coding Proficiency (25%)
What We Measure
Real-world programming ability: writing functional code, debugging, understanding codebases, and solving actual software engineering problems.
Primary Data Sources
SWE-Bench Verified (Princeton, OpenAI)
Real GitHub issues from popular Python repositories. Models must understand the codebase, identify the bug, and generate a working fix. The human-verified subset screens out under-specified or unsolvable issues.
Source: swebench.com
HumanEval (OpenAI)
164 hand-written programming problems testing code generation from docstrings. Measures functional correctness via unit tests.
Source: arXiv:2107.03374
MBPP (Google)
974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers, testing basic programming proficiency.
Source: arXiv:2108.07732
Score Calculation
C = (SWE-Bench × 0.50) + (HumanEval × 0.30) + (MBPP × 0.20)
3. Human Preference (20%)
What We Measure
What do actual users prefer when comparing model outputs? This captures qualities that benchmarks miss: helpfulness, clarity, tone, and overall satisfaction.
Primary Data Source
LMSYS Chatbot Arena (UC Berkeley)
The gold standard for human preference evaluation. Users engage in blind side-by-side comparisons, voting for the better response without knowing which model produced it. Over 1 million votes collected.
- Methodology: Elo rating system (like chess)
- Sample size: 1,000,000+ human votes
- Blind comparison: Users don't know which model is which
Score Calculation
We convert Elo ratings to a 0-100 scale using min-max scaling: the highest-rated model in our comparison set scores 100, the lowest-rated model scores 0, and every other model falls proportionally in between according to its Elo rating.
H = ((Model_Elo - Min_Elo) / (Max_Elo - Min_Elo)) × 100
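A minimal sketch of this min-max conversion, using made-up Elo ratings rather than actual Arena data:

```python
# Sketch of the Elo-to-score conversion: min-max scaling over the set of
# models being compared. The Elo ratings below are illustrative only.

def human_preference_score(model_elo: float, all_elos: list[float]) -> float:
    """Map an Elo rating to 0-100, where the top-rated model scores 100."""
    lowest, highest = min(all_elos), max(all_elos)
    return 100.0 * (model_elo - lowest) / (highest - lowest)

leaderboard_elos = [1150.0, 1210.0, 1255.0, 1280.0]  # hypothetical Arena Elos
print(round(human_preference_score(1255.0, leaderboard_elos), 1))  # 80.8
```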
4. Knowledge & Comprehension (15%)
What We Measure
Breadth and depth of factual knowledge across academic domains, plus the ability to comprehend and reason about complex information.
Primary Data Sources
MMLU (Hendrycks et al.)
Massive Multitask Language Understanding: 57 subjects ranging from STEM to humanities, from elementary to professional level. 14,000+ questions.
Source: arXiv:2009.03300
TruthfulQA (Lin et al.)
Tests whether models generate truthful answers to questions where humans might be tempted to give false but popular answers.
Source: arXiv:2109.07958
Score Calculation
K = (MMLU × 0.70) + (TruthfulQA × 0.30)
5. Efficiency & Cost (10%)
What We Measure
Performance per dollar. A model that achieves 90% of the best model's performance at 10% of the cost provides significant value—we quantify this.
Data Sources
API Pricing (Official Provider Documentation)
We track official API pricing from OpenAI, Anthropic, Google, and other providers. Prices are recorded at time of scoring.
- OpenAI: openai.com/pricing
- Anthropic: anthropic.com/pricing
- Google: cloud.google.com/vertex-ai/pricing
Score Calculation
We calculate a capability-adjusted cost score:
E = (Average_Benchmark_Score / Cost_Per_1M_Tokens) × Normalization_Factor
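The Normalization_Factor is not specified further on this page; the sketch below assumes one simple choice, rescaling so that the best performance-per-dollar ratio in the comparison set maps to 100. All scores and prices are illustrative, not real pricing data.

```python
# Sketch of the capability-adjusted cost score. The normalization step is an
# assumption: the best performance-per-dollar ratio in the set maps to 100.
# All scores and prices below are hypothetical.

def efficiency_scores(models: dict[str, tuple[float, float]]) -> dict[str, float]:
    """models maps name -> (average benchmark score, cost per 1M tokens in USD)."""
    ratios = {name: score / cost for name, (score, cost) in models.items()}
    best_ratio = max(ratios.values())
    return {name: 100.0 * ratio / best_ratio for name, ratio in ratios.items()}

models = {
    "model_a": (90.0, 15.00),  # strong but expensive
    "model_b": (80.0, 1.50),   # slightly weaker, far cheaper per token
}
print(efficiency_scores(models))  # model_b scores 100.0, model_a roughly 11
```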
6. Safety & Reliability (5%)
What We Measure
Consistency, resistance to jailbreaks, refusal of harmful requests, and overall trustworthiness. Models that hallucinate or behave unpredictably are penalized.
Data Sources
Model Provider Safety Reports
We incorporate published safety evaluations from model providers where available, noting that these are self-reported.
Independent Red Team Evaluations
Where third-party safety evaluations exist (academic papers, independent audits), we incorporate these findings.
Note on Safety Scoring
Safety evaluation is an evolving field. Our safety scores should be considered directional indicators rather than comprehensive assessments. We update our methodology as better evaluation frameworks emerge.
Important Limitations
What TRS Does NOT Measure
- Future capabilities: TRS measures current performance, not trajectory or potential
- AGI proximity: We make no claims about artificial general intelligence
- Real-world deployment: Benchmark performance may not reflect production use
- Specialized tasks: Domain-specific applications may vary significantly
Known Biases & Limitations
- English-language bias in most benchmarks
- Potential training data contamination (models may have seen benchmark questions)
- Benchmarks may not capture emerging capabilities
- Self-reported safety data from providers
We are committed to transparency about our methodology's limitations. If you identify issues or have suggestions for improvement, contact us.
Update Frequency
TRS scores are updated weekly, typically published on Mondays. When benchmark data sources update or new models are released, we incorporate the changes in the next weekly update.
Full Citation List
Academic Papers
- Chollet, F. (2019). "On the Measure of Intelligence." arXiv:1911.01547
- Hendrycks, D., et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021
- Chen, M., et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374
- Jimenez, C.E., et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024
- Rein, D., et al. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022
- Chiang, W., et al. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." arXiv:2403.04132
- Lin, S., et al. (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods." ACL 2022
- Hendrycks, D., et al. (2021). "Measuring Mathematical Problem Solving With the MATH Dataset." NeurIPS 2021
- Austin, J., et al. (2021). "Program Synthesis with Large Language Models." arXiv:2108.07732
Evaluation Platforms
- LMSYS Chatbot Arena: https://chat.lmsys.org/
- ARC Prize: https://arcprize.org/
- SWE-Bench: https://www.swebench.com/
- Papers With Code: https://paperswithcode.com/