Executive Summary

The TRS aggregates performance data from established benchmarks into a single 0-100 score. We weight six dimensions based on their relevance to real-world AI utility:

TRS = (R × 0.25) + (C × 0.25) + (H × 0.20) + (K × 0.15) + (E × 0.10) + (S × 0.05)

Where R = Reasoning, C = Coding, H = Human Preference, K = Knowledge, E = Efficiency, S = Safety
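
For concreteness, the aggregation can be sketched in a few lines of Python. This is an illustrative sketch, not our production implementation; it assumes each dimension score has already been computed on a 0-100 scale, and the names are placeholders.

```python
# Illustrative sketch of the TRS aggregation, not the production implementation.
# Assumes each dimension score has already been normalized to a 0-100 scale.

TRS_WEIGHTS = {
    "reasoning": 0.25,
    "coding": 0.25,
    "human_preference": 0.20,
    "knowledge": 0.15,
    "efficiency": 0.10,
    "safety": 0.05,
}

def trs(scores: dict[str, float]) -> float:
    """Weighted sum of the six dimension scores (each on a 0-100 scale)."""
    return sum(weight * scores[dim] for dim, weight in TRS_WEIGHTS.items())

# Hypothetical dimension scores for a single model:
example = {
    "reasoning": 90.0,
    "coding": 80.0,
    "human_preference": 85.0,
    "knowledge": 70.0,
    "efficiency": 60.0,
    "safety": 50.0,
}
print(round(trs(example), 2))  # 78.5
```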

1. Reasoning & Logic (25%)

What We Measure

The ability to solve novel problems requiring multi-step reasoning, logical deduction, and genuine understanding—not pattern matching from training data.

Primary Data Sources

ARC-AGI-2 (ARC Prize Foundation)

The Abstraction and Reasoning Corpus tests the ability to work through genuinely novel problems. Unlike traditional benchmarks, ARC tasks are constructed so they cannot be solved by memorizing patterns from training data.

  • Current best baseline: 31% accuracy
  • With refinement loops: 54% accuracy
  • Human average: 60% accuracy
Source: arcprize.org

GPQA Diamond (NYU, Anthropic, et al.)

Graduate-level science questions written by PhD experts, validated as answerable by domain experts yet difficult even for skilled non-experts with internet access.

Source: arXiv:2311.12022

MATH (Hendrycks et al.)

Competition-level mathematics problems from AMC, AIME, and Olympiad competitions, requiring multi-step mathematical reasoning.

Source: arXiv:2103.03874

Score Calculation

We normalize each benchmark to a 0-100 scale where 100 represents the current state-of-the-art performance. The Reasoning score is a weighted average:

R = (ARC × 0.40) + (GPQA × 0.35) + (MATH × 0.25)
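
A minimal sketch of this normalize-then-weight step is below; the same pattern, with different weights, gives the Coding (C) and Knowledge (K) scores in later sections. The helper names, the clamping choice, and all example numbers are illustrative assumptions, not actual leaderboard values.

```python
# Sketch of the normalize-then-weight pattern behind R (the same pattern, with
# different weights, gives C and K). Helper names, the clamp, and all numbers
# below are illustrative assumptions, not leaderboard values.

def normalize_to_sota(raw: float, sota: float) -> float:
    """Rescale a raw benchmark accuracy so the current state of the art maps to 100."""
    return min(raw / sota, 1.0) * 100  # clamp in case a new result exceeds the recorded SOTA

REASONING_WEIGHTS = {"arc": 0.40, "gpqa": 0.35, "math": 0.25}

def reasoning_score(raw: dict[str, float], sota: dict[str, float]) -> float:
    """Weighted average of SOTA-normalized benchmark accuracies."""
    return sum(
        weight * normalize_to_sota(raw[name], sota[name])
        for name, weight in REASONING_WEIGHTS.items()
    )

# Hypothetical raw accuracies (%) and SOTA reference points:
raw = {"arc": 40.0, "gpqa": 60.0, "math": 80.0}
sota = {"arc": 54.0, "gpqa": 70.0, "math": 90.0}
print(round(reasoning_score(raw, sota), 1))  # ≈ 81.9
```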

2. Coding Proficiency (25%)

What We Measure

Real-world programming ability: writing functional code, debugging, understanding codebases, and solving actual software engineering problems.

Primary Data Sources

SWE-Bench Verified (Princeton, OpenAI)

Real GitHub issues from popular Python repositories. Models must understand the codebase, identify the bug, and generate a working fix. The Verified subset is human-screened to ensure each issue is well specified and fairly testable.

Source: swebench.com

HumanEval (OpenAI)

164 hand-written programming problems testing code generation from docstrings. Measures functional correctness via unit tests.

Source: arXiv:2107.03374

MBPP (Google)

974 crowd-sourced Python programming problems designed to be solvable by entry-level programmers, testing basic programming proficiency.

Source: arXiv:2108.07732

Score Calculation

C = (SWE-Bench × 0.50) + (HumanEval × 0.30) + (MBPP × 0.20)

3. Human Preference (20%)

What We Measure

What do actual users prefer when comparing model outputs? This captures qualities that benchmarks miss: helpfulness, clarity, tone, and overall satisfaction.

Primary Data Source

LMSYS Chatbot Arena (UC Berkeley)

The gold standard for human preference evaluation. Users engage in blind side-by-side comparisons, voting for the better response without knowing which model produced it. Over 1 million votes collected.

  • Methodology: Elo rating system (like chess)
  • Sample size: 1,000,000+ human votes
  • Blind comparison: Users don't know which model is which
Source: LMSYS.org; methodology paper: arXiv:2403.04132

Score Calculation

We convert Elo ratings to a 0-100 scale: the highest-rated model receives 100, the lowest-rated receives 0, and the rest are placed proportionally by Elo difference.

H = ((Model_Elo - Min_Elo) / (Max_Elo - Min_Elo)) × 100
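
A minimal sketch of this min-max rescaling follows; the Elo values in the example are hypothetical, not actual Arena ratings.

```python
# Minimal sketch of the Elo-to-0-100 rescaling above; the Elo values in the
# example are hypothetical, not actual Arena ratings.

def human_preference_score(model_elo: float, min_elo: float, max_elo: float) -> float:
    """Min-max rescale an Elo rating: the top-rated model maps to 100, the lowest to 0."""
    return (model_elo - min_elo) / (max_elo - min_elo) * 100

print(round(human_preference_score(model_elo=1220, min_elo=1000, max_elo=1280), 1))  # 78.6
```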

4. Knowledge & Comprehension (15%)

What We Measure

Breadth and depth of factual knowledge across academic domains, plus the ability to comprehend and reason about complex information.

Primary Data Sources

MMLU (Hendrycks et al.)

Massive Multitask Language Understanding: 57 subjects ranging from STEM to humanities, from elementary to professional level. 14,000+ questions.

Source: arXiv:2009.03300

TruthfulQA (Lin et al.)

Tests whether models generate truthful answers to questions where humans might be tempted to give false but popular answers.

Source: arXiv:2109.07958

Score Calculation

K = (MMLU × 0.70) + (TruthfulQA × 0.30)

5. Efficiency & Cost (10%)

What We Measure

Performance per dollar. A model that achieves 90% of the best model's performance at 10% of the cost provides significant value—we quantify this.

Data Sources

API Pricing (Official Provider Documentation)

We track official API pricing from OpenAI, Anthropic, Google, and other providers. Prices are recorded at time of scoring.

Score Calculation

We calculate a capability-adjusted cost score:

E = (Average_Benchmark_Score / Cost_Per_1M_Tokens) × Normalization_Factor
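
The normalization factor is not pinned down above; the sketch below assumes it rescales the best observed performance-per-dollar ratio in the comparison set to 100, so the most cost-effective model scores 100. All model names, scores, and prices are hypothetical.

```python
# Sketch of the capability-adjusted cost score. The normalization factor is
# assumed here to rescale the best performance-per-dollar ratio in the
# comparison set to 100; all model names, scores, and prices are hypothetical.

def efficiency_scores(models: dict[str, tuple[float, float]]) -> dict[str, float]:
    """models maps name -> (average_benchmark_score, cost_per_1m_tokens_usd)."""
    ratios = {name: score / cost for name, (score, cost) in models.items()}
    best = max(ratios.values())
    return {name: ratio / best * 100 for name, ratio in ratios.items()}

models = {
    "model_a": (85.0, 10.0),  # strong but expensive
    "model_b": (75.0, 1.0),   # slightly weaker, far cheaper
}
print(efficiency_scores(models))  # model_b scores 100, model_a about 11.3
```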

6. Safety & Reliability (5%)

What We Measure

Consistency, resistance to jailbreaks, refusal of harmful requests, and overall trustworthiness. Models that hallucinate or behave unpredictably are penalized.

Data Sources

Model Provider Safety Reports

We incorporate published safety evaluations from model providers where available, noting that these are self-reported.

Independent Red Team Evaluations

Where third-party safety evaluations exist (academic papers, independent audits), we incorporate these findings.

Note on Safety Scoring

Safety evaluation is an evolving field. Our safety scores should be considered directional indicators rather than comprehensive assessments. We update our methodology as better evaluation frameworks emerge.

Important Limitations

What TRS Does NOT Measure

  • Future capabilities: TRS measures current performance, not trajectory or potential
  • AGI proximity: We make no claims about artificial general intelligence
  • Real-world deployment: Benchmark performance may not reflect production use
  • Specialized tasks: Domain-specific applications may vary significantly

Known Biases & Limitations

  • English-language bias in most benchmarks
  • Potential training data contamination (models may have seen benchmark questions)
  • Benchmarks may not capture emerging capabilities
  • Self-reported safety data from providers

We are committed to transparency about our methodology's limitations. If you identify issues or have suggestions for improvement, contact us.

Update Frequency

TRS scores are updated weekly, typically published on Mondays. When benchmark data sources update or new models are released, we incorporate the changes in the next weekly update.

Full Citation List

Academic Papers

  1. Chollet, F. (2019). "On the Measure of Intelligence." arXiv:1911.01547
  2. Hendrycks, D., et al. (2021). "Measuring Massive Multitask Language Understanding." ICLR 2021
  3. Chen, M., et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374
  4. Jimenez, C.E., et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024
  5. Rein, D., et al. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022
  6. Chiang, W., et al. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." arXiv:2403.04132
  7. Lin, S., et al. (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods." ACL 2022
  8. Hendrycks, D., et al. (2021). "Measuring Mathematical Problem Solving With the MATH Dataset." NeurIPS 2021
  9. Austin, J., et al. (2021). "Program Synthesis with Large Language Models." arXiv:2108.07732

Evaluation Platforms