You type into an AI assistant: "Drive me from downtown to the airport, avoid tolls, make it scenic, stop for coffee." Sometimes the agent nails it. More often, it suggests a highway, forgets the coffee stop, or sends you down a route that takes 45 minutes longer than it should.

This week a team from Amap — Alibaba's mapping platform, used by hundreds of millions of people across China — released something that explains exactly why this keeps happening, and gives the entire AI community a reproducible way to measure it.

They call it MobilityBench. It is the first large-scale benchmark for LLM-based route-planning agents built on real user queries and fully reproducible map data.

The Three Problems It Solves

Until now, testing AI navigation agents was messy. Most benchmarks used fake or overly simple routes that bore little resemblance to what people actually ask for. Live map services change every second, so results from one test run could never be reproduced in the next. And evaluations typically just checked "did it output a route?" rather than "did it actually respect everything the user asked for?"

MobilityBench fixes all three problems at once.

How It Works

MobilityBench is built on 100,000 real, anonymized queries that actual users made to Amap — voice and text requests from 350+ cities across 22 countries. These aren't synthetic prompts written by researchers. They're the messy, multi-constraint requests real people make every day.

The benchmark uses a deterministic API-replay sandbox. Instead of calling live map APIs during testing — which would give different answers every time traffic updates or a road closes — the team recorded exact API responses at the moment each query was made and replays them identically for every model tested. Think of it as freezing time so every model gets the exact same map data, the exact same traffic conditions, the exact same results.
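The post doesn't show the sandbox's internals, but the record-and-replay idea is easy to sketch. Below is a minimal Python illustration under our own assumptions: the class, the endpoint names, and the response fields are made up for illustration, not taken from the benchmark's code.

```python
import json
from typing import Any

class ReplaySandbox:
    """Record-and-replay wrapper around a map API (illustrative sketch).

    At dataset-creation time, real API calls are recorded; at evaluation
    time, the same calls return the recorded responses, so every model
    sees identical map data, traffic, and search results.
    """

    def __init__(self, recordings: dict[str, Any]):
        # recordings maps a canonical request key -> the response
        # captured at the moment the query was originally issued.
        self._recordings = recordings

    @staticmethod
    def _key(endpoint: str, params: dict[str, Any]) -> str:
        # Canonicalize the call so the same request always hits
        # the same recorded response.
        return endpoint + "?" + json.dumps(params, sort_keys=True)

    def call(self, endpoint: str, params: dict[str, Any]) -> Any:
        key = self._key(endpoint, params)
        if key not in self._recordings:
            # No live fallback: unseen calls fail loudly instead of
            # leaking non-reproducible data into the evaluation.
            raise KeyError(f"No recorded response for {key}")
        return self._recordings[key]


# Example: every run of every model gets this exact traffic snapshot.
sandbox = ReplaySandbox({
    ReplaySandbox._key("route/driving", {"from": "downtown", "to": "airport", "avoid": "tolls"}):
        {"eta_min": 38, "distance_km": 29.4, "tolls": 0},
})
print(sandbox.call("route/driving", {"from": "downtown", "to": "airport", "avoid": "tolls"}))
```

The design choice that matters is the hard failure on unrecorded calls: determinism only holds if no model can quietly fall back to live data mid-evaluation.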

The evaluation protocol goes well beyond "good route or bad route." It breaks performance into five pillars:

1. Instruction Understanding: Did it parse what the user actually wanted?
2. Planning: Did it sequence the right steps?
3. Tool Use: Did it call the correct APIs with the right parameters?
4. Decision Making: Did the final route satisfy every stated preference?
5. Efficiency: How many tokens and API calls did it take?
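The benchmark defines its own metrics for each pillar; as a rough mental model, here is a minimal Python sketch of what a per-task, per-pillar record could look like. The field names and the success rule are our illustrative assumptions, not the paper's scoring formulas.

```python
from dataclasses import dataclass

@dataclass
class PillarScores:
    """Per-task evaluation record (field names are illustrative)."""
    instruction_understanding: float  # constraints correctly extracted (0-1)
    planning: float                   # steps sequenced correctly (0-1)
    tool_use: float                   # APIs and parameters called correctly (0-1)
    decision_making: float            # final route honors stated preferences (0-1)
    tokens_used: int                  # efficiency: prompt + completion tokens
    api_calls: int                    # efficiency: number of tool invocations

    def satisfies_all_preferences(self, threshold: float = 1.0) -> bool:
        # A task only really succeeds if the final decision meets every
        # stated user preference, not just most of them.
        return self.decision_making >= threshold
```

Separating the pillars is the point: an agent can parse the request perfectly and still fail at tool use, or nail the API calls and still ignore a stated preference in the final route.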

What the Results Show

The benchmark tested twelve models — including GPT-5.2, GPT-4.1, Claude Opus 4.5, Claude Sonnet 4.5, Gemini 3 Pro, Gemini 3 Flash, Qwen3-235B, Qwen3-32B, DeepSeek-V3.2, and DeepSeek-R1 — using two popular agent frameworks: ReAct (flexible, step-by-step reasoning) and Plan-and-Execute (structured upfront planning).
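If you haven't met these two frameworks before, the control flow is the whole story. Here is a stripped-down Python sketch of each; `llm` and `call_tool` are hypothetical placeholders, not the harness MobilityBench actually uses.

```python
def react_agent(query, llm, call_tool, max_steps=10):
    """ReAct: interleave reasoning and tool calls, one step at a time."""
    history = [query]
    for _ in range(max_steps):
        step = llm(history)                  # model decides the next action
        if step["action"] == "finish":
            return step["answer"]            # e.g. the final route
        observation = call_tool(step["tool"], step["args"])
        history.append((step, observation))  # feed the result back in

def plan_and_execute_agent(query, llm, call_tool):
    """Plan-and-Execute: commit to a full plan up front, then run it."""
    plan = llm(f"Plan the tool calls needed for: {query}")   # one planning call
    observations = [call_tool(s["tool"], s["args"]) for s in plan["steps"]]
    return llm(["Summarize a route from:", observations])    # one synthesis call
```

The tradeoff in the results falls straight out of this structure: ReAct pays a model call after every observation and can correct course mid-route, while Plan-and-Execute commits to one plan up front and spends far fewer tokens doing it.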

The results split cleanly. Current agents handle the basics well: simple point-to-point directions, traffic queries, and nearby-place lookups are largely solved. But performance drops sharply on Preference-Constrained Route Planning — the category that matters most to real users.

When users add constraints like "avoid highways," "scenic route under budget," "family-friendly," or "must stop for lunch," even the strongest models struggle. Preference-constrained tasks make up only 11.3% of the benchmark's task distribution, but they represent the gap between a navigation tool and a navigation agent that actually understands you.

ReAct-style agents outperformed Plan-and-Execute on overall success rates, but consumed significantly more tokens and took longer to complete. The classic speed-versus-accuracy tradeoff, now measured with real-world routing data for the first time.

Figure 1: Overview of MobilityBench — a systematic benchmark for evaluating route-planning agents. The framework combines real user queries from Amap, a deterministic API-replay sandbox, and a five-pillar evaluation protocol. Source: Song et al., 2026.

Why This Matters

Mobility is one of the highest-value real-world applications for AI agents. Better benchmarks here mean faster progress toward agents that are genuinely useful — not just impressive in demos.

For everyday users, this is the roadmap to an AI navigation assistant that actually understands your preferences rather than just optimizing for the fastest route. For developers and companies building agent products, MobilityBench provides the first fair, public, reproducible test for measuring real progress.

For the broader agent evaluation community, this benchmark fills a gap. Most existing agent benchmarks — GAIA, SWE-bench, tau-bench, OSWorld — focus on coding, desktop automation, or general knowledge tasks. None of them test multi-constraint preference satisfaction in a physical-world planning domain.

At TrainingRun.AI, we're closely evaluating MobilityBench as a potential addition to our TRAgents scoring formula — particularly for the Tool Reliability pillar, where real-world API grounding has been underrepresented. If the benchmark gains community adoption and independent validation over the coming weeks, expect to see mobility-specific metrics reflected in our agent rankings.

The entire benchmark — data, evaluation toolkit, and code — is open source on GitHub.

Read the original paper: "MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios" — arXiv: 2602.22638


What do you think — will standardized benchmarks like MobilityBench finally make AI agents reliable for real-world planning? Drop a reply on X. We read every one.

The TrainingRun.AI Team

David Solomon
david@trainingrun.ai