--:--
CATEGORIES
AUTHORS

Google’s Android Bench flunks most AI coders - Gemini 3.1 Pro barely passes

Google’s Android Bench ranks LLMs on real-world Android tasks; top model scores 72 %, exposing a 56-point gap.

Google’s Android Bench flunks most AI coders - Gemini 3.1 Pro barely passes

Google just handed AI report cards to the smartest kids in class — and even the valedictorian only got a 72.

The company released Android Bench, an official leaderboard that pits large language models against real-world Android tasks sourced from public GitHub repositories. The first results are a reality check: scores range from 16% to 72%, a 56-point spread that says more about the state of AI-assisted development than most vendor demos ever will.

Gemini 3.1 Pro sits at the top, trailed by Claude Opus 4.6. The benchmark throws breaking Android-version changes, Wear OS networking tasks, and Jetpack Compose migrations at each model the kind of problems developers hit every day, not contrived toy examples. Each fix is validated by unit or instrumentation tests, so a patch that merely looks right still fails if the build won't run.

Google Android Image

Google open-sourced the dataset, methodology, and test harness on GitHub, and built in canary strings alongside manual trajectory reviews to make data contamination harder to hide.

Kirill Smelov, Head of AI Integrations at JetBrains, endorsed the approach:

"Measuring AI's impact on Android is a massive challenge, so it's great to see a framework that's this sound and realistic. This methodology is exactly the kind of rigorous evaluation Android developers need right now."

The benchmark is deliberately narrow — no agent scaffolding, no chain-of-thought scaffolding, just raw code-fixing ability. That design choice keeps the numbers honest and gives vendors little room to spin a low score into a capabilities story.

Google says it will grow the task set in complexity and quantity over time, and encourages LLM makers to use the leaderboard to identify gaps and improve. The long-term goal: close the distance between what developers can imagine and what they can actually ship on Android.

Source: Google Blog