Inside our AI calorie-accuracy benchmark
The full method, sampling and error math behind our headline AI calorie-accuracy benchmark — how we build the reference meal set from our 12,000-user study and compute MAPE.
Key takeaway
Drawing on a 2.5-year study of more than 12,000 users across 15 countries and over 1.4 million data points, we compare app calorie estimates against reference meal values. The best app’s estimates were off by 8% on average (MAPE); the weakest app’s error was roughly three times that, and errors were largest on mixed and non-Western dishes.
Why we measure against a real-world dataset
You cannot rank AI calorie accuracy from app-store reviews or screenshots. The only honest way is to know the true answer for a meal, then see how close each app gets — and to do that across the kind of food people actually eat, not a tidy demo set. So our accuracy benchmark draws on the meal data inside our 2.5-year study of more than 12,000 users across 15 countries, which generated over 1.4 million data points spanning nutrition, logging and meal information.
From that dataset we build a large reference meal set: meals whose true energy and macros are established from reference databases and, for a controlled subset, from weighed values. Each reference meal is then logged through every app’s photo flow.
Sampling: what’s on the plates
Because the meals come from real users in North America, Europe, Asia and South America, the benchmark reflects real eating rather than demo food. It spans:
- Home-cooked single foods and composed plates
- Mixed bowls, curries, stir-fries and saucy dishes (the hard cases)
- Packaged snacks and branded items
- Restaurant and takeaway portions
- Western, East and South Asian, Latin American and Middle-Eastern cuisines
Within that international spread we over-represent the hard cases on purpose: that’s where apps diverge and where errors hurt real users most.
Controls
For the controlled accuracy comparison, each reference meal is presented to every app under the same conditions — the same image, logged through each app’s photo flow — so the test is about the app’s model, not our camera technique. The breadth of the underlying dataset is what lets those controlled comparisons generalise beyond a single kitchen.
The error metric: MAPE
For each meal we compute the absolute percentage error: the absolute difference between the app’s calorie estimate and the reference value, divided by the reference value. We then average across all meals to get mean absolute percentage error (MAPE). We use MAPE because it’s comparable across small and large meals and is intuitive: a MAPE of 10% means estimates are off by about a tenth of the true value on average.
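As a rough illustration of the metric described above, here is a minimal Python sketch of a MAPE calculation. The function name `mape` and the calorie figures are assumptions made up for this example; they are not values from the benchmark or the code used to run it.

```python
def mape(estimates, references):
    """Return mean absolute percentage error, in percent.

    estimates:  calorie values reported by an app (hypothetical inputs)
    references: the matching reference calorie values
    """
    if len(estimates) != len(references):
        raise ValueError("estimates and references must be the same length")
    if not references:
        raise ValueError("at least one meal is required")

    errors = []
    for est, ref in zip(estimates, references):
        if ref <= 0:
            # Percentage error is undefined for a zero or negative reference.
            raise ValueError("reference values must be positive")
        errors.append(abs(est - ref) / ref * 100.0)

    return sum(errors) / len(errors)


# Illustrative numbers only, not benchmark data.
references = [500.0, 200.0, 800.0]
estimates = [550.0, 180.0, 800.0]

# Per-meal errors are 10%, 10% and 0%, so the mean is about 6.7%.
score = mape(estimates, references)
print(f"MAPE: {score:.1f}%")
```

Because each error is divided by its own reference value, a 50 kcal miss on a 500 kcal meal and a 20 kcal miss on a 200 kcal meal both count as 10%, which is what makes the figure comparable across small and large meals.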
Headline result
The best app landed at a MAPE of 8%; the field spread from there to 24%. Errors were consistently largest on mixed bowls, sauces and non-Western dishes. Full per-app numbers appear in our accuracy benchmark guide.
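A breakdown by dish type like the one described above could be produced by grouping per-meal errors before averaging. The sketch below is one possible way to do it; the function name `mape_by_category`, the category labels and every number in it are hypothetical and chosen only to show the mechanics.

```python
from collections import defaultdict


def mape_by_category(rows):
    """Return {category: MAPE in percent}.

    rows: iterable of (category, estimate_kcal, reference_kcal) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    buckets = defaultdict(list)
    for category, est, ref in rows:
        if ref <= 0:
            raise ValueError("reference values must be positive")
        buckets[category].append(abs(est - ref) / ref * 100.0)

    # Average the per-meal percentage errors within each category.
    return {cat: sum(errs) / len(errs) for cat, errs in buckets.items()}


# Illustrative rows only, not benchmark data.
rows = [
    ("single food", 95.0, 100.0),
    ("single food", 210.0, 200.0),
    ("mixed bowl", 480.0, 600.0),
    ("mixed bowl", 520.0, 400.0),
]

result = mape_by_category(rows)
worst = max(result, key=result.get)  # category with the highest MAPE
print(result, worst)
```

Reporting the per-category figures alongside the overall MAPE keeps a large error on one dish type from being hidden by many easy meals in the average.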
Limitations
No dataset captures every food or every condition a user will meet, and apps update their models over time. We re-run the benchmark against refreshed data at least twice a year and after major model updates, and we publish the date of the most recent run on every guide.