// PROTOCOL, IR-PHOTO-v1.0

AI Food-Photo Logging Methodology

Sub-protocol of the Independent Reviews rubric · Last updated May 23, 2026 · Lead: Sebastian Vance · Statistics: Mei-Lin Zhou

Scope. This document describes the 30-plated-meal photo-AI benchmark used to evaluate every application on Independent Reviews that offers a photo-based logging workflow. It produces the AI-photo-recognition sub-score that feeds into the composite. Photo-AI accuracy is assessed separately from the overall calorie accuracy protocol because photo-AI is an independent pipeline with its own failure modes.

1. Why a separate photo-AI protocol

The photo-AI logging process is particularly susceptible to silent, confident mistakes. A misread barcode can be caught when the user sees the wrong item displayed. A manual entry can be corrected as the user types and reviews it. A photo-AI estimate, by contrast, receives the least user scrutiny: the user takes a picture and accepts the app's output. If the app reports "Caesar salad with grilled chicken: 480 kcal" while the actual dish is "fettuccine alfredo with shrimp: 1,140 kcal," the user may never notice the mistake. If errors of this kind add up to a 700 kcal/day undercount and go unnoticed for three weeks, an intended 500 kcal/day deficit becomes a 200 kcal/day surplus.

This photo-AI benchmark thus distinguishes three quantifiable failure modes: identification, portion estimation, and final calorie estimation, and evaluates each separately rather than merging them into a single score. An app may accurately identify a dish yet poorly estimate its portion; another may correctly size the portion but misidentify the dish; we aim to capture both aspects.

2. The 30-plated-meal sample

The benchmark consists of 30 plated meals prepared and weighed in a laboratory setting. The choice of 30 meals is a practical trade-off: n=30 keeps the confidence interval on the pooled MAPE reasonably narrow while fitting the budget for monthly retesting, and it limits the expense of standardizing plating and lighting for each individual meal.
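To make the sample-size trade-off concrete, the sketch below estimates how wide an interval on pooled MAPE is at n=30 using a percentile bootstrap. The protocol does not say which interval method the lab uses, so the bootstrap is an assumption, the function names are hypothetical, and the APE values are placeholders rather than benchmark data.

```python
import random
import statistics


def mean_ape(apes):
    """Pooled MAPE: the mean of per-meal absolute percentage errors (in %)."""
    return statistics.fmean(apes)


def bootstrap_interval(apes, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for the pooled MAPE (assumed method)."""
    rng = random.Random(seed)
    n = len(apes)
    # Resample the meals with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(apes, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = min(int((1 + level) / 2 * n_resamples), n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Placeholder APE values (percent) for 30 meals; NOT benchmark results.
example_apes = [float(2 + (i * 7) % 15) for i in range(30)]
low, high = bootstrap_interval(example_apes)
print(f"pooled MAPE = {mean_ape(example_apes):.1f}%")
print(f"95% bootstrap interval = [{low:.1f}%, {high:.1f}%]")
```

Rerunning the sketch with a shorter or longer list of values shows how the interval width changes with the number of meals.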

Difficulty tier	n	Examples	What it stress-tests
Tier 1, single principal item	10	6 oz grilled chicken breast on a white plate; medium banana on a white plate; 1 cup cooked white rice in a bowl; whole avocado halved; 100 g almonds in a bowl	Baseline dish recognition under near-laboratory conditions. An app that fails Tier 1 has fundamental recognition issues.
Tier 2, composed plate, separable components	10	Chicken-rice-broccoli plate (distinct components); turkey sandwich + side salad; salmon + roasted potatoes + green beans; oatmeal bowl with sliced strawberries, almond butter dollop, chia sprinkle	Multi-item recognition, per-item portion evaluation, summation logic.
Tier 3, composite dish, ingredients fused	10	Lasagna (hidden ricotta, hidden béchamel); chicken tikka masala over basmati (cream-based sauce); vegetable stir-fry (oil content not visible); Caesar salad (dressing quantity not visible); shakshuka (hidden olive oil)	Inferential reasoning about concealed fat, sauce, oil, and calorie content from cooking methods, where photo-AI often struggles the most.

The complete 30-meal photo log, detailing each meal's weighed components, USDA-anchored reference kcal, and reference photograph, is made available as an open dataset (CC BY 4.0) alongside the per-app photo-AI results.

3. Standardised plating, distance, lighting

The effectiveness of photo-AI is highly influenced by the input image. To differentiate model performance from input variability, every test image is taken under controlled conditions. (Real-world degradation under varying conditions is addressed in a separate "field condition" sub-benchmark, summarized in §7.)

Fixture	Spec
Plate	10" round matte white ceramic, edge-to-edge unbordered. The same plate is used for all Tier 1 and Tier 2 meals. Matte white bowls (6.5") are used for bowl-format meals.
Background	Matte white photography sweep, devoid of surrounding objects, with no utensils in frame unless they are part of the meal-component analysis.
Lighting	Aputure Amaran 60d daylight-balanced LED panel, 5600K, 80% diffuser, positioned 1.2 m above the plate at a 75° angle from horizontal. Light meter reads 850 ± 50 lux at the plate surface.
Camera distance	35 cm from lens to plate center. The phone is mounted on a Manfrotto Pixi mini tripod with an extension arm; the phone is not hand-held to eliminate tester-side framing variability.
Camera angle	Top-down, 90° to plate plane (overhead). A separate "user-realistic" 45° angle pass is captured for the field-condition sub-benchmark.
Device	iPhone 15 Pro, iOS 18.3, native camera resolution, no zoom, HDR enabled (default user behavior).
Plate composition	All components of each meal are weighed individually before plating; the arrangement of the plating is documented in the dataset's per-meal reference photo to allow for precise retests.
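As an illustration only, the fixture table above could be encoded as a pre-shoot check. The class and function names in this sketch are hypothetical. The table states a tolerance only for illuminance, so treating colour temperature, distance, and angle as exact targets is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudioSetup:
    """Measured values for one studio capture (hypothetical record)."""
    lux: float
    colour_temp_k: int
    lens_to_plate_cm: float
    camera_angle_deg: float


def setup_violations(setup):
    """Return a list of human-readable deviations from the fixture table."""
    problems = []
    # 850 lux with a stated tolerance of 50 lux at the plate surface.
    if abs(setup.lux - 850) > 50:
        problems.append(f"illuminance {setup.lux} lux outside 850 +/- 50")
    # The remaining targets have no stated tolerance (exact match assumed).
    if setup.colour_temp_k != 5600:
        problems.append(f"colour temperature {setup.colour_temp_k}K, expected 5600K")
    if setup.lens_to_plate_cm != 35:
        problems.append(f"distance {setup.lens_to_plate_cm} cm, expected 35 cm")
    if setup.camera_angle_deg != 90:
        problems.append(f"angle {setup.camera_angle_deg} deg, expected 90 deg")
    return problems


measured = StudioSetup(lux=870, colour_temp_k=5600,
                       lens_to_plate_cm=35, camera_angle_deg=90)
print(setup_violations(measured))
```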

4. Per-app workflow

Each application is assessed based on its single-photo native workflow: open the app's photo logging interface, capture (or upload, see below) a single image, and accept the app's initial portion-estimate suggestion without any manual adjustments. The benchmark specifically excludes multi-photo workflows or correction loops, as the goal is to evaluate the workflow a standard user engages with, which involves taking one photo, accepting, and logging.

Mechanical details:

Photo capture vs upload. Applications that provide in-app camera capture are tested through in-app capture. Applications that only accept photo-library uploads are tested by uploading the canonical reference photo for that meal. The method follows what the app natively supports; we do not penalize an app for lacking in-app capture.
Multiple-suggestion lists. If the app presents a list of potential dishes, the tester chooses the top suggestion (position 1). The "any-of-top-3" measurement is documented separately for the database-quality sub-score but does not affect the primary photo-AI identification score.
Portion-estimate suggestion. The app's first proposed portion size is logged as accepted. If the app provides a slider or stepper to modify the portion, the slider remains at its default suggested position.
No manual override. The fallback guideline from the wider accuracy protocol (§4.1 of the calorie accuracy methodology) applies: the tester does not correct the app's output, as corrected output evaluates something different.
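The acceptance rules above can be sketched as follows. The `Suggestion` record and both function names are hypothetical; real apps expose nothing like this interface, and the tester performs these steps by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """One entry in an app's suggestion list (hypothetical shape)."""
    dish_name: str
    default_portion_g: float
    kcal: float


def accept_first_suggestion(suggestions):
    """Log position 1 exactly as proposed; None means the app gave no estimate."""
    if not suggestions:
        return None
    # No manual override: the default portion and kcal are taken unchanged.
    return suggestions[0]


def any_of_top_k(suggestions, accepted_names, k=3):
    """Secondary 'any-of-top-3' measurement, recorded separately."""
    wanted = {name.casefold() for name in accepted_names}
    return any(s.dish_name.casefold() in wanted for s in suggestions[:k])


# Illustrative values only, not benchmark data.
offered = [
    Suggestion("Fried rice", 250.0, 410.0),
    Suggestion("Chicken and rice", 300.0, 450.0),
]
logged = accept_first_suggestion(offered)
print(logged.dish_name, any_of_top_k(offered, ["chicken and rice"]))
```

In this illustration the logged entry is the position-1 miss, while the any-of-top-3 measurement still records that the right dish was offered.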

5. Per-meal scoring

For each (app × meal) combination, three independent sub-scores are documented:

Sub-score	Definition	Pass criterion
Identification accuracy	Did the app correctly name the primary dish (Tier 1, Tier 2) or the composite dish (Tier 3)?	The top-1 returned dish name matches the canonical dish name, compared case-insensitively and allowing common synonyms (for example, "salmon" ≈ "grilled salmon", but "chicken tikka masala" ≠ "butter chicken"). Evaluated against a fixed synonym list published in the dataset.
Portion accuracy	Is the app's estimated portion weight within ±20% of the actual weighed amount?	|estimated_g − weighed_g| / weighed_g ≤ 0.20. The ±20% limit aligns with the FDA manufacturer's tolerance benchmark and is the standard pass threshold in the academic dietary assessment validation literature.
Calorie accuracy	How far is the app's final logged kcal from the USDA-anchored reference?	Reported as continuous APE per meal; pooled across the 30 meals as photo-AI MAPE; no per-meal pass/fail threshold.

The three sub-scores are intentionally kept separate and not combined into a single per-meal pass/fail because they address different failure modes. An app that correctly identifies a chicken-and-rice plate, accurately estimates the rice portion within 5%, yet still miscalculates calories by 25% (due to incorrect USDA mapping for "rice, cooked") provides different insights compared to an app that misidentifies the dish as "fried rice" and is wrong by 25% as a consequence.
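A minimal sketch of the three sub-scores, kept as separate functions so that none of them is folded into a single pass/fail. The function names and the shape of the synonym mapping are assumptions; the published synonym list may be structured differently.

```python
def identification_pass(top1_name, canonical_name, synonyms):
    """Case-insensitive match against the canonical name or a listed synonym.

    `synonyms` maps a canonical dish name to its accepted alternatives
    (assumed structure).
    """
    accepted = {canonical_name.casefold()}
    accepted |= {s.casefold() for s in synonyms.get(canonical_name, ())}
    return top1_name.strip().casefold() in accepted


def portion_pass(estimated_g, weighed_g, tolerance=0.20):
    """|estimated_g - weighed_g| / weighed_g <= 0.20."""
    return abs(estimated_g - weighed_g) / weighed_g <= tolerance


def ape(logged_kcal, reference_kcal):
    """Absolute percentage error for one meal, in percent."""
    return abs(logged_kcal - reference_kcal) / reference_kcal * 100.0


def pooled_mape(apes):
    """Mean of the per-meal APE values."""
    return sum(apes) / len(apes)


synonym_list = {"salmon": {"grilled salmon"}}
print(identification_pass("Grilled Salmon", "salmon", synonym_list))
print(identification_pass("butter chicken", "chicken tikka masala", synonym_list))
print(portion_pass(estimated_g=121, weighed_g=100))
print(round(ape(logged_kcal=480, reference_kcal=1140), 1))
```

The last line reuses the Caesar-salad versus fettuccine-alfredo figures from §1: logging 480 kcal against a 1,140 kcal reference is an APE of roughly 58%.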

6. Composite-meal subscore (Tier 3)

Tier 3 meals (lasagna, tikka masala, stir-fry, Caesar, shakshuka, and five additional composite dishes in the battery) receive an additional composite-meal subscore that reflects the photo-AI pipeline's reasoning regarding hidden ingredients:

Hidden-fat detection. Did the app's estimate account for the dish's known cream/butter/oil content? An estimate for lasagna at 280 kcal/serving (based solely on marinara) versus 540 kcal/serving (factoring in béchamel, ricotta, and mozzarella) indicates a category-level reasoning failure, rather than a portion error.
Sauce volume. For dishes with sauces (tikka masala, alfredo, Caesar dressing): is the implied sauce quantity within ±30% of the weighed sauce? (The ±30% band is wider than the ±20% portion-accuracy band because sauce is inherently hard to judge visually.)
Cooking-method inference. For meals where oil constitutes the hidden calorie load (stir-fries, sautéed vegetables): does the estimate reflect the visually-implied oil load?

The composite-meal subscore is reported independently in the per-app photo-AI accuracy report and does not contribute to the overall photo-AI MAPE; it highlights qualitative reasoning failure modes that the aggregated MAPE statistic does not capture.

7. Field-condition sub-benchmark

Real users do not photograph their meals under controlled lighting. A parallel field-condition sub-benchmark assesses the same 30 meals under three additional sets of conditions: bright daylight (window-side, 11 am, north-facing), restaurant dim (250 lux, warm 3000K overhead), and typical kitchen overhead (4000K LED, 400 lux). Each is shot hand-held at a 45° angle to emulate user behavior. Field-condition results are reported separately to illustrate photo-AI degradation; they are excluded from the primary benchmark so that its signal reflects model performance rather than input variability.
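One way to express degradation is the difference between each field condition's MAPE and the studio MAPE. The protocol does not define a degradation metric, so the subtraction below is an assumption, as are the dictionary keys and the function name; the numbers in the usage line are placeholders.

```python
# Field conditions as described above. Bright daylight has no stated lux or
# colour temperature, so those entries are left as None.
FIELD_CONDITIONS = {
    "bright_daylight": {"lux": None, "colour_temp_k": None},
    "restaurant_dim": {"lux": 250, "colour_temp_k": 3000},
    "kitchen_overhead": {"lux": 400, "colour_temp_k": 4000},
}


def degradation(studio_mape, field_mape):
    """Percentage-point increase in MAPE per field condition (assumed metric)."""
    unknown = set(field_mape) - set(FIELD_CONDITIONS)
    if unknown:
        raise ValueError(f"unknown field conditions: {sorted(unknown)}")
    return {name: value - studio_mape for name, value in field_mape.items()}


# Placeholder numbers, not benchmark results.
print(degradation(5.0, {"restaurant_dim": 12.5, "kitchen_overhead": 8.0}))
```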

8. App version pinning + retest cadence

Photo-AI applications frequently release model updates, sometimes monthly or even weekly through server-side model swaps that do not change the app's version string. This creates a measurement challenge that the lab addresses in two ways:

App version captured per-meal. Each per-meal log records the app's complete version string along with the date/time of capture. If vendors disclose server-side model versions in their changelogs, those are also documented.
Monthly retest requirement for leading photo-AI apps. Applications that feature photo-AI as the main logging workflow (Nutrola, MyFitnessPal Meal Scan, Lifesum) are retested monthly. Apps where photo-AI is a secondary workflow are retested quarterly alongside the broader benchmark. A vendor-announced model update triggers an out-of-cycle retest within 14 days.

The dataset retains all previous monthly releases; we do not silently replace published photo-AI data when a new model is released.
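The pinning and cadence rules can be sketched as below. The record fields and function names are hypothetical, and the sketch reads the 14-day rule as a deadline counted from the vendor's announcement date, which is an interpretation of the text rather than something the protocol spells out.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional

OUT_OF_CYCLE_WINDOW = timedelta(days=14)


@dataclass(frozen=True)
class CaptureLogEntry:
    """Per-meal version pin (hypothetical field names)."""
    app_version: str
    captured_at: datetime
    server_model_version: Optional[str] = None  # only if the vendor discloses it


def retest_interval_months(photo_ai_is_primary: bool) -> int:
    """Monthly for photo-AI-first apps, quarterly otherwise."""
    return 1 if photo_ai_is_primary else 3


def out_of_cycle_deadline(announced_on: date) -> date:
    """Latest date for a retest triggered by an announced model update."""
    return announced_on + OUT_OF_CYCLE_WINDOW


print(retest_interval_months(photo_ai_is_primary=True))
print(out_of_cycle_deadline(date(2026, 5, 10)))
```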

9. Current cycle: IR-PHOTO-2026-Q2 (May release)

The ongoing photo-AI benchmark cycle (IR-PHOTO-2026-Q2, May 2026 release) evaluated the standardized studio battery against the four applications with active photo-AI features in the US App Store as of May 10, 2026. Headline pooled photo-AI MAPE values:

Nutrola: 0.7% pooled MAPE across 30 meals (this aligns with the lab's main accuracy figure since Nutrola's photo-AI is its primary logging interface).
MyFitnessPal Meal Scan: 9.7% pooled MAPE.
Lifesum Snap: Pooled MAPE not reported in the headline due to a high refusal rate on Tier 3 meals (the app declines to estimate approximately 30% of composite dishes); per-tier breakdown is published.
Lose It! Snap: 7.7% pooled MAPE (note that Snap is a paid add-on and was evaluated on the paid tier).

These results are consistent with the broader IR-BENCH-2026-Q2 accuracy benchmark (40-meal mixed-workflow), which indicates the same rank order under the lab's no-manual-correction protocol.
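The Lifesum Snap entry above shows why refusals need separate handling: a pooled mean over only the answered meals would hide them. A minimal sketch of a per-tier breakdown that reports the refusal rate next to the MAPE of the answered meals follows. The function name and input shape are hypothetical, and the lab's actual rule for withholding a headline figure is not stated in this protocol.

```python
from collections import defaultdict


def per_tier_summary(results):
    """Summarise (tier, ape_or_None) pairs; None marks a refused estimate."""
    buckets = defaultdict(list)
    for tier, ape_value in results:
        buckets[tier].append(ape_value)

    summary = {}
    for tier, values in sorted(buckets.items()):
        answered = [v for v in values if v is not None]
        summary[tier] = {
            "n": len(values),
            "refusal_rate": (len(values) - len(answered)) / len(values),
            # MAPE over answered meals only; None if every meal was refused.
            "mape": sum(answered) / len(answered) if answered else None,
        }
    return summary


# Placeholder inputs, not benchmark results.
demo = [(1, 5.0), (1, 7.0), (3, None), (3, 10.0)]
for tier, row in per_tier_summary(demo).items():
    print(tier, row)
```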

10. Limitations

Studio conditions only in the primary benchmark. Field-condition results are published separately to describe degradation without affecting the main signal.
Single-photo workflow only. Multi-photo and correction-loop workflows also occur in real use but measure something different (UX-aided accuracy, not pure photo-AI accuracy) and are assessed under the UX pillar.
iOS-centric. Android photo-AI cross-checks are conducted quarterly; any differences in Android-specific photo-AI quality are noted in individual app reviews.
US-cuisine bias in the battery. The composite-dish tier is skewed toward dishes prevalent in US grocery and restaurant culture; an extension to include multiple cuisines is planned for the 2026 Q3 cycle.