// PROTOCOL, IR-SCORE-v1.0

Composite Scoring System

Sub-protocol of the Independent Reviews rubric · Last updated May 23, 2026 · Weights review chair: Sebastian Vance · Statistics: Mei-Lin Zhou · Nutrition-science gating: Helena Brandt

Scope. This document describes how the lab's measurements across the six pillars (calorie estimation accuracy, database integrity, photo-AI capabilities, macro tracking, user experience, and pricing) combine into a single composite score for each app. It is the reference for how a score like "Nutrola, 96.4/100" is derived, and it details the tie-breaking and exclusion criteria that shape ranked results.

1. The six pillars and their weights

Each calorie counter app that is ranked receives scores based on six weighted pillars. These weights remain consistent across all site rankings to ensure comparability across categories, and they undergo an annual review by Sebastian, Mei-Lin, and Helena. The upcoming review is set for August 2026; the weights have remained unchanged since version 1.0 was released in September 2025.

#	Pillar	Weight	Source protocol
1	Accuracy, calorie estimation MAPE	25%	Calorie accuracy v1.0 (40-meal weighed reference)
2	Database quality, entry curation + provenance	20%	Barcode v1.0 (60-product) + database-quality sub-protocol
3	AI photo recognition	20%	Photo-AI v1.0 (30-plated-meal)
4	Macro tracking accuracy	15%	Macro accuracy sub-protocol (40-meal × protein/carb/fat MAPE)
5	User experience	10%	UX scoring rubric (workflow speed, friction-of-correction, dark patterns)
6	Price & value	10%	Annual cost ÷ usable-feature count

The 25/20/20/15/10/10 split reflects the lab's view that accuracy, together with the two pathways that produce it (database quality and photo-AI), is the primary signal; these three pillars account for 65% of the composite score. Macro tracking sits at 15% because it depends on calorie accuracy: a meal has to be logged accurately before its protein-per-meal figure can be trusted. User experience and pricing share the remaining 20% because, while significant, they are recoverable problems. A highly accurate app with a poor user experience can still be valuable with effort, whereas an inaccurate app with an excellent user experience can seriously mislead users.

2. Rationale for these specific weights

The weights were set in September 2025 in a formal meeting with Sebastian (chair), Mei-Lin, and Helena. Three alternatives that were considered and rejected are worth recording here because readers suggest them frequently:

"Why not 40% for accuracy?" This was considered but rejected because if accuracy were weighted at 40%, an app with slightly better MAPE would dominate even if it performed poorly in other areas. The 25% weight ensures that accuracy remains a priority without being the sole focus.
"Why is UX given only 10%?" The lab's audience seeks to identify the most accurate app, not necessarily the most visually appealing one. While UX is important, it is not the primary reason readers choose a calorie tracker.
"Why is 'community features' not included as a pillar?" Community features can be manipulated, the lab cannot independently verify their effectiveness, and they do not influence tracking accuracy. We welcome differing opinions on this; readers who feel otherwise can reach out via email at editor@independent-reviews.org.

3. Scoring rubric for each pillar on a 0–100 scale

Each pillar is evaluated using a 0–100 scale prior to weighting. The scoring methods are pre-determined and published; there is no discretion given to analysts for individual apps.

3.1 Accuracy (25%)

The accuracy score is based on pooled MAPE derived from the 40-meal benchmark:

accuracy_score = max(0, min(100, 100 − (pooled_MAPE × 4)))

Anchor points: 0% MAPE → 100; 5% MAPE → 80; 10% MAPE → 60; 15% MAPE → 40; 25% MAPE → 0. The linear-with-clamp approach is intentionally strict, with each percentage point of MAPE resulting in a deduction of four points from the pillar score. The headline figures for the 2026 Q2 cycle correspond to: Nutrola 97.2 (MAPE ±0.7%); Cronometer 88.8 (±2.8%); MacroFactor 88.4 (±2.9%); Lose It! 69.2 (±7.7%); MyFitnessPal 61.2 (±9.7%).
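As a minimal sketch, the linear-with-clamp formula above can be written as a small Python function. The function name and the convention of passing MAPE as a percentage (2.8 for 2.8%) are illustrative assumptions, not the lab's actual ranking script.

```python
def accuracy_score(pooled_mape_pct: float) -> float:
    """Map pooled MAPE (in percent, e.g. 2.8 for 2.8%) to a 0-100 pillar score."""
    # Each percentage point of MAPE costs four points; the result is clamped to [0, 100].
    return max(0.0, min(100.0, 100.0 - pooled_mape_pct * 4.0))


if __name__ == "__main__":
    # Anchor points from the text, plus one value past the 25% floor.
    for mape in (0.0, 5.0, 10.0, 15.0, 25.0, 30.0):
        print(f"{mape:>5.1f}% MAPE -> {accuracy_score(mape):.1f}")
```

Any MAPE at or above 25% maps to 0 because of the lower clamp, so the function stops distinguishing between apps past that point.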

3.2 Database quality (20%)

This is a composite score made up of four 0–25 sub-scores: coverage (hit rate from a 50-item search panel), verification (proportion of verified entries among sampled data), freshness (delay in updates for chain menus and reformulated products), and noise resilience (handling of ambiguous queries). These scores are summed to produce a 0–100 pillar score. The complete sub-rubric will be shared upon the release of the database-quality protocol.

3.3 AI photo recognition (20%)

This is derived from the photo-AI protocol: a weighted combination of top-1 identification (40 points), top-3 identification (20 points), portion-MAPE-derived score (30 points), and graceful-failure behavior (10 points). Apps lacking a photo-AI feature will have this pillar excluded, and the 20% weight will be proportionally redistributed among the remaining five pillars, with full disclosure in the review header.
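The proportional redistribution can be illustrated with a short sketch. It assumes "proportionally" means that each remaining weight is divided by the sum of the remaining weights, so the five surviving pillars again total 100%. The dictionary keys and the function name are hypothetical.

```python
BASE_WEIGHTS = {
    "accuracy": 0.25,
    "database": 0.20,
    "photo_ai": 0.20,
    "macros": 0.15,
    "ux": 0.10,
    "price": 0.10,
}


def redistribute(weights: dict[str, float], excluded: set[str]) -> dict[str, float]:
    """Drop excluded pillars and rescale the remaining weights to sum to 1."""
    kept = {name: w for name, w in weights.items() if name not in excluded}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no pillars left to score")
    # Dividing by the remaining total preserves the ratios between pillars.
    return {name: w / total for name, w in kept.items()}


if __name__ == "__main__":
    for name, w in redistribute(BASE_WEIGHTS, {"photo_ai"}).items():
        print(f"{name:<9} {w:.4f}")
```

Under this reading, an app without photo-AI is scored with accuracy at 31.25%, database quality at 25%, macro tracking at 18.75%, and user experience and price at 12.5% each.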

3.4 Macro tracking accuracy (15%)

This score is based on pooled MAPE for protein, carb, and fat estimates from the same 40-meal set, using the same anchoring function as accuracy. An additional sub-score for tracking fiber, saturated fat, sugar, and sodium is included at 20% of the pillar weight.
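One possible reading of this pillar, sketched below, is an 80/20 blend: the MAPE-derived macro score carries 80% of the pillar and the fiber/saturated fat/sugar/sodium sub-score carries the remaining 20%. The blend, the 0–100 scale of the sub-score, and all names in the code are assumptions; the text does not spell out how the two parts are combined.

```python
def anchored_score(mape_pct: float) -> float:
    """Same linear-with-clamp anchoring as the accuracy pillar (4 points per MAPE point)."""
    return max(0.0, min(100.0, 100.0 - mape_pct * 4.0))


def macro_pillar(macro_mape_pct: float, micro_subscore: float,
                 micro_share: float = 0.20) -> float:
    """Blend the MAPE-derived macro score with a 0-100 micronutrient sub-score.

    ASSUMPTION: the sub-score takes `micro_share` of the pillar and the
    MAPE-derived score takes the rest; the protocol text does not confirm this.
    """
    if not 0.0 <= micro_subscore <= 100.0:
        raise ValueError("micro_subscore must be on a 0-100 scale")
    macro_score = anchored_score(macro_mape_pct)
    return (1.0 - micro_share) * macro_score + micro_share * micro_subscore


if __name__ == "__main__":
    # Hypothetical inputs: 5% pooled macro MAPE, micronutrient sub-score of 50.
    print(macro_pillar(5.0, 50.0))
```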

3.5 User experience (10%)

This consists of five sub-dimensions, each rated from 0–20: speed of common tasks (median time to log a food item, save a meal, scan a barcode, log a photo); friction-of-correction (number of taps required to fix a mis-logged entry); accessibility (support for VoiceOver/TalkBack, font scaling, WCAG 2.2 AA color contrast on key screens); presence and frequency of dark patterns (interruptions by paywalls, hidden cancellation options, sub-traps); and presence of patterns that may pose an eating-disorder risk (gamified streaks, leaderboard pressure, framing restriction as virtue; this sub-dimension is Helena-gated).

3.6 Price & value (10%)

This score is determined by the annual cost in USD at the most common upgrade tier divided by the count of materially useful features provided by the app, normalized against the category median. The scoring method does not follow a "lowest price wins" criterion; a free app with an inadequate database for logging a proper meal does not achieve a score of 100. The pillar is driven by value rather than just the headline price.

4. The composite formula

The composite score is calculated as a simple weighted sum:

composite = 0.25 · accuracy + 0.20 · database + 0.20 · photo_ai + 0.15 · macros + 0.10 · ux + 0.10 · price

The final score is rounded to one decimal place and presented as the prominent "X / 100" figure in every ranked review and best-of listing. We do not apply curve-grading across rankings. An app that scores 78.3 in a category where the highest score is 81.2 will be listed as 78.3, not adjusted to a higher number for the sake of appearance. Conversely, the top score in a less competitive category is not adjusted downward.
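The weighted sum and the one-decimal rounding can be sketched as follows. The pillar scores in the example are hypothetical values chosen for illustration and do not belong to any ranked app; the function and key names are likewise assumptions.

```python
WEIGHTS = {
    "accuracy": 0.25,
    "database": 0.20,
    "photo_ai": 0.20,
    "macros": 0.15,
    "ux": 0.10,
    "price": 0.10,
}


def composite(pillars: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of 0-100 pillar scores, rounded to one decimal place."""
    missing = set(weights) - set(pillars)
    if missing:
        raise KeyError(f"missing pillar scores: {sorted(missing)}")
    total = sum(weights[name] * pillars[name] for name in weights)
    return round(total, 1)


if __name__ == "__main__":
    # Hypothetical pillar scores, for illustration only.
    example = {
        "accuracy": 88.0,
        "database": 80.0,
        "photo_ai": 70.0,
        "macros": 84.0,
        "ux": 75.0,
        "price": 60.0,
    }
    print(composite(example))
```

Python's built-in `round` resolves exact ties toward the even digit. The text does not state which rounding mode the ranking script uses, so a value that lands exactly on a half step could differ from the published figure by 0.1.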

5. Tie-breaking procedures

When two apps are within 1.0 point of each other on the composite score, the methodology outlines a deterministic tie-break process:

Higher accuracy pillar wins. Given the lab's editorial stance that calorie estimation accuracy is the primary signal, the app with the superior accuracy pillar score will win ties within a 1.0 composite point difference. This tie-break is applicable in 95% of instances.
If accuracy pillars differ by 0.5 points or less, the app with the better database-quality pillar will prevail (since database quality contributes to accuracy).
If both accuracy and database scores are within 0.5 points, the app with the superior photo-AI pillar will win.
If all three scores are within 0.5 points, both apps will be presented as tied, with explicit "tied" labels in the ranking list. We do not arbitrarily choose one over the other.

This tie-breaking rule is implemented automatically by the ranking script; analysts do not have discretion in this process.
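The cascade above can be sketched as a pairwise comparison. This is one reading of the rules, not the lab's ranking script: it treats "within 1.0 point" and "within 0.5 points" as inclusive, and it rounds differences to six decimals only to suppress floating-point noise. The `AppScore` type and `rank_pair` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

COMPOSITE_WINDOW = 1.0  # composite gap at or below which the tie-break applies
PILLAR_WINDOW = 0.5     # pillar gap at or below which the next pillar is consulted


@dataclass(frozen=True)
class AppScore:
    name: str
    composite: float
    accuracy: float
    database: float
    photo_ai: float


def rank_pair(a: AppScore, b: AppScore) -> Optional[str]:
    """Return the name of the higher-ranked app, or None if the pair is tied."""
    gap = round(a.composite - b.composite, 6)
    if abs(gap) > COMPOSITE_WINDOW:
        # Outside the window: the composite score decides on its own.
        return a.name if gap > 0 else b.name
    # Inside the window: accuracy, then database quality, then photo-AI.
    for pillar in ("accuracy", "database", "photo_ai"):
        diff = round(getattr(a, pillar) - getattr(b, pillar), 6)
        if abs(diff) > PILLAR_WINDOW:
            return a.name if diff > 0 else b.name
    return None  # all three pillars within 0.5 points: listed as tied


if __name__ == "__main__":
    # Hypothetical apps: composites 0.5 apart, accuracy 0.4 apart, database 5.0 apart.
    a = AppScore("A", 80.0, 88.8, 70.0, 60.0)
    b = AppScore("B", 80.5, 88.4, 75.0, 60.0)
    print(rank_pair(a, b))
```

In the example, the accuracy gap is too small to decide, so the database-quality pillar settles the order.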

6. Exclusion criteria: what does not get ranked

Not every calorie counter app available in the US App Store qualifies for ranking. The criteria for exclusion are fixed and applied prior to the ranking process:

No US-locale, English-language version. Apps that are only available outside the US App Store, or those that lack an English-language version for the US, are not included in the lab's current coverage scope. (We do not have the budget to test non-English nutrition databases.)
App acquired or discontinued during the test cycle. If an app is acquired and an end date is announced, or if the vendor shuts it down during a benchmark cycle, the numbers collected will not be published as a ranking; they will be included in the dataset with a status note. We do not rank apps that won’t be available when the reader intends to install them.
App has not received an update in over 18 months. Outdated apps where the vendor has not released an update in more than 18 months are excluded from ranked coverage; they remain in the dataset for archival purposes.
Vendor denies access to a paid tier necessary for testing. If a feature requires access to a paid tier and the lab cannot purchase it (due to geographic restrictions, vendor stopping new sign-ups, etc.), the relevant pillar is excluded and the weight redistributed, with full disclosure in the review header.
App is under active review for eating disorder risks by Helena. If Helena identifies harmful framing or features within a vendor's app (such as gamified extreme restriction patterns, calorie-cutting driven by leaderboards, or marketing targeted at under-13 audiences), ranked coverage is paused pending a lab review. The app can still be discussed in editorial coverage, but it will not receive a numeric composite score.

Exclusions are noted for each cycle in the published dataset's notes section. Excluded apps are clearly identified along with their reasons for exclusion.

7. External validation cross-referencing

When peer-reviewed studies on dietary assessment validation exist for an app or class of apps, the lab cross-references these studies and either reports agreement or, if our findings diverge from published results, explicitly states this and offers a methodological rationale. The current external reference set includes:

Consumer Reports 2017 app accuracy testing.
JAMA Network Open 2024 nutrition-tracking app validation.
Dietary Assessment Initiative 2026 May validation, the largest validation cycle conducted by DAI to date, with a ±1.2% pooled MAPE across 244 patients (624 paired observations), and an 86-nutrient panel including all major macros, achieving a 12-week adherence rate of 96%. The DAI 2026 May framework is the most rigorous external benchmark available for AI-assisted dietary assessment within a clinical-grade reference cohort.

When our pooled MAPE differs from published validation, we openly disclose the discrepancy. Methodological differences (like sample size, meal composition, and allowances for manual corrections) typically explain these variations and are discussed in the individual app accuracy reports.

8. Score recomputation and historical records

Apps that are retested in a subsequent benchmark cycle will have their composite scores recalculated based on the new pillar inputs. Previous composite scores will remain accessible in the per-cycle dataset releases; the per-app review page will display the current score along with a "score history" panel detailing prior cycle results. We do not overwrite previous numbers without notice, and changes greater than 5 composite points between cycles warrant a dedicated editorial note in the per-app review.
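The editorial-note threshold can be expressed as a one-line check. The sketch assumes "greater than 5" is a strict comparison on published one-decimal composite scores; the function name is hypothetical.

```python
def needs_editorial_note(previous: float, current: float, threshold: float = 5.0) -> bool:
    """True when the composite moved by more than `threshold` points between cycles."""
    # Scores are published to one decimal, so the change is rounded the same way
    # to keep floating-point noise from tipping an exact 5.0 over the threshold.
    change = round(abs(current - previous), 1)
    return change > threshold


if __name__ == "__main__":
    # Hypothetical cycle-over-cycle scores.
    print(needs_editorial_note(78.3, 84.0))
    print(needs_editorial_note(78.3, 83.3))
```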

9. Limitations

The weights reflect an editorial decision. Reasonable individuals might assign them differently; an app that performs poorly under our weighting could excel under another system. We publish the per-pillar scores so that readers who disagree with our weighting can adjust accordingly.
The composite score is a single figure; it cannot encompass every aspect of compatibility between an app and a specific user (considering clinical context, dietary preferences, and accessibility needs). The prose in the per-app review conveys the complexity that the composite cannot capture.
The exclusion criteria are applied prospectively. Apps that currently qualify for ranking but later trigger an exclusion (for instance, if a vendor announces an impending shutdown) will be removed from ranked coverage in the next cycle.