Fixing the calibration gap in disease models
Emmanuelle A. Dankwa led a review showing that calibration reporting across 419 disease models is uneven, with implementation code shared for only 20% of them.
Mathematical models that simulate how infections spread are a cornerstone of public health planning. These “transmission-dynamic” models help researchers and policymakers estimate the impact of interventions like vaccines, treatment programs, or screening campaigns. But those models depend on many input numbers — parameters — and getting those numbers right is a process called calibration: adjusting parameters so the model matches observed data. Emmanuelle A. Dankwa and colleagues point out that when calibration is done poorly or described incompletely, model conclusions can be wrong, and other scientists cannot reproduce results. To get a clear picture of current practice and make reporting more consistent, the team developed a 15-item framework for describing how calibration was done. They then used that framework in a scoping review of published transmission-dynamic models for tuberculosis, HIV and malaria. The review covered peer-reviewed studies published between January 1, 2018 and January 16, 2024, identified by searches of relevant databases and websites. In total the authors found 411 eligible studies representing 419 calibrated models, and used their framework to record how calibration steps were chosen and reported.
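To make the idea of calibration concrete, here is a minimal sketch in Python of what "adjusting parameters so the model matches observed data" can look like for a simple compartmental model. The SIR equations, the weekly prevalence values and the fixed recovery rate below are invented for illustration; they do not come from any study in the review, which covered a wide range of models and fitting approaches.

```python
# Minimal sketch of calibration: tune a transmission rate so a toy SIR model
# reproduces observed case data. Illustrative only; the data and parameter
# values are invented and do not reflect any study in the review.
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import minimize_scalar

def sir(y, t, beta, gamma):
    """Basic compartmental (SIR) equations for susceptible, infectious, recovered."""
    s, i, r = y
    return [-beta * s * i, beta * s * i - gamma * i, gamma * i]

# Hypothetical observed prevalence (fraction infectious) at weekly time points.
t_obs = np.arange(0, 10)
i_obs = np.array([0.01, 0.02, 0.05, 0.09, 0.14, 0.17, 0.16, 0.13, 0.09, 0.06])

def loss(beta, gamma=0.5, y0=(0.99, 0.01, 0.0)):
    """Sum of squared differences between simulated and observed prevalence."""
    sim = odeint(sir, y0, t_obs, args=(beta, gamma))
    return np.sum((sim[:, 1] - i_obs) ** 2)

# Calibration step: search for the transmission rate that best matches the data.
fit = minimize_scalar(loss, bounds=(0.1, 3.0), method="bounded")
print(f"calibrated beta = {fit.x:.2f}, squared error = {fit.fun:.4f}")
```

Real calibrations are usually more involved, tuning several parameters at once and quantifying uncertainty rather than returning a single best value, which is exactly why clear reporting of the procedure matters.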
The review revealed clear patterns in the kinds of models used and how they were calibrated. Most models were compartmental (74%, 309 of 419), while 20% (82) were individual-based models (IBMs). The most common reason researchers calibrated a parameter was that its value was unknown or ambiguous (40%, 168 of 419); another common reason (20%, 85) was that the parameter value itself was directly relevant to the scientific question, beyond simply making the model run. Most studies used their models to evaluate interventions (71%, 298 models). The team also found that the choice of calibration method depended on model structure and on whether the model included randomness: method choice was significantly associated with model structure (p < 0.001) and with stochasticity (p = 0.006). Approximate Bayesian computation was used more often with IBMs, while Markov chain Monte Carlo methods were more common for compartmental models. When the authors checked how completely studies reported calibration against their 15-item checklist, only 4% (18 models) included all items; 66% (277) reported 11–14 items; and 28% (124) reported 10 or fewer. Implementation code was the least commonly shared element, available for only 20% (82) of models.
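As an illustration of why simulation-based methods suit stochastic and individual-based models, here is a minimal sketch of approximate Bayesian computation by rejection: the model is only simulated, never written down as a likelihood. The toy outbreak simulator, the prior range and the tolerance are invented for illustration and do not reflect any specific study in the review.

```python
# Minimal sketch of approximate Bayesian computation (ABC) by rejection, the
# kind of likelihood-free calibration the review found more often with
# individual-based and stochastic models. Toy simulator, prior and tolerance
# are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
target_cases = 120  # hypothetical observed outbreak size

def simulate_outbreak(beta, n_people=500, n_days=30):
    """Toy stochastic simulator: each day, new cases ~ Poisson(beta * current cases)."""
    cases, total = 5, 5
    for _ in range(n_days):
        new = rng.poisson(beta * cases)
        cases = min(new, n_people - total)  # cannot infect more people than remain
        total += cases
    return total

# ABC rejection: draw a parameter from the prior, simulate, and keep the draw
# only if the simulated summary lands close enough to the observed one.
accepted = []
for _ in range(20_000):
    beta = rng.uniform(0.1, 2.0)                              # prior draw
    if abs(simulate_outbreak(beta) - target_cases) <= 10:     # tolerance check
        accepted.append(beta)

print(f"accepted {len(accepted)} draws; posterior mean beta = {np.mean(accepted):.2f}")
```

Because the method needs nothing but the ability to run the model forward, it works even when, as in many IBMs, the probability of the data under the model cannot be computed directly.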
These findings have practical consequences. Heterogeneous and incomplete reporting makes it harder for other scientists, reviewers or decision makers to judge whether model conclusions are robust. That matters particularly because most models reviewed were aimed at guiding intervention choices, where errors could affect real-world policy on TB, HIV and malaria. The 15-item framework proposed by Dankwa and colleagues offers a structured way to report calibration steps so that readers can understand what was tuned, why, and how well the tuned model matches data. Better reporting — and more routine sharing of implementation code — would make model results easier to reproduce, compare and update. The authors suggest that adopting a standardized reporting checklist could increase the credibility of modeling studies and support more informed, transparent decisions in infectious disease control.
Standardized reporting of calibration will help scientists reproduce and scrutinize model-based results that inform public health. Wider sharing of code and full calibration details could lead to clearer, more reliable guidance for TB, HIV and malaria policies.
Author: Emmanuelle A. Dankwa