AI chest X-ray screening fails differently across countries
Ibrahim Shuaibu reports that a DenseNet-121 TB screening model showed opposite failure modes when tested across China, the USA and India.
Artificial intelligence promises faster tuberculosis (TB) screening from chest X-rays, but real-world deployment can stumble when a model trained in one place sees images from another. In a multinational external validation study led by Ibrahim Shuaibu, researchers tested how geographic domain shift (differences in population and image characteristics between where a model is trained and where it is used) affects safety and reliability. The team trained a convolutional neural network on a single public source, the Shenzhen chest X-ray dataset (China; total n=662), and asked whether that single-source model would still perform well on very different patient groups. The work echoes World Health Organization (WHO) guidance that screening tools must maintain high sensitivity so that infectious cases are not missed. Rather than reporting only internal success, the study deliberately moved the trained model to two external settings to stress-test its behavior and reveal the kinds of errors that arise when a model is deployed beyond its original domain.
The researchers used a DenseNet-121 model with transfer learning, trained in two stages: head training followed by fine-tuning. To avoid anatomically implausible image changes, they excluded horizontal flipping from augmentation (mirroring a chest X-ray places the heart on the wrong side). Evaluation covered an internal test set from China, an external balanced cohort from Montgomery County (USA; n=138), and an external TB-positive cohort from India (n=155). The India dataset served as a sensitivity stress test and contained no negative controls, so specificity and ROC-AUC were not computed for that cohort. Internal validation produced an Area Under the Curve (AUC) of 0.889 and an accuracy of 85.6%. External testing revealed divergent failure modes: on the USA cohort, sensitivity was high at 94.8% but specificity dropped to 43.7%, indicating many false positives; on the India TB-only cohort, sensitivity collapsed to 52.3%, meaning 47.7% of confirmed TB cases were missed under domain shift. The team also used Grad-CAM to visualize which image regions the network attended to when making its predictions.
These results show that geographic domain shift does not produce a single predictable failure pattern; it can trigger opposite problems depending on the setting. In a low-burden, balanced external cohort the model over-called disease, flagging many healthy cases as positive and producing a false-positive surge; in a high-burden, TB-only setting the same model missed nearly half of confirmed cases. For screening tools, where the WHO emphasizes avoiding missed infectious cases, a sensitivity collapse of this magnitude is clinically important. The study highlights the safety risks of deploying a single-source AI screening tool without local validation and calibration: a model that looks strong in internal testing (AUC 0.889, accuracy 85.6%) can behave very differently elsewhere. The authors argue for routine external validation, local calibration, and caution before using such models for public health screening across diverse populations.
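The two failure modes follow directly from the standard definitions of sensitivity (true positives among all TB cases) and specificity (true negatives among all healthy cases). The confusion-matrix counts below are hypothetical, chosen only to mimic the shape of the reported external results; they are not the study's actual counts.

```python
# Sensitivity and specificity from 2x2 confusion-matrix counts.
# All counts below are hypothetical illustrations, not study data.

def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true TB cases the model flags (recall on positives)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Fraction of healthy cases the model correctly clears."""
    return tn / (tn + fp)

# Failure mode 1: almost no missed cases, but many false alarms.
print(f"sens={sensitivity(tp=95, fn=5):.2f}, spec={specificity(tn=44, fp=56):.2f}")
# -> sens=0.95, spec=0.44

# Failure mode 2: sensitivity collapse, nearly half of TB cases missed.
print(f"sens={sensitivity(tp=52, fn=48):.2f}")
# -> sens=0.52
```

Because a TB-only cohort has no true negatives, specificity is undefined there, which is why the study reports only sensitivity for the India dataset.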
Deploying a TB screening AI trained on one population can either flood clinicians with false positives or miss many true cases, depending on where it is used. Local validation and recalibration are essential before clinical deployment to avoid these safety risks.
Author: Ibrahim Shuaibu