Machine learning predicts TB risk in HIV patients in Ethiopia
Getaye Tizazu Biwota led a study showing that XGBoost, a machine learning algorithm, can predict tuberculosis in HIV patients from routine clinical data.
Tuberculosis (TB) is the most common comorbidity among people living with HIV/AIDS, and early diagnosis is vital to prevent deaths and complications. A team led by Getaye Tizazu Biwota analyzed medical records from Debre Markos, Ethiopia, to see whether computer models could spot which adults on antiretroviral therapy (ART) were most likely to develop TB. The researchers worked with a retrospective dataset of 5,392 HIV-infected individuals, looking at routine clinical and demographic information. Most records came from women—3,440 (63.8%)—and the data included a mix of address classifications, with 3,715 (68.9%) labeled as green and 1,677 (31.1%) labeled as yellow. The study focused on early identification of TB in people with HIV so that clinicians can act sooner to treat TB or prevent its complications. Instead of relying only on traditional rules or individual tests, the team tested several machine learning approaches to see whether patterns in existing data could predict TB incidence and help prioritize patients for further testing or preventive treatment.
Because TB cases are less common than non-TB cases in the data, the researchers used imbalance-correction techniques including SMOTE and ADASYN. They evaluated multiple algorithms: random forest, decision tree, logistic regression, gradient boosting, k-nearest neighbors, and XGBoost. After additional class balancing with SMOTE+ENN, the XGBoost algorithm emerged as the top-performing model, achieving about 82% accuracy and a 90% AUC in predicting TB incidence. The study also identified which clinical features mattered most to the model: CD4 count and patient age were the strongest predictors. Other important factors included duration on ART, weight, sex, WHO clinical stage, address status, DSD category, and TB preventive treatment (TPT) status. By comparing these algorithms on the same 5,392-patient dataset, the work showed that tree-based boosting with careful handling of class imbalance gave the best predictive performance for TB among adults on ART.
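To make the imbalance-correction step concrete, here is a minimal sketch of the core idea behind SMOTE: new minority-class (TB) records are synthesized by interpolating between an existing minority record and one of its nearest minority neighbors. This is not the study's code. The function name `smote`, its parameters, and the toy CD4/age values are all assumptions chosen for illustration, and the sketch uses only the Python standard library. In practice, a library such as imbalanced-learn provides tested implementations of SMOTE, ADASYN, and SMOTE+ENN.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic rows interpolated between minority-class rows.

    minority: list of equal-length numeric feature lists (at least 2 rows).
    Hypothetical helper for illustration, not the study's implementation.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find the k nearest minority neighbors of the base row (excluding itself).
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbor = rng.choice(others[:k])
        # Place the new row at a random point on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy rows: [CD4 count, age]. Values are illustrative only.
tb_cases = [[120.0, 42.0], [95.0, 38.0], [150.0, 51.0], [80.0, 35.0]]
non_tb_count = 10  # assumed size of the majority class in this toy example

new_rows = smote(tb_cases, n_new=non_tb_count - len(tb_cases))
balanced_tb = tb_cases + new_rows
print(len(balanced_tb))  # 10: the minority class now matches the majority size
```

Because every synthetic row lies between two real TB records, the added rows stay inside the range of observed values. Oversampling of this kind is normally applied to the training split only, so that evaluation still reflects the real class balance.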
The findings suggest that machine learning tools like XGBoost could be practical aids in clinical settings with high TB-HIV burden. Using routine data already recorded in clinics (CD4 counts, age, time on ART, weight, sex, and WHO clinical stage), models can flag patients at higher risk so that health workers can prioritize diagnostic testing, start TB preventive treatment, or increase monitoring. In resource-limited environments such as parts of Ethiopia, these predictive models may help direct scarce diagnostic resources to those most likely to have TB, potentially reducing delays in diagnosis and related deaths. The study also highlights the importance of addressing data imbalance and choosing appropriate algorithms; when techniques like SMOTE, ADASYN, and SMOTE+ENN are applied, predictive power improves. Overall, this work points toward practical, data-driven support for earlier intervention in TB-HIV co-infection, while relying on existing clinical information rather than new tests.
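The link between a model's AUC and patient prioritization can be illustrated with a short sketch. AUC is the probability that a randomly chosen patient who develops TB receives a higher risk score than a randomly chosen patient who does not, which is why a high AUC makes a ranked list useful for triage. The patient identifiers, risk scores, and outcomes below are hypothetical, and the helper names `auc` and `flag_top` are assumptions for illustration, not part of the study.

```python
def auc(scores, labels):
    """Share of (TB, non-TB) pairs where the TB case scores higher; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_top(patient_ids, scores, capacity):
    """Return the `capacity` patient ids with the highest predicted risk."""
    ranked = sorted(zip(patient_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [pid for pid, _ in ranked[:capacity]]


# Hypothetical model outputs for six patients (1 = later developed TB).
ids = ["p1", "p2", "p3", "p4", "p5", "p6"]
risk = [0.91, 0.15, 0.62, 0.40, 0.78, 0.08]
outcome = [1, 0, 1, 0, 0, 0]

print(auc(risk, outcome))                # 0.875
print(flag_top(ids, risk, capacity=2))   # ['p1', 'p5']
```

In this toy example one non-TB patient (p5) outranks a TB patient (p3), so a clinic with capacity to test only two patients would miss p3. The choice of capacity or risk threshold is therefore a clinical and resource decision, not something the model settles on its own.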
Health clinics could use the XGBoost model to identify HIV patients at higher TB risk and prioritize them for testing or TPT. This targeted approach may reduce missed TB cases and save resources in settings with high TB-HIV co-infection.
Author: Desalegn Meseret Tadele