Linking lab data reveals hidden infection insights in UK Biobank
Daniel J. Wilson and colleagues show that linking SGSS laboratory data to UK Biobank yields far more infection records and genetic findings than HES hospital data, boosting the power of infection research.
Infections matter not only as causes of contagious illness but also as exposures that can influence long-term, non-communicable disease risk. Large research cohorts such as UK Biobank are designed to study those risks, but the power of that work depends on the quality and completeness of the health records researchers can access. Daniel J. Wilson and colleagues set out to assess whether connecting microbiology laboratory data to UK Biobank would add value beyond the hospital data already available. Specifically, they evaluated the benefit of linking UK Biobank to the UK Health Security Agency’s Second Generation Surveillance System (SGSS), a national repository of microbiology results collected from roughly 200 microbiology laboratories across England. The team had previously described algorithms to link SGSS to UK Biobank and shown that those linkages were useful during the COVID-19 pandemic. In this study they compared SGSS-derived infection information with the Hospital Episode Statistics (HES) clinical dataset that UK Biobank users already rely on, asking whether the laboratory-based SGSS records capture more infections and support stronger genetic analyses.
To test the added value, the researchers compared infection records and genetic analysis results drawn from SGSS against those from HES. They used genome-wide association studies (GWAS) to evaluate how well microbiological diagnoses from SGSS and diagnostic codes from HES identified infection outcomes that could be linked to genetic variation. SGSS, which supports surveillance, outbreak detection, and antimicrobial resistance monitoring, contained far more infection records by participant than HES (82,888 versus 18,054, roughly 4.6 times as many). The gain was most pronounced for bacterial infections, with two exceptions noted in the abstract: Helicobacter pylori and Mycobacterium tuberculosis. When the team ran GWAS, SGSS-derived data produced more genetic association hits (31 versus 12) and covered a wider range of pathogens (12 versus 8). These differences indicate that SGSS supplies richer, more granular microbiological information than HES diagnostic coding for many research purposes.
The findings have clear implications for researchers who use UK Biobank to study infection and its long-term health consequences. By showing that SGSS adds substantial scientific value beyond HES, the study supports integrating SGSS linkages into UK Biobank as a routine resource for future infection research. That integration would give investigators more complete infection histories, improve their ability to discover genetic associations with specific pathogens, and strengthen studies of how infections act as exposures that influence non-communicable diseases. The richer microbiology data could also bolster public health work tied to the cohort, reinforcing the surveillance, outbreak detection, and antimicrobial resistance monitoring activities that depend on laboratory-confirmed diagnoses. Overall, the analysis led by Daniel J. Wilson argues that linked laboratory data can substantially expand the research potential of a large population cohort like UK Biobank.
Linking SGSS to UK Biobank would let scientists identify more infections and discover more genetic associations, improving research into infection-related disease risks. Public health agencies could use the richer linked data to strengthen surveillance, outbreak detection, and antimicrobial resistance monitoring.
Author: Shang‐Kuan Lin