Hidden small proteins in tuberculosis revealed by proteogenomics
Christian H Ahrens and colleagues used long-read assemblies and proteogenomics to uncover previously missed small proteins and coding changes in Mycobacterium tuberculosis.
Assembling bacterial genomes may seem routine, but short-read sequencing can still leave important regions missing or misassembled. The problem is acute for Mycobacterium tuberculosis, where standard annotations often overlook small ORF-encoded proteins (SEPs; ≤100 amino acids) that can play outsized biological roles. To address these blind spots, Christian H Ahrens and co-authors produced complete long-read assemblies for six clinical reference strains drawn from lineage 1 and the more pathogenic lineage 2. These finished genomes gave a clearer view of genes that are absent or misassembled in short-read references and provided a solid foundation for downstream functional work. The team combined comparative genomic and proteogenomic analyses to search for proteins missing from existing annotations. They also developed software to predict comprehensive sets of the mycobacteria-specific PE and PPE family genes, including lineage-specific variants, so that family members that vary between strains would not be missed. By building this resource around complete genomes from clinically relevant lineages, the researchers aimed to uncover small but potentially important proteins hidden in genomic blind spots.
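To make the SEP size cutoff concrete, the following is a minimal sketch of how candidate small ORFs (≤100 amino acids) could be enumerated from a finished genome sequence. It is an illustration only, not the authors' pipeline: the function name `small_orfs`, the `min_aa` lower bound, and the choice of ATG, GTG and TTG as start codons (the usual bacterial set) are assumptions made for this example.

```python
# Illustrative sketch: list candidate small ORFs on both strands of a DNA sequence.
# Names, thresholds and the start-codon set are assumptions, not the study's method.
START_CODONS = {"ATG", "GTG", "TTG"}  # common bacterial start codons (assumed here)
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def small_orfs(seq: str, min_aa: int = 10, max_aa: int = 100):
    """Return (strand, start, end, length_aa) for ORFs within the size range.

    Coordinates are 0-based, end-exclusive, and relative to the strand that
    was scanned (for "-" they refer to the reverse-complemented sequence).
    The first start codon after a stop is used, so each stop codon yields
    at most one (the longest) ORF per reading frame.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # open a new ORF at the first in-frame start
                elif start is not None and codon in STOP_CODONS:
                    length_aa = (i - start) // 3  # residues, stop excluded
                    if min_aa <= length_aa <= max_aa:
                        results.append((strand, start, i + 3, length_aa))
                    start = None  # close the ORF and look for the next start
    return results
```

A scan like this produces far more candidates than are actually translated, which is why predicted small ORFs need experimental support such as peptide evidence before they are treated as genes.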
The study combined complete long-read genome assemblies with mass spectrometry and stringent controls on discovery. Using parallel accumulation–serial fragmentation mass spectrometry on unfractionated cell extracts, the authors detected approximately two-thirds of each strain’s annotated proteome. They extended their proteogenomic framework across related strains and used entrapment strategies to control false discovery rates, reducing false positives among new protein calls. This revealed 12–24 previously unannotated proteins per strain, predominantly SEPs, along with 56–60 alternative translation start sites and 9–17 expressed pseudogenes. Newly identified proteins included conserved and lineage-specific SEPs, an antitoxin, candidate antimicrobial peptides, and novel proteins showing evidence of purifying selection. The work also supplied tools for predicting PE and PPE family genes, including lineage-specific variants, so that members of these important mycobacterial families are catalogued more comprehensively.
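The entrapment idea mentioned above can be illustrated with a short calculation. In an entrapment experiment, sequences known to be absent from the sample are added to the search database, and any accepted match to them must be false. The sketch below uses one commonly described estimator that scales the entrapment hits by the relative database sizes. The function name, its parameters and the choice of estimator are assumptions for illustration, and the article does not state which formula the authors applied.

```python
# Illustrative sketch of an entrapment-based error estimate.
# The estimator and all names are assumptions, not taken from the study.
def entrapment_fdp(n_target: int, n_entrapment: int, ratio: float = 1.0) -> float:
    """Estimate the false discovery proportion among accepted identifications.

    n_target:     accepted hits that match the real (target) database
    n_entrapment: accepted hits that match the entrapment sequences
    ratio:        entrapment database size divided by target database size
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if n_target < 0 or n_entrapment < 0:
        raise ValueError("counts must be non-negative")
    total = n_target + n_entrapment
    if total == 0:
        return 0.0  # nothing accepted, so nothing can be false
    # Each observed entrapment hit implies roughly 1/ratio unseen false hits
    # hiding among the target matches, hence the (1 + 1/ratio) scaling.
    estimate = n_entrapment * (1.0 + 1.0 / ratio) / total
    return min(1.0, estimate)
```

For example, with an entrapment database the same size as the target database, 10 entrapment hits among 1,000 accepted identifications would suggest an error proportion of about 2%. If that figure is well above the rate the search software reports, the reported rate is probably too optimistic.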
These findings show why complete genomes and proteogenomics matter for infectious disease research. By filling gaps left by short-read assemblies, the approach reveals small proteins and alternative coding events that standard annotation pipelines miss. SEPs, expressed pseudogenes, antitoxins, and candidate antimicrobial peptides are the kinds of molecules that can inform biology, suggest biomarkers, or point to new therapeutic leads. Applying these methods to phylogenomically selected clinical reference strains, rather than to a single lab-adapted reference, helps capture lineage-specific variation that could affect pathogenicity or diagnostic recognition. For a WHO-listed critical bacterial pathogen, 12–24 new proteins per strain and dozens of alternative start sites represent a substantial expansion of the known proteome and supply concrete candidates to evaluate as diagnostics or therapeutics. Overall, proteogenomics paired with complete long-read assemblies offers a practical path to uncovering hidden components of the tuberculosis proteome.
More complete genome assemblies and proteogenomic validation can uncover new diagnostic markers and candidate therapeutic targets for tuberculosis. This approach may accelerate development of candidate diagnostics or therapeutics against a WHO-listed critical bacterial pathogen.
Author: Benjamin Heiniger