PAPER 23 Apr 2025 Global

Faster, scalable TB genome analysis with MTBseq-nf wrapper

Abhinav Sharma and collaborators developed MTBseq-nf, a Nextflow wrapper that runs MTBseq at least twice as fast on datasets of more than 20 samples while improving scalability and reproducibility.

Tuberculosis researchers have relied on whole-genome sequencing (WGS) to study Mycobacterium tuberculosis for years, but turning raw sequencing data into useful results requires reliable bioinformatics pipelines. The original MTBseq pipeline, published in 2018 and made publicly available on GitHub, was the first tool to offer a full WGS analysis for M. tuberculosis: quality control, read mapping, variant calling, lineage classification, drug resistance prediction, and phylogenetic inference. However, that architecture was built as a largely linear, batched system and is not well suited to modern high-performance computing (HPC) clusters or cloud environments, where very large datasets are common. To address that gap, Abhinav Sharma and collaborators created MTBseq-nf, a Nextflow wrapper around the existing MTBseq pipeline. The goal was straightforward: keep the trusted analysis steps of MTBseq but reorganize how they are executed, so that the pipeline runs faster and more efficiently on larger, distributed computing setups while remaining user-friendly.

MTBseq-nf changes how steps are run: the same analysis step executes in parallel across samples, rather than samples being processed in a single linear batch as in the TBfull step of the MTBseq pipeline. This parallelization lets the workflow make full use of available CPU and cluster resources. To test scalability and reproducibility, the team used 90 M. tuberculosis genomes (European Nucleotide Archive, ENA, accession PRJEB7727) and ran benchmarks on a dedicated computational server. In these experiments, MTBseq-nf in parallel analysis mode ran at least twice as fast as the standard MTBseq pipeline on datasets of more than 20 samples. The wrapper also supports reproducibility and platform independence by integrating with nf-core, bioconda, and biocontainers. According to the authors, MTBseq-nf is user-friendly, optimized for hardware efficiency, scalable to larger datasets, and more reproducible than the original pipeline.
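The difference between a linear batch and per-sample parallel execution can be illustrated with a minimal Python sketch. This is not the MTBseq-nf implementation, which relies on Nextflow to schedule jobs; the function `analyze_sample`, the sample names, and the worker count are hypothetical placeholders, and the per-sample work is simulated with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def analyze_sample(sample_id: str) -> dict:
    """Hypothetical stand-in for one per-sample analysis step."""
    time.sleep(0.05)  # simulated work; a real step would run an external tool
    return {"sample": sample_id, "status": "done"}


def run_batched(sample_ids):
    """Process samples one after another, as in a single linear batch."""
    return [analyze_sample(s) for s in sample_ids]


def run_parallel(sample_ids, max_workers=4):
    """Run the same step for several samples at once.

    pool.map returns results in input order, so the output matches
    the batched run exactly; only the wall-clock time differs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_sample, sample_ids))


# Hypothetical sample identifiers, for illustration only.
samples = [f"sample_{i:02d}" for i in range(8)]

start = time.perf_counter()
batched = run_batched(samples)
t_batched = time.perf_counter() - start

start = time.perf_counter()
parallel = run_parallel(samples)
t_parallel = time.perf_counter() - start

print(f"batched:  {t_batched:.2f} s")
print(f"parallel: {t_parallel:.2f} s")
print("identical results:", batched == parallel)
```

Both functions return the same results in the same order, which mirrors the design goal described above: the analysis steps stay unchanged and only their scheduling differs. The timings in this toy example depend on the simulated workload and say nothing about the benchmark figures reported by the authors.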

The practical implication of MTBseq-nf is that labs working on tuberculosis genomics can analyze larger collections of genomes more quickly without having to redesign their analysis steps. By using Nextflow as a wrapper, MTBseq-nf adapts the established MTBseq analysis to environments such as HPC clusters and cloud services, where parallel jobs and containerized tools are the norm. The increased speed, especially for projects with more than 20 samples, means researchers can iterate on analyses faster, pursue larger surveillance or research studies, and better exploit modern compute hardware. Built-in support for nf-core, bioconda, and biocontainers also makes it easier for different groups to reproduce results and to run the same pipeline across different platforms. This lowers technical barriers and helps teams share validated workflows for WGS-based lineage classification, drug resistance prediction, and phylogenetic inference in Mycobacterium tuberculosis.

Public Health Impact

MTBseq-nf lets research groups process large tuberculosis WGS datasets faster and more efficiently, making better use of HPC and cloud resources. Improved reproducibility through nf-core, bioconda, and biocontainers helps labs share and rerun analyses across platforms.

Tuberculosis genomics
MTBseq-nf
Nextflow
Whole-genome sequencing
Bioinformatics reproducibility
Featured Experts

Author: Abhinav Sharma
