New error model improves low-frequency variant calls in M. tuberculosis
Shandukani Mulaudzi and colleagues developed a low-frequency error model that, paired with FreeBayes, removes 49% of false variants while excluding less than 1% of true variants.
Detecting minority variants within a single sample of M. tuberculosis is a technical challenge for researchers who use short-read sequencing. These low-frequency variants can indicate the presence of more than one strain in a sample, but sequencing and analytical errors make it hard to separate real signals from noise. To address this, a team led by corresponding author Shandukani Mulaudzi carried out a benchmarking study of within-sample minority variant detection using short-read sequencing. The researchers compared tools and approaches for calling low-frequency variants, paying close attention to how read mapping, base and mapping quality metrics, and masking strategies affect results. Based on this analysis, they developed a new low-frequency error model designed to filter the output of the best-performing tool. They also tested their approach on strain mixtures, both to substantiate the ranking of tools and to check that the benchmarks reflected realistic scenarios in which multiple strains might be present. The work is presented as evidence to guide choices about tool selection, masking, and filtering when working with low-frequency variants in M. tuberculosis.
The study evaluated candidate variant calling tools and found a top performer whose output could still contain false low-frequency variant calls. To improve specificity, Shandukani Mulaudzi and colleagues created a filtering strategy built as a low-frequency error model that leverages read mapping and quality metrics to flag probable errors. They paired this new model with FreeBayes and measured its effect on variant calls. When applied to FreeBayes output, the model excluded 49% of false variants while excluding less than 1% of true variants, demonstrating a substantial reduction in false positives with minimal loss of true signals. The authors also used strain mixtures to substantiate their ranking of tools, showing that their benchmarking reflected situations where multiple strains coexist in a sample. In addition to the model itself, their analysis highlights the importance of masking and filtering steps alongside careful tool choice when calling low-frequency variants from short-read sequencing data.
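The two percentages above measure different things: the share of false calls a filter removes, and the share of true calls it removes along the way. The sketch below shows one way such figures could be computed from a set of calls before filtering, a set after filtering, and a truth set. It is a minimal illustration, not the authors' evaluation code; the function name `filter_impact`, the (position, ref, alt) tuple representation, and the toy variants are all hypothetical.

```python
def filter_impact(calls_before, calls_after, truth):
    """Return (fraction of false calls removed, fraction of true calls removed).

    Each argument is an iterable of hashable variant keys; here a variant is
    assumed to be a (position, ref, alt) tuple. This is an illustrative choice.
    """
    before, after, truth = set(calls_before), set(calls_after), set(truth)

    true_before = before & truth    # true variants present before filtering
    false_before = before - truth   # false variants present before filtering
    removed = before - after        # everything the filter excluded

    # Guard against empty denominators so the function never divides by zero.
    false_removed = (
        len(removed & false_before) / len(false_before) if false_before else 0.0
    )
    true_removed = (
        len(removed & true_before) / len(true_before) if true_before else 0.0
    )
    return false_removed, true_removed


# Toy data (hypothetical): four true variants and four false calls.
truth = {(100, "A", "G"), (250, "C", "T"), (400, "G", "A"), (900, "T", "C")}
false_calls = {(120, "A", "C"), (300, "G", "T"), (512, "C", "A"), (777, "T", "G")}

calls_before = truth | false_calls
# Suppose a filter drops two of the false calls and none of the true ones.
calls_after = truth | {(120, "A", "C"), (512, "C", "A")}

false_removed, true_removed = filter_impact(calls_before, calls_after, truth)
print(f"false calls removed: {false_removed:.0%}")
print(f"true calls removed:  {true_removed:.0%}")
```

Reporting both fractions together matters because a filter can always remove more false calls by being more aggressive; the second number shows what that aggressiveness costs in true variants.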
The findings reported by Shandukani Mulaudzi and colleagues point to practical steps that groups working on M. tuberculosis sequencing can take to improve the reliability of low-frequency variant detection. By ranking tools and showing that an error model combined with FreeBayes can remove a large portion of false positives while preserving nearly all true variants, the study provides concrete guidance on how masking, filtering, and tool selection interact. The new low-frequency error model is presented as a resource for excluding false positive low-frequency variant calls from FreeBayes output, and the use of strain mixtures in benchmarking lends credibility to the recommendations. Overall, the work outlines best practices for analysts who need to call minority variants from short-read sequencing in M. tuberculosis, and it shows that better post-calling filtering can meaningfully reduce erroneous calls without sacrificing true variant detection.
This work supports better laboratory and bioinformatics practices for calling low-frequency variants in M. tuberculosis. By cutting false positives while retaining true signals, the new model can make short-read sequencing results more reliable for research and surveillance.
Author: Shandukani Mulaudzi