Researchers at MIT and the Institut Pasteur in France have developed a mathematical technique that allows whole genomes to be assembled on a personal computer within minutes.
This is notable because standard genome assembly techniques rely on high-powered, expensive computers and take about 100 times longer.
“We can quickly assemble entire genomes and metagenomes, including microbial genomes, on a modest laptop computer,” said Bonnie Berger, the Simons Professor of Mathematics at the Computer Science and AI Lab at MIT and an author of the study in the journal Cell Systems, in a press statement.
“This ability is essential in assessing changes in the gut microbiome linked to disease and bacterial infections, such as sepsis, so that we can more rapidly treat them and save lives.”
It is now two decades since the human genome was sequenced, and sequencing technology has advanced significantly since then. However, although sequencing reads are getting longer and more accurate, assembling whole genomes still takes considerable time and large amounts of computing power.
De Bruijn graphs are one of several methods bioinformaticians use when assembling genomes and genetic sequences, particularly from longer reads.
In this study, Berger and colleagues created what they call a ‘minimizer-space de Bruijn graph’ (mdBG) approach. It differs from current methods in that its basic units are short sequences of nucleotides called ‘minimizers’ rather than individual nucleotides.
“Our minimizer-space de Bruijn graphs store only a small fraction of the total nucleotides, while preserving the overall genome structure, enabling them to be orders of magnitude more efficient than classical de Bruijn graphs,” explained Berger.
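To make the idea concrete, here is a minimal Python sketch of a graph built over minimizers instead of single nucleotides. It is an illustration under stated assumptions, not the researchers' implementation: the function names, the hash function, and the parameters `l` (minimizer length), `k` (minimizers per node) and `density` (fraction of short sequences kept) are hypothetical choices made for this example, and the sketch ignores reverse complements and sequencing errors.

```python
import hashlib
from typing import Iterable, List, Set, Tuple

Node = Tuple[str, ...]  # one node = k consecutive minimizers


def lmer_hash(lmer: str) -> int:
    """Deterministic 64-bit hash of a short sequence (illustrative choice)."""
    digest = hashlib.blake2b(lmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def select_minimizers(seq: str, l: int, density: float) -> List[str]:
    """Keep, in order, every length-l substring whose hash is below a threshold.

    Assumption: a substring counts as a minimizer when its hash falls in the
    lowest `density` fraction of the 64-bit hash range.
    """
    threshold = int(density * 2**64)
    picked = []
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if lmer_hash(lmer) < threshold:
            picked.append(lmer)
    return picked


def build_mdbg(reads: Iterable[str], k: int, l: int,
               density: float) -> Tuple[Set[Node], Set[Tuple[Node, Node]]]:
    """Build a toy graph whose nodes are tuples of k consecutive minimizers.

    Edges link nodes that appear next to each other in a read, so the two
    nodes of every edge overlap by k - 1 minimizers.
    """
    nodes: Set[Node] = set()
    edges: Set[Tuple[Node, Node]] = set()
    for read in reads:
        mins = select_minimizers(read, l, density)
        kmms = [tuple(mins[i:i + k]) for i in range(len(mins) - k + 1)]
        nodes.update(kmms)
        edges.update(zip(kmms, kmms[1:]))
    return nodes, edges
```

Calling `build_mdbg(reads, k=3, l=2, density=1.0)` keeps every two-letter substring, so the graph is as large as it can be. Lowering `density` keeps fewer substrings and so shrinks the graph, which is the space saving Berger describes.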
The team tested their method on human sequence data and found that it was able to accurately assemble genomes. Using their mdBG approach, they assembled a human genome in under 10 minutes using 10 GB of RAM, something easily achievable on a normal laptop or desktop computer.
“In addition, we constructed a mdBG-based representation of 661,405 bacterial genomes, comprising 16 million nodes and 45 million edges, and successfully searched it for anti-microbial resistance genes in 12 minutes,” write the authors, showing how useful this technique could be for medical researchers.
The researchers used PacBio high-fidelity reads (1% error rate) in this study, and their algorithm is currently best adapted to these reads, although they hope to develop it further in future.
“We can also handle sequencing data with up to 4% error rates,” adds Berger. “With long-read sequencers with differing error rates rapidly dropping in price, this ability opens the door to the democratization of sequencing data analysis.”
For example, Oxford Nanopore ultra-long reads currently have error rates of 5-12%, but these are soon expected to drop to 4% or below.
“We envision reaching out to field scientists to help them develop fast genomic testing sites, going beyond PCR and marker arrays which might miss important differences between genomes,” says Berger.