Researchers at the UC Santa Cruz Genomics Institute have developed a new tool that efficiently maps individual sequencing reads to a pangenome reference representing thousands of human genomes, at a speed comparable to that of current standard methods that map sequencing data to a single reference genome.
Popular and inexpensive short-read DNA sequencing yields snippets of an individual’s genomic sequence that must be mapped to a reference genome sequence to determine their chromosomal locations and identify differences from the reference genome that may impact health. Exclusive reliance on a single linear reference sequence to identify genetic variations in genetically diverse human subpopulations inevitably introduces bias: the tendency to incorrectly map sequences that differ from the reference genome.
To counter this shortcoming of mapping to a single reference genome, the UC Santa Cruz team developed Giraffe, a pangenome short-read mapper that efficiently maps sequencing data to a collection of haplotypes threaded through a sequence graph, reducing bias and providing better genomic analyses. A single reference genome represents only one version of each genetic variation, leaving the other versions unrepresented. A pangenome reference instead uses mathematical graphs to represent the relationships between different sequences, so that diverse genomes can be combined into one representative reference. By making these more broadly representative references practical, Giraffe can make genomics more inclusive.
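The idea of haplotypes threaded through a sequence graph can be illustrated with a toy example. The sketch below is not Giraffe’s algorithm or data format; the node sequences, haplotype names, and helper functions are all hypothetical and chosen only to show why a read that carries a non-reference version of a variant finds an exact match in a pangenome but not in a single linear reference.

```python
# Toy sequence graph (hypothetical data): each node holds a short DNA sequence.
nodes = {1: "GATT", 2: "A", 3: "C", 4: "ACAT"}

# Edges say which nodes may follow one another. Nodes 2 and 3 are the two
# versions of a single-letter variant, so the graph branches and rejoins.
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

# Haplotypes are paths (ordered node IDs) threaded through the graph.
haplotypes = {
    "reference": [1, 2, 4],
    "alternate": [1, 3, 4],
}


def is_valid_path(path):
    """Check that every consecutive pair of nodes is joined by an edge."""
    return all((a, b) in edges for a, b in zip(path, path[1:]))


def spell(path):
    """Concatenate node sequences to recover the haplotype's DNA sequence."""
    return "".join(nodes[n] for n in path)


def haplotypes_containing(read):
    """Return the names of haplotypes in which the read matches exactly."""
    return sorted(name for name, path in haplotypes.items() if read in spell(path))


print(spell(haplotypes["reference"]))   # GATTAACAT
print(spell(haplotypes["alternate"]))   # GATTCACAT
print(haplotypes_containing("TTCAC"))   # ['alternate']
```

With only the "reference" path available, the read TTCAC would have no exact match and would have to be placed with a mismatch; with both paths in the graph it matches directly. Real mappers use indexing and approximate matching rather than this brute-force substring search.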
In a paper published on December 16 in the journal Science (“Pangenomics enables genotyping of known structural variants in 5202 diverse genomes”), the authors evaluated Giraffe’s efficiency and showed that it allows a more comprehensive characterization of genetic variations, which is increasingly needed in biomedical research and precision clinical practice.
Benedict Paten, PhD, associate professor of biomolecular engineering at UC Santa Cruz, associate director of the Genomics Institute, and corresponding author of the paper, said, “We’ve been working toward this for years, and now for the first time we have something practical that works fast and better than the single reference genome. It’s important for the future of biomedicine that genomics helps everyone equally, so we need tools that account for the diversity of human populations and are not biased.”
Although the human genome sequence is about 99.9% identical in all humans, there are scattered differences in single letters of the code (single nucleotide variants, or SNVs), short stretches of added letters (insertions) or omitted letters (deletions) that are collectively called “indels,” and larger, more complex structural variations that involve rearrangements of fifty or more letters of the code. These different types of changes, found in about 0.1% of each individual’s genome, not only account for our unique appearance but may also spell the difference between health and disease.
Structural variants are especially hard to find using a single reference genome, yet they can play an important role in some diseases. The average person has millions of SNVs and indels but only tens of thousands of larger structural variants. Even so, structural variants affect a larger portion of the genome than SNVs and indels put together.
“The workhorses of genomics have been SNVs and short indels, because structural variants have been hidden from view,” said Paten. “Pangenomics is making structural variants visible so we can study them the same way we do SNVs and short indels. There are a lot of structural variants, and they can have a big impact, so this is critical for the future of genetic studies of disease.”
In the new study, the researchers built two human genome reference graphs from publicly available genomic data to evaluate Giraffe. Jouni Sirén, PhD, a research scientist at the Genomics Institute and co-first author of the study, pioneered many of Giraffe’s key algorithmic innovations. The evaluation shows that Giraffe can accurately map new sequence data to thousands of genomes embedded in a pangenome reference as quickly as existing tools map to a single reference genome, while reducing mapping bias.
“Not only is the analysis better, but it is also as fast as current methods that use a linear reference genome,” said Jean Monlong, PhD, a postdoctoral researcher at the Genomics Institute and co-first author of the paper.
The researchers used Giraffe to map sequence reads from a diverse group of 5,202 people and determine their genotypes for 167,000 recently discovered structural variations. They estimated the frequency of different versions of these structural variants in the entire human population and within subpopulations. The authors showed that the frequency of some variants differs considerably between subpopulations and could be misinterpreted if analyzed only in individuals of one ancestry.
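To make the point about subpopulations concrete, here is a minimal sketch of how the frequency of one variant could be computed per group and overall. The genotype values and group labels are invented for illustration and are not data from the study, and the function is a generic allele-frequency calculation rather than the authors’ pipeline.

```python
from collections import defaultdict

# Hypothetical genotype calls for a single structural variant.
# Each entry is (group label, copies of the variant allele: 0, 1, or 2).
genotypes = [
    ("group_A", 0), ("group_A", 0), ("group_A", 1), ("group_A", 0),
    ("group_B", 2), ("group_B", 1), ("group_B", 1), ("group_B", 0),
]


def allele_frequencies(calls):
    """Return the variant allele frequency for each group label."""
    variant_copies = defaultdict(int)
    chromosomes = defaultdict(int)
    for group, copies in calls:
        variant_copies[group] += copies
        chromosomes[group] += 2  # each person carries two copies of the locus
    return {g: variant_copies[g] / chromosomes[g] for g in variant_copies}


by_group = allele_frequencies(genotypes)
overall = allele_frequencies([("all", copies) for _, copies in genotypes])["all"]

print(by_group)  # {'group_A': 0.125, 'group_B': 0.5}
print(overall)   # 0.3125
```

In this made-up example the variant is four times as common in one group as in the other, so a study that sampled only group_A would underestimate how widespread it is.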
The researchers found that Google Health’s deep-learning variant caller (DeepVariant) could identify SNVs and indels more accurately using Giraffe’s alignments against a pangenome than it could using alignments against a single reference genome.
“A lot of structural variants have been discovered recently using long-read sequencing,” said Monlong. “With pangenomes, we can look for these structural variants in large datasets of short-read sequencing. It’s exciting because this will allow us to study those new structural variants across many people and ask questions about their functional impact, association with disease, or role in evolution.”
Together with others at the UC Santa Cruz Genomics Institute, Paten is currently working on a comprehensive human pangenome reference funded by the National Human Genome Research Institute. The researchers expect to make this resource available to the scientific community next year.