
The Human Genome Project produced its first draft of the human genome in 2000, and successive refinements of that assembly led to the reference known as GRCh37 (hg19), released in 2009. In December 2013, new technologies led to the release of an updated version, GRCh38 (hg38), which uses ALT contigs to represent common complex variation, including HLA loci.
Yet, although hg38 has been available since its release seven years ago, hg19 remains widely used in research and clinical laboratories. But a new study reveals important discrepancies between the two, quantifying the impact of the choice of reference genome on the identification of variants associated with rare and common diseases. Further, the study from the Human Genome Sequencing Center at Baylor College of Medicine provides strategies for transitioning from the hg19 reference to hg38.
The Baylor researchers analyzed exome sequencing samples from 1,572 participants in the Baylor-Hopkins Center for Mendelian Genomics program. By calling variants against both the GRCh37 and GRCh38 references, the team identified single-nucleotide variants (SNVs) and insertion-deletions (indels). “We found that a total of 1.5% of SNVs and 2.0% of indels were discordant when different references were used,” they write.
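To make the idea of a discordance rate concrete, here is a minimal Python sketch. It is not the study's pipeline: the article does not give the researchers' exact definition of discordance, so this assumes a variant is discordant if it appears in only one of the two call sets, and that both sets have already been placed in a common coordinate system. The `Variant` type, the `discordance_rate` function, and the toy calls are all hypothetical.

```python
from __future__ import annotations

from typing import NamedTuple


class Variant(NamedTuple):
    """A variant call keyed by position and alleles (hypothetical type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def discordance_rate(calls_a: set[Variant], calls_b: set[Variant]) -> float:
    """Fraction of all observed variants that appear in only one call set.

    Assumption: both sets use the same coordinate system, for example
    after one set has been lifted over to the other reference.
    """
    union = calls_a | calls_b
    if not union:
        return 0.0
    # Symmetric difference: variants called on one reference but not the other.
    discordant = calls_a ^ calls_b
    return len(discordant) / len(union)


# Toy data, invented for illustration only.
hg19_calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T"),
    Variant("chr2", 50, "G", "GA"),
}
hg38_calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T"),
    Variant("chr2", 75, "T", "C"),
}

rate = discordance_rate(hg19_calls, hg38_calls)
print(f"Discordant fraction: {rate:.2f}")
```

In this toy case two of the four distinct variants are found on only one reference, so the fraction is 0.50. A real comparison would work from VCF files and would need to handle liftover failures and differences in how variants are represented.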
Their analysis revealed that nearly 77% of the discordant variants were clustered within sections of the genome with known assembly problems, which the researchers called DISCordant Reference Patches (DISCREPs). These DISCREPs, comprising only 0.9% of loci targeted by exome sequencing, were enriched for segmental duplications, fix patch sequences, and loci known to contain alternate haplotypes.
The team identified 206 genes significantly enriched for these discordant variants, most of which were in DISCREPs and caused by multi-mapped reads on the reference assembly that lacked the variant call. Of the 206 genes, eight were implicated in Mendelian diseases and 53 associated with common disease phenotypes.
“We examined the impact of using the updated reference on Mendelian genes and pathogenic variants,” said Dr. Aniko Sabo, a senior author of the study and assistant professor at the Human Genome Sequencing Center. “We wanted to provide the list of 206 genes enriched with discordant variants and bring this issue to the attention of the labs working on these genes.”
This paper confirms earlier research on SNV variability between the two genomes. In 2019, researchers from the Division of Bioinformatics and Biostatistics at the National Center for Toxicological Research found that about 1.5% of SNVs were discordant between hg19 and hg38.
This paper brings up larger questions about what the human reference genome should be. All versions of the reference, including the most recent hg38, include data from a very small number of people and countries. For hg38, 93 percent of the sequence comes from just 11 individuals and 70 percent from just one man. Several organizations raise this as an important limitation. They hope to move towards a more representative human reference genome that better captures the genomic diversity of ethnically diverse populations.
In the meantime, the Baylor researchers emphasize the reporting differences in these 206 genes. “For variant interpretation in the 206 genes enriched for discordant variants, reference assembly differences should be accounted for in the analysis, especially when lifting over variant coordinates from one reference to the other,” said Dr. He Li, co-first author of the study and a postdoctoral associate at Baylor at the time of the research.