Seven Bridges said today that it will make the Simons Genome Diversity Project (SGDP) dataset available to researchers through both its cloud-based platform and the Cancer Genome Cloud.
SGDP’s 35 terabytes of data includes whole genomes from 279 individuals representing 130 diverse populations worldwide, including indigenous populations on every inhabited continent. By geographical regions, the SGDP dataset consists of 44 Africans, 22 Native Americans, 27 Central Asians or Siberians, 47 East Asians, 25 Oceanians, 39 South Asians and 75 West Eurasians.
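As a quick consistency check on the regional breakdown, the figures quoted above can be tallied in a few lines of Python. This is only an illustrative sketch: the counts are copied from the article, while the variable names are hypothetical and not part of any SGDP or Seven Bridges tooling.

```python
# Regional sample counts as quoted in the article.
region_counts = {
    "Africans": 44,
    "Native Americans": 22,
    "Central Asians or Siberians": 27,
    "East Asians": 47,
    "Oceanians": 25,
    "South Asians": 39,
    "West Eurasians": 75,
}

# Total number of individuals across all regions.
total = sum(region_counts.values())

# Share of the dataset contributed by each region.
shares = {region: count / total for region, count in region_counts.items()}

for region, share in shares.items():
    print(f"{region}: {region_counts[region]} samples ({share:.1%})")
print(f"Total: {total}")
```

The regional counts sum to the 279 individuals reported for the dataset.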
All samples have been sequenced at 34- to 83-fold coverage, with an average of 43-fold.
The dataset includes 5.8 million base pairs of high-coverage sequence that is not present in the human reference genome. Seven Bridges hopes those non-reference sequences will yield clinically relevant variation, just as deCODE/Amgen’s collection of Icelandic biobank samples led to the discovery, published in Nature Genetics on February 27, of a variant linked to heart disease: a 766-bp insertion that lies within an intron of SREBP-1 and was missing from the GRCh38 reference genome and 1000 Genomes Project data.
“If we feed this type of variation into our graph genome, which becomes more accurate with each new sequence, we can get better alignment accuracy and minimize reference bias,” Seven Bridges asserted today on its blog. “What other interesting non-reference sequences exist out there in the world? We look forward to seeing new insights emerge from exploration of the new sequences in SGDP.”
SGDP is the largest dataset of human genetic variation ever reported, according to the company.
“Partnering with Seven Bridges will put this diverse and unique dataset into the hands of more researchers, in turn, speeding the discovery process,” David Reich, Ph.D., of Harvard Medical School, one of the directors of the project, said in a statement. “The Seven Bridges platform and tools provide a new way for researchers all over the world to leverage our data and make new discoveries.”
Platform users, who the company says number in the thousands, can analyze SGDP data along with their own data and other large datasets including The Cancer Genome Atlas (TCGA) and the Cancer Cell Line Encyclopedia (CCLE).
SGDP’s goals include addressing a lack of diversity in available genomic data.
While population genomics studies such as the 1000 Genomes Project have taken steps to increase understanding of genetic diversity, they have remained biased towards European populations. The Simons Foundation has sought to address that gap by selecting samples with the explicit intent of capturing as much geographic, anthropological, and linguistic diversity as possible.
The dataset’s diversity is also intended to help researchers understand evolutionary pressures and identify important parameters in the search for disease-related genes.
Initial publication of data from SGDP came on September 21, 2016, in Nature. The study cited a phylogenetic analysis of the sequences in SGDP, based on pairwise divergence per nucleotide, that showed greater genetic diversity within Africa than outside the continent. All non-African genomes appear to descend from a single group that split from the ancestors of African hunter-gatherers around 50,000 years ago, according to the study.
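For readers unfamiliar with the metric, pairwise divergence per nucleotide is, in its simplest form, the fraction of compared sites at which two aligned sequences differ. The Python sketch below illustrates that idea under stated assumptions: the sequences are invented toy data rather than SGDP genomes, the function name is hypothetical, and the study's actual pipeline, which the article does not describe, is certainly more involved.

```python
from itertools import combinations


def pairwise_divergence(seq_a: str, seq_b: str) -> float:
    """Fraction of comparable sites at which two aligned sequences differ.

    Sites where either sequence lacks a called base (anything other than
    A, C, G, or T) are skipped. This is an assumption of the sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    called = set("ACGT")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in called and b in called:
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


# Toy aligned sequences (hypothetical, not taken from SGDP).
toy_samples = {
    "sample_1": "ACGTACGTAC",
    "sample_2": "ACGTACGTAT",
    "sample_3": "ACGAACGTNT",
}

# Divergence for every unordered pair of samples.
divergence = {
    (a, b): pairwise_divergence(toy_samples[a], toy_samples[b])
    for a, b in combinations(sorted(toy_samples), 2)
}

for pair, value in divergence.items():
    print(pair, round(value, 3))
```

A matrix of such pairwise values is the kind of input a distance-based tree-building method would take. Larger values between two samples indicate that more sites differ between them.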
“Huge amounts of untapped genome diversity, especially in Africa, have the potential to accelerate precision medicine,” Seven Bridges added.