DeepMind and EMBL Release Extensive, Open Database of Predicted Protein Structures

Credit: Karen Arnott, EMBL-EBI

DeepMind, a Google AI offshoot, today announced the public launch of the “most complete and accurate database” yet of predicted protein structures.

The AlphaFold Protein Structure Database was created in collaboration with the European Molecular Biology Laboratory (EMBL). It covers around 20,000 proteins expressed by the human genome and is openly available to the scientific community. In addition, it launches with approximately 350,000 structures from 20 biologically significant organisms, such as E. coli bacteria, the fruit fly Drosophila, mice, zebrafish, the malaria parasite and tuberculosis bacteria.

Today, researchers involved with the project published a paper in Nature that provides the fullest picture of proteins that make up the human proteome. The paper’s lead author is Kathryn Tunyasuvunakool, a senior scientist at DeepMind, and the senior author is Demis Hassabis, DeepMind’s founder and CEO.

“DeepMind’s release of the AlphaFold Protein Structure Database with EMBL, Europe’s flagship organization for molecular biology, is a great leap for biological innovation that demonstrates the impact of interdisciplinary collaboration for scientific progress. With this resource freely and openly available, the scientific community will be able to draw on collective knowledge to accelerate discovery, ushering in a new era for AI-enabled biology,” said Paul Nurse, Nobel Laureate for Physiology or Medicine 2001, director of the Francis Crick Institute, and Chair of the EMBL Science Advisory Committee.

In December 2020, the organizers of the Critical Assessment of protein Structure Prediction (CASP) recognized DeepMind’s AlphaFold system as a solution to the 50-year-old grand challenge of protein structure prediction. “This will change medicine. It will change research. It will change bioengineering. It will change everything,” said Andrei Lupas, an evolutionary biologist at the Max Planck Institute for Developmental Biology in Tübingen, Germany, who was one of the experts who assessed the performance of different teams in the CASP challenge.

The AlphaFold Protein Structure Database more than doubles the number of high-accuracy human protein structures available to researchers. The ability to predict a protein’s shape computationally from its amino acid sequence – rather than determining it experimentally through years of laborious and often costly techniques – is already helping scientists to achieve in months what previously took years.

“AlphaFold’s predictions have helped accelerate our research into antibiotic resistance by finally solving experimental data that we’ve been stuck on for more than 10 years. The predictions were so accurate and precise that I initially thought I might have done something wrong with the setup!” said Marcelo Sousa, a professor in the Department of Biochemistry at the University of Colorado Boulder, one of the partners already using the database.

“The AlphaFold database is a perfect example of the virtuous circle of open science,” said EMBL Director General Edith Heard. “AlphaFold was trained using data from public resources built by the scientific community so it makes sense for its predictions to be public. Sharing AlphaFold predictions openly and freely will empower researchers everywhere to gain new insights and drive discovery.”

AlphaFold is already being used by partners such as the Drugs for Neglected Diseases Initiative (DNDi), the Centre for Enzyme Innovation (CEI), the University of Colorado Boulder, and the University of California San Francisco, which has used the database to study SARS-CoV-2 biology.

“This will be one of the most important datasets since the mapping of the Human Genome,” said Ewan Birney, EMBL Deputy Director General and EMBL-EBI Director. “Making AlphaFold predictions accessible to the international scientific community opens up so many new research avenues, from neglected diseases to new enzymes for biotechnology and everything in between. This is a great new scientific tool, which complements existing technologies, and will allow us to push the boundaries of our understanding of the world.”

The database and system will be periodically updated and the team plans to vastly expand the coverage to almost every sequenced protein known to science – over 100 million structures covering most of the UniProt reference database.

The methodology behind AlphaFold was reported in last week’s Nature, and the open-source code for AlphaFold is also available. The predicted structures are available through EMBL-EBI’s searchable database, which is open and free to all.