More than 70 volunteers from 25 countries have helped researchers at the Icahn School of Medicine at Mount Sinai analyze data from the National Center for Biotechnology Information’s (NCBI) Gene Expression Omnibus (GEO) to discover new associations between genes, diseases, and drugs. An article published Monday in Nature Communications noted that the new associations could not have been found by a smaller number of unaided researchers or by computational methods alone.
The volunteers were recruited through a massive open online course (MOOC) taught on the Coursera platform by Avi Ma’ayan, Ph.D., Professor of Pharmacological Sciences and Director of the Mount Sinai Center of Bioinformatics at the Icahn School of Medicine.
Omics repositories, with their vast stores of raw gene expression data, often from thousands of studies, provide a ripe opportunity for integrative analysis. New insights are often gained by combining the datasets of separate studies in ways that were not possible when those studies were originally published.
While computerized search engines can help automate part of this process, they still rely on intensive, time-consuming curation to ensure the accuracy of the data.
“There is an incredible amount of data stored in these databases, but much of it has not been fully explored,” said Dr. Ma’ayan. “The profiling and extrication of gene expression signatures is time-consuming and labor-intensive, and cannot be completely automated. By utilizing volunteers, so called ‘citizen-scientists,’ we were able to bring a much greater scale of human curation and quality control than we could have performed alone. By combining that human touch with automated programs, we could process much more data than would have been otherwise possible.”
In this crowdsourced project, student volunteers were first asked to identify relevant studies in the NCBI GEO database: in this case, studies in which single-gene or single-drug perturbations were applied to mammalian cells, or in which normal and diseased tissues were compared. Once the studies were selected, the volunteers extracted metadata from them and then computed differential expression using a custom-designed Chrome browser extension developed by the Mount Sinai researchers. The extracted gene signature data was then stored in a new database.
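To give a sense of what computing differential expression involves, the sketch below compares control and perturbed samples for a handful of genes and ranks them by log2 fold change. It is only an illustration: the article does not say which statistical method the browser extension applies, and the gene names, expression values, function names, and pseudocount here are all hypothetical.

```python
import math
from statistics import mean


def log2_fold_changes(control, perturbed, pseudocount=1.0):
    """Return {gene: log2 fold change} of perturbed vs. control mean expression.

    A simple fold-change measure chosen for illustration; it is an assumption,
    not the method used by the project's browser extension.
    """
    signature = {}
    # Only genes measured in both conditions can be compared.
    for gene in control.keys() & perturbed.keys():
        ctrl_mean = mean(control[gene]) + pseudocount
        pert_mean = mean(perturbed[gene]) + pseudocount
        signature[gene] = math.log2(pert_mean / ctrl_mean)
    return signature


def top_genes(signature, n=2):
    """Return the n most up-regulated and n most down-regulated genes."""
    ranked = sorted(signature.items(), key=lambda item: item[1], reverse=True)
    up = [gene for gene, lfc in ranked[:n] if lfc > 0]
    down = [gene for gene, lfc in ranked[::-1][:n] if lfc < 0]
    return up, down


# Hypothetical expression values: three replicate samples per condition.
control = {
    "GENE_A": [10.0, 12.0, 11.0],
    "GENE_B": [50.0, 48.0, 52.0],
    "GENE_C": [5.0, 6.0, 4.0],
}
perturbed = {
    "GENE_A": [40.0, 44.0, 48.0],
    "GENE_B": [12.0, 11.0, 13.0],
    "GENE_C": [5.0, 5.0, 5.0],
}

signature = log2_fold_changes(control, perturbed)
up, down = top_genes(signature, n=1)
print("Up-regulated:", up)
print("Down-regulated:", down)
```

The lists of up- and down-regulated genes together form the kind of gene expression signature the volunteers submitted. A production analysis would normally add a statistical test and multiple-testing correction rather than rely on fold change alone.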
From there, Dr. Ma’ayan’s team used a portal designed in his Mount Sinai lab, called the Crowd Extracted Expression of Differential Signatures, or CREEDS, to analyze and visualize these gene signatures. In total, more than 3,100 single-gene perturbations from more than 2,300 studies were submitted, as well as 1,238 single-drug perturbations from nearly 450 studies.
The effort shows one way to effectively combine computerized data mining techniques, which often produce subpar gene signature data, with manual efforts that typically produce superior results. The result is data refined by human curation, without the heavy time commitment of purely manual curation.
“We are grateful to the volunteers who helped demonstrate that citizen-scientists, working with researchers towards a common goal, can achieve remarkable results that have a real impact,” noted Dr. Ma’ayan. “Such collective efforts can help us discover new drugs, new causes of diseases, and new scientific knowledge.”
While many new relationships between genes, drugs, and diseases were identified, further hypotheses can be formed through additional analysis of the data, which Dr. Ma’ayan and his team have made available to the public on the CREEDS portal.