Cloud computing has provided storage and computational resources at the scale necessary to harness the power of massive health datasets, and, with AI, has enabled researchers to make discoveries underlying cutting-edge precision medicine. But some hurdles remain.
After the human genome was sequenced in 2003, it was another 11 years, via the landmark cancer research by Eric Lander and colleagues at the Broad Institute and large programs such as the Precision Medicine Initiative in the U.S., before the avalanche of genomic data being gathered began to mature from experimental use to actionable clinical insights.
The volume of collected genomic data is now gargantuan and continues to increase exponentially as the cost and resources required to determine an individual’s sequence have fallen from millions of dollars to a few hundred, and the time required is now measured in days, not months. However, a single sequence requires about 200 gigabytes of storage, and it’s estimated that the total amount of genome-sequence data generated by 2025 will require about 40 exabytes of storage. One exabyte is equivalent to a billion gigabytes, and 40 exabytes is about eight times the storage required for all the words ever spoken by human beings.1
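To put those storage figures in proportion, the arithmetic can be sketched in a few lines of Python. This is a rough illustration that takes the two quoted estimates at face value and assumes decimal units (one exabyte as a billion gigabytes); the function and constant names are invented for the example.

```python
# Decimal (SI) units are assumed, matching "one exabyte is a billion gigabytes".
GB_PER_EXABYTE = 1_000_000_000
GB_PER_GENOME = 200        # approximate per-sequence figure quoted above
PROJECTED_EXABYTES = 40    # projected total quoted above


def genomes_that_fit(total_exabytes: int, gb_per_genome: int) -> int:
    """Return how many sequences of the given size fit in the total storage."""
    total_gb = total_exabytes * GB_PER_EXABYTE
    return total_gb // gb_per_genome


if __name__ == "__main__":
    count = genomes_that_fit(PROJECTED_EXABYTES, GB_PER_GENOME)
    print(f"{count:,} sequences")  # prints "200,000,000 sequences"
```

On those assumptions, 40 exabytes corresponds to roughly 200 million sequences of 200 gigabytes each.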
In addition to genomic data, advancing precision medicine may require research that draws on other omics data; electronic health records and clinical imaging; environmental, social, and behavioral data; research and trial data; medical literature; and, increasingly, data generated by portable and wearable devices.
Enter cloud computing, a term that describes storing and analyzing data at a remote data center, via the internet, rather than needing to have supercomputers on-site. Cloud computing’s power and scale make it possible to store and work with a large volume of data, and house the technology required to make sense of it. Since the launch of Amazon Web Services in 2006, the technology has become ubiquitous in most aspects of our online lives. But, as Chris Dwan, a U.S.-based technology consultant who has worked in the precision medicine field for many years, comments, cloud platforms offer a crucial benefit for research beyond their massive storage and computing power. “One of the things that we struggle with in the medical and scientific space is that researchers and data exist siloed in a lab somewhere, [while] patients exist in a hospital or clinic somewhere,” he explains. Cloud computing helps break down silos by putting data “in the middle of the internet, equidistant from anyone and anywhere.”
Cloud computing and AI make it possible to collect, combine, store, compute, and analyze data, and build collaborations across borders, national or otherwise, between key participants. They do this via three notional layers: the platforms themselves, where the data and computational services reside; bioinformatics and infrastructure companies using cloud platforms or hybrid systems to provide access, analysis, and technical expertise between data guardians and end users; and researchers in institutions, biotechs, and pharma companies.
Amazon, Google, Microsoft, and Philips are four big providers in the cloud platform space. Amazon launched Amazon Omics in 2022, calling it “a purpose-built service that will help bioinformaticians, researchers, and scientists store, query, analyze, and generate insights from genomic, transcriptomic, and other omics data.”2 The goal is to meet needs that frequently arise for researchers: cost-efficient ways to store easily accessed omics and other health data; tools to compute and analyze that data across millions of individuals, especially to train machine learning models; and tools that scale across very large samples while preserving accuracy, reliability, and data integrity and security.3
Customers can use this all-in-one service “off the shelf” or customize it to their needs. Importantly, it offers recognition and storage of raw sequence data output by sequencing tools, reproducible bioinformatics workflows to process that data, and an analytics component that, the company says in its press release, “simplifies analytics through query-ready variants (or mutations) and annotations.”
In precision medicine, more data generally means more (and better) insights, and combining different datasets is where the real power lies, Dwan explains. Platforms such as Amazon Omics offer tools to address challenges such as ingesting and working with information as disparate as omics data,4 electronic health records, and single-cell imaging data, and combining those data in meaningful ways for use further along in the workflow.
Accessing and connecting data securely
However, accessing, storing, and using data in the first place remains a challenge in many situations. Few experts disagree that health data remains the property of the human subject, or at least remains under the care of a data guardian, such as a hospital or healthcare provider. In a climate of increasing concern about data privacy and security, that can present a barrier to researchers, biotechs, and pharma companies who want to use it.
Swiss-based BC Platforms operates in the middle tier between data holders and research end users and has developed a purpose-built global data network and infrastructure to address the access problem. The goal is to help researchers and scientists access and use healthcare data where it lives, without the challenges of exposing that data or moving it.
Nino da Silva, deputy managing director at BC Platforms, describes the company’s work as providing “data sherpas” that do the heavy data lifting, and finding alternatives to tricky-to-access siloed data, which he believes is no longer fit for purpose in precision medicine. Data guardians and researchers need easier ways to share and use data, without the barriers that might currently exist, he says. For instance, “in 2007, with the Horizon Project in Europe we started to develop federated learning technology, which today is very mature… the whole idea of true federation is you move the researchers and AI algorithms to the data, not the data to the researcher.”
Da Silva explains that in addition to fostering collaborations, building global data networks and using federated learning also deliver more viable research insights by avoiding biases, such as those related to race, gender, and age, which tend to dominate datasets from a single location or region. BC Platforms remains focused on developing that capability. “Such networks and collaborations make it possible to allocate enough patients with a rare disease, see differentiation over a population as a whole, [and] make it possible to train algorithms and make AI actually work,” he says.
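The idea of moving the algorithm to the data can be illustrated with a minimal Python sketch. It is not BC Platforms’ technology: it assumes a toy linear model, two hypothetical hospitals with made-up records, and a single round of weighted model averaging, the aggregation step used in federated averaging. Each site fits a model on its own records and shares only the fitted parameters and a record count.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SiteUpdate:
    """What a site shares: fitted parameters and a record count, never raw rows."""
    slope: float
    intercept: float
    n_records: int


def train_locally(records: List[Tuple[float, float]]) -> SiteUpdate:
    """Fit y = slope * x + intercept by least squares on one site's own data."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    slope = sxy / sxx
    return SiteUpdate(slope, mean_y - slope * mean_x, n)


def federated_average(updates: List[SiteUpdate]) -> Tuple[float, float]:
    """Combine the site models, weighting each by its number of records."""
    total = sum(u.n_records for u in updates)
    slope = sum(u.slope * u.n_records for u in updates) / total
    intercept = sum(u.intercept * u.n_records for u in updates) / total
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical (x, y) records that stay inside each hospital.
    hospital_a = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
    hospital_b = [(0.0, 1.0), (4.0, 9.0)]
    updates = [train_locally(hospital_a), train_locally(hospital_b)]
    print(federated_average(updates))  # prints (2.0, 1.0)
```

Only the `SiteUpdate` objects cross institutional boundaries in this sketch. Production systems typically repeat the exchange over many rounds and add further privacy protections, which are left out here.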
Data collaborations frequently use datasets from large projects such as the U.K. Biobank and the All of Us program in the U.S. Other collaborations might include organizations such as Health Data Research U.K. (HDRUK), an independent charity-based initiative whose stated aim is “to unite, improve and use health and care data as one national institute,” with work spanning academia, healthcare, industry, charities, and patients and the public.5
Cloud computing and AI have also fostered novel collaborations among industry players who, despite working in competition, have discovered the collective benefit of sharing technology and data, if only up to the point of commercialization. As Dwan says, “There are lots of projects where we see collaborations, even between big pharma companies, and certainly biotechs who are developing piece-wise technologies that get acquired by pharma companies. The pre-competitive space is definitely accelerating. And that has been hugely helped by cloud technologies.”
He explains that the system can work to everybody’s advantage thanks to what he refers to as “data de-militarized zones” built on top of public cloud providers. This allows for sharing and collaboration of data and technology not already in the public domain “without giving away the business crown jewels,” since cloud systems offer the ability to control who sees what, when, how, and what they can do with it, with access granted or revoked at the push of a button.
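The kind of control described here, over who can see what and what they can do with it, can be sketched as a small permission table. This is a toy model under stated assumptions, not any cloud provider’s API: the `SharedWorkspace` class, partner names, dataset names, and actions are all hypothetical, and real platforms express the same idea through their own identity and access management services.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class SharedWorkspace:
    """Tracks which partner may perform which actions on which dataset."""
    grants: Dict[Tuple[str, str], Set[str]] = field(default_factory=dict)

    def grant(self, partner: str, dataset: str, action: str) -> None:
        """Allow one action (for example "read") on one dataset."""
        self.grants.setdefault((partner, dataset), set()).add(action)

    def revoke(self, partner: str, dataset: str) -> None:
        """Remove every permission the partner holds on the dataset."""
        self.grants.pop((partner, dataset), None)

    def is_allowed(self, partner: str, dataset: str, action: str) -> bool:
        """Check a request against the current grants; the default is deny."""
        return action in self.grants.get((partner, dataset), set())


if __name__ == "__main__":
    workspace = SharedWorkspace()
    workspace.grant("partner_biotech", "assay_results", "read")
    print(workspace.is_allowed("partner_biotech", "assay_results", "read"))    # True
    print(workspace.is_allowed("partner_biotech", "assay_results", "export"))  # False
    workspace.revoke("partner_biotech", "assay_results")
    print(workspace.is_allowed("partner_biotech", "assay_results", "read"))    # False
```

Denying by default means a partner sees nothing until a grant exists, and a single revoke call ends the collaboration’s access to that dataset.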
Peter Carr, a principal software architect at U.S.-based Lantern Pharma, a clinical stage oncology company, explains that the company uses its artificial intelligence and machine learning platform together with genomics to develop targeted cancer therapies with a higher chance of regulatory approval. “Imagine if I looked at a hypothetical universe of 10,000 lung cancer patients, and I knew precisely that 150 of these patients would be cured by my drug. I wouldn’t be able to get FDA approval, I’d have less than a 1% success rate. But if I could find a way to identify these patients ahead of time using AI and cloud computing and design a clinical trial around that, we can have near 100% success rate.”
Carr says that this would largely be impossible at scale for a small company without access to affordable cloud technology. “We use our AI platform and large sets of cancer research data that characterize the molecular, gene expression, and/or mutational profile of a tumor, and can allow us to compare tumor samples to normal samples in the same patient. So, we use these and other databases as our starting point… and it’s actually nearly impossible to analyze these datasets without a supercomputer. We’re a small company but we were able to access supercomputing using cloud platforms for probably less than some companies spend on coffee!”
Lantern also relies on cloud computing to collaborate with various potential partners: data guardians such as government-funded initiatives that have grown up since genome sequencing became affordable, academic researchers, or another company with a promising compound and real-world data to show how that compound behaves. “Such AI and data focused collaborations are key to the development of our platform and help us significantly de-risk the drug development process,” Carr says.
The infrastructure that Lantern has set up will also help to smooth future collaborations and can be scaled to meet demand as required, Carr adds, including onboarding new team members, adding new data, adding new data types, and adding new AI and ML methods, without a lengthy set-up process.
Overcoming hindrances with data and legacy systems
Carr mentions that data security is relatively easy from a technology standpoint and should ideally be transparent to end users. But he believes that several issues with data integration remain, such as data from legacy systems designed years ago with entirely different end goals in mind. “One of the real challenges we’re faced with is how to curate, clean and prepare the data into a state that it can be useful, and valid,” he says. “Who knows? Five years from now, maybe AI will have helped solve this problem.”
Dwan uses the example of medical records, which contain patient metadata but are written with other clinicians in mind and lack meaning in a research setting. One way to get around this interoperability problem is data standards such as the FAIR6 principles (Findability, Accessibility, Interoperability, and Reuse), he says. “The principles emphasize machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention), because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data,” the FAIR website states.
Dwan adds that another issue may be simple data input errors, or varied input from provider to provider. He suggests one solution could be a personal app carrying a patient’s core medical information in a standards-defined format, accessible via a QR code at the point of care, thereby eliminating inconsistencies and errors. This would create joined-up medical information for patients moving between providers, and would also remove some significant data challenges facing bioinformaticians and data companies working in precision medicine.
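What a standards-defined record buys in practice can be shown with a short Python sketch. It assumes a hypothetical four-field patient summary, not any published health data standard: the record is checked against a fixed schema and then serialized to a compact, deterministic string of the kind a QR code generator could encode. The field names and the `to_qr_payload` helper are illustrative only.

```python
import json
from datetime import date

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {
    "patient_id": str,
    "birth_date": str,   # expected as an ISO 8601 date, e.g. "1980-04-03"
    "allergies": list,
    "medications": list,
}


def validate_summary(summary: dict) -> list:
    """Return a list of problems; an empty list means the record is well formed."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in summary:
            problems.append(f"missing field: {name}")
        elif not isinstance(summary[name], expected_type):
            problems.append(f"wrong type for {name}")
    if isinstance(summary.get("birth_date"), str):
        try:
            date.fromisoformat(summary["birth_date"])
        except ValueError:
            problems.append("birth_date is not an ISO 8601 date")
    return problems


def to_qr_payload(summary: dict) -> str:
    """Serialize a valid record to compact JSON; reject malformed records."""
    problems = validate_summary(summary)
    if problems:
        raise ValueError("; ".join(problems))
    return json.dumps(summary, sort_keys=True, separators=(",", ":"))


if __name__ == "__main__":
    record = {
        "patient_id": "example-001",
        "birth_date": "1980-04-03",
        "allergies": ["penicillin"],
        "medications": [],
    }
    print(to_qr_payload(record))
```

Because every provider would read and write the same fields in the same format, a free-text date such as "03/04/1980" is rejected at entry rather than discovered later during research data cleaning.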
Without doubt, finding simple solutions to such data challenges will be a win for both patient health and future precision medicine.
- National Human Genome Research Institute—Genomic Data Fact Sheet
- Part 1: Introducing Amazon Omics—from sequence data to insights, securely and at scale, Tehsin Syed and Taha Kass-Hout
- Accelerating Biotech Treatment Development with Amazon Omics
- Computational Methods for Single-Cell Imaging and Omics Data Integration, Ebony Rose Watson, Atefeh Taherian Fard and Jessica Cara Mar
- Health Data Research UK
- GO FAIR
Matt Williams is a London-based freelance writer. He studied music at university but has worked in healthcare communications for the last 23 years, in both editorial and technical roles. He’s interested in the arts and lifestyle topics when he needs a change, and splits his time between London and Athens.