Deep Data and Precision Health

By Amir Bahmani

The 2000s ushered in a new era of precision health and medicine, built on advances in both utility computing and omic measurements. The revolution began, in part, in 2003, when scientists finished sequencing 92% of the human genome, and digital medicine took off when wearables and app-based wellness tools began debuting in 2007. In parallel, utility and distributed computing entered a new era with Amazon Web Services (AWS) in 2002 and Amazon Elastic Compute Cloud (EC2) in 2006, followed by competition from Google and Microsoft Azure in 2008. Together, these services separated server administration from software development for the first time.

Why do we need personalized medicine and health?

Our health is the collective product of our genome, medical history, microbiome, environmental exposure, lifestyle, and diet. The human microbiome comprises over 39 trillion microbes, including bacteria and fungi, whose favorable composition is essential to our continued health. Meanwhile, our body contains 30 to 40 trillion cells, and the genetic code in each cell comprises approximately 3 billion base pairs of DNA in germ cells and 6 billion in somatic cells [1]. While every cell has a copy of the whole genome, each cell type specializes by expressing only a tissue-specific subset of genes, forming the intricate relationships that underpin both health and disease. Because our microbiomes, genetics, and health histories differ, every one of us has a unique body and health condition, and needs a particular diet and personalized medicine to obtain and maintain optimal health.

What are the needs and challenges in digital medicine? 

Michael Snyder, PhD
Chair and Professor of Genetics
Stanford University

As sequencing costs have dropped significantly, healthcare centers across the globe have started collecting more and more genomic data and other omic information, for instance which particular genes are being expressed in different cell types (transcriptomic data). Healthcare institutions have suddenly found themselves with an enormous amount of digital data and new responsibilities for efficient large-scale data acquisition, storage, distribution, analysis, and security. In our Deep Data Research Computing Center alone, we have collected over 2 petabytes of data for just one individual, Michael Snyder, PhD, chair and professor of genetics, who has been sequencing and monitoring every possible health change as a way of testing new technologies and how they can work together. As the primary costs of dealing with this kind of data shift rapidly from sequencing itself toward computation and storage, the challenges surrounding the security and scalability of medical applications have only grown in complexity and urgency.

Research labs and healthcare providers are under a lot of pressure to deliver new software solutions to the market immediately when there is clear demand, such as better electronic health records or the use of Fitbit data in primary care physician visits. Yet even with the major accomplishments of the last decades making headlines around the world, the question remains: why does it take so long for a research study to get to production? As in any complex environment that no single team designed, there are several reasons:

1. Team formation in schools of medicine

If we compare teams in the School of Medicine to soccer teams, data scientists are the forwards, and a typical team today looks like the one in Figure 1.A: everyone wants to be Cristiano Ronaldo or Lionel Messi, but who is going to take care of the rest of the field?


Figure 1: Team Formation in the School of Medicine now (A) and in the future (B)

A solid team of data scientists can only accelerate biomedical innovation, and we should continue training more of them, as there are simply not enough to go around. At the same time, it remains vital that we also build cross-functional teams of scientists who can handle security, privacy, and scalability challenges simultaneously [2-7].

2. Think about scalability at an early stage

Another common misconception is assuming that someone outside academia can magically scale our tools. Tools and algorithms are like humans: each one has unique needs, especially as it is tested outside research environments and made ready for clinical deployment. For example, one tool can be memory-bound and need a lot of main memory, while another is CPU-bound and needs many CPU cores. Large-scale computing costs may not be a primary constraint in the research phase of a project, but they become more important as we try to take the project into the clinic. To realize affordable and scalable solutions in precision medicine, we must therefore develop improved methods for piloting deployment costs and complexity as part of project planning and funding [8].
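To make the memory-bound versus CPU-bound distinction concrete, here is a minimal sketch of how a team might compare deployment options for a tool early in planning. It is not the method from reference 8. The instance names, sizes, and prices are hypothetical placeholders, and the cost model assumes that runtime scales perfectly with the number of cores, which real tools rarely achieve.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class InstanceType:
    name: str          # hypothetical label, not a real cloud offering
    vcpus: int
    memory_gb: int
    hourly_usd: float  # illustrative price only


@dataclass(frozen=True)
class ToolProfile:
    name: str
    cpu_core_hours: float  # total compute work, measured in core-hours
    peak_memory_gb: float  # memory the tool must hold at once


# Hypothetical catalog; real offerings and prices vary by provider and region.
CATALOG = [
    InstanceType("general-8", vcpus=8, memory_gb=32, hourly_usd=0.50),
    InstanceType("compute-32", vcpus=32, memory_gb=64, hourly_usd=1.50),
    InstanceType("memory-8", vcpus=8, memory_gb=256, hourly_usd=1.00),
]


def estimate_cost(tool: ToolProfile, instance: InstanceType) -> Optional[float]:
    """Return the estimated cost in USD, or None if the tool does not fit in memory."""
    if tool.peak_memory_gb > instance.memory_gb:
        return None
    # Simplifying assumption: work divides evenly across all cores.
    hours = tool.cpu_core_hours / instance.vcpus
    return hours * instance.hourly_usd


def cheapest_fit(
    tool: ToolProfile, catalog: List[InstanceType]
) -> Optional[Tuple[InstanceType, float]]:
    """Pick the lowest-cost instance that satisfies the tool's memory need."""
    best: Optional[Tuple[InstanceType, float]] = None
    for instance in catalog:
        cost = estimate_cost(tool, instance)
        if cost is not None and (best is None or cost < best[1]):
            best = (instance, cost)
    return best
```

Under these assumptions, a CPU-heavy tool with a small memory footprint lands on the many-core instance, while a tool with a large peak memory is forced onto the high-memory instance regardless of its core count. Even a rough model like this surfaces that difference before deployment.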

3. Shared security model

Understanding the shared responsibility model in the cloud is another tricky topic. A common misconception is that the security of the entire technology stack is the responsibility of either the customer or the cloud provider alone. In reality, the stack has multiple layers, and responsibility for them is split between both parties [9]. For example, when deploying virtual machines, cloud providers are responsible for vulnerabilities in the infrastructure, while customers are responsible for the security of guest operating systems (including updates and security patches). When this division is unclear, it can lead to application security issues that require costly fixes if they are discovered too far down the road to deployment.
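One lightweight way to keep this division visible is to write it down as data and check it during project planning. The sketch below covers only the virtual machine case mentioned above. The two layers named in the text are taken from it; the remaining layer names are assumptions added for illustration, and the exact split should always be confirmed against the provider's own documentation.

```python
from enum import Enum
from typing import Dict, List


class Owner(Enum):
    PROVIDER = "cloud provider"
    CUSTOMER = "customer"


# Responsibility map for a virtual machine deployment.
# "infrastructure" and "guest operating system" follow the text above;
# the other layers are illustrative assumptions.
VM_RESPONSIBILITIES: Dict[str, Owner] = {
    "infrastructure": Owner.PROVIDER,
    "virtualization layer": Owner.PROVIDER,
    "guest operating system": Owner.CUSTOMER,  # includes updates and patches
    "application code": Owner.CUSTOMER,
    "data and access policies": Owner.CUSTOMER,
}


def layers_owned_by(owner: Owner, responsibilities: Dict[str, Owner]) -> List[str]:
    """List the layers a given party must secure."""
    return [layer for layer, who in responsibilities.items() if who is owner]


def unassigned_layers(stack: List[str], responsibilities: Dict[str, Owner]) -> List[str]:
    """Return layers of a planned stack that nobody has been assigned to secure."""
    return [layer for layer in stack if layer not in responsibilities]
```

Calling `unassigned_layers` with the layers of a planned deployment flags anything that has no owner yet, which is the kind of gap that becomes expensive when it is found late.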

4. Interoperability and federated computing

Creating a central database within an organization to handle all datasets is a common idea, but it rests on a poor assumption. Cloud services are continually evolving and adding new offerings. The fact that one application performs well on a cloud system with its current parameters does not necessarily mean that another application on the same cloud will perform equally well. To increase effective collaboration within an organization or between partner organizations, we need to create federated solutions that can use multiple different applications or clouds seamlessly. Such solutions will increase collaboration at a lower cost by minimizing data motion across platforms, reducing storage and egress costs, and reducing security and privacy risks [10].
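The core idea of minimizing data motion can be illustrated with a toy example: each site computes a small summary next to its own data, and only those summaries cross platform boundaries. The sketch below estimates an alternate-allele frequency across sites this way. It is a simplified illustration under assumed data shapes (one alternate-allele count of 0, 1, or 2 per diploid sample), not the design of the framework in reference 10, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SiteSummary:
    n_alleles: int  # alleles observed at the site (2 per diploid sample)
    n_alt: int      # copies of the alternate allele among them


def summarize_site(genotypes: Iterable[int]) -> SiteSummary:
    """Run locally at each site; raw genotypes never leave the site.

    Each genotype is the alternate-allele count for one sample: 0, 1, or 2.
    """
    values = list(genotypes)
    if any(g not in (0, 1, 2) for g in values):
        raise ValueError("genotypes must be 0, 1, or 2")
    return SiteSummary(n_alleles=2 * len(values), n_alt=sum(values))


def combined_allele_frequency(summaries: List[SiteSummary]) -> float:
    """Run at the coordinator; it only ever sees two integers per site."""
    total_alleles = sum(s.n_alleles for s in summaries)
    if total_alleles == 0:
        raise ValueError("no samples reported by any site")
    return sum(s.n_alt for s in summaries) / total_alleles
```

The combined result equals what a pooled analysis would give, yet the amount of data moved is constant per site rather than proportional to the number of samples. Real federated analyses need more than this, including authentication, auditing, and safeguards against re-identification from small counts.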


In conclusion, to deliver precision medicine at a faster pace, at least three major changes can help significantly transform digital medicine:

1. Research:
Recognize security, scalability, and interoperability as core research topics in the School of Medicine (e.g., by hiring faculty members who specialize in these areas and funding new research initiatives), and continue developing models that sustainably support academic software development.

2. Leadership:
Form cross-functional teams of experts to increase effective collaboration in such a critical field. We have to recognize the contributions of both engineers and MDs/biologists at the School of Medicine. We need hybrid leaders who understand both the engineering and medical fields, leaders who listen to both sides and then bring people from different backgrounds to the collaboration table.

3. Education:
Education is the key to transforming medicine. Educating engineers, MDs, and biologists should be the #1 priority and focus of investment for any healthcare institution. We need personalized learning to best prepare the next generation of bioinformaticians and, consequently, to deliver rapid, personalized, and accurate medicine!


1. Snyder, M. (2016). Genomics and personalized medicine: what everyone needs to know. Oxford University Press.
2. Bahmani, A., Alavi, A., Buergel, T., Upadhyayula, S., Wang, Q., Ananthakrishnan, S. K., … & Snyder, M. P. (2021). A scalable, secure, and interoperable platform for deep data-driven health management. Nature communications, 12(1), 1-11.
3. Mishra, T., Wang, M., Metwally, A. A., Bogu, G. K., Brooks, A. W., Bahmani, A., … & Snyder, M. P. (2020). Pre-symptomatic detection of COVID-19 from smartwatch data. Nature biomedical engineering, 4(12), 1208-1220.
4. Alavi, A., Bogu, G. K., Wang, M., Rangan, E. S., Brooks, A. W., Wang, Q., … & Snyder, M. P. (2022). Real-time alerting system for COVID-19 and other stress events using wearable data. Nature medicine, 28(1), 175-184.
5. Schüssler-Fiorenza Rose, S. M., Contrepois, K., Moneghetti, K. J., Zhou, W., Mishra, T., Mataraso, S., … & Snyder, M. P. (2019). A longitudinal big data approach for precision health. Nature medicine, 25(5), 792-804.
6. Li, X., Dunn, J., Salins, D., Zhou, G., Zhou, W., Schüssler-Fiorenza Rose, S. M., … & Snyder, M. P. (2017). Digital health: tracking physiomes and activity using wearable biosensors reveals useful health-related information. PLoS biology, 15(1), e2001402.
7. Hall, H., Perelman, D., Breschi, A., Limcaoco, P., Kellogg, R., McLaughlin, T., & Snyder, M. (2018). Glucotypes reveal new patterns of glucose dysregulation. PLoS biology, 16(7), e2005143.
8. Bahmani, A., Xing, Z., Krishnan, V., Ray, U., Mueller, F., Alavi, A., … & Pan, C. (2021). Hummingbird: efficient performance prediction for executing genomic applications in the cloud. Bioinformatics, 37(17), 2537-2543.
9. Dotson, C. (2019). Practical cloud security: a guide for secure design and deployment. O’Reilly Media.
10. Bahmani, A., Ferriter, K., Krishnan, V., Alavi, A., Alavi, A., Tsao, P. S., … & Pan, C. (2021). Swarm: A federated cloud framework for large-scale variant analysis. PLoS computational biology, 17(5), e1008977.


Amir Bahmani, PhD, is the Director of the Stanford Deep Data Research Computing Center (DDRCC), the Director of Science and Technology at the Stanford Healthcare Innovation Lab (SHIL), and a lecturer at Stanford University. He has been working on distributed and parallel computing applications since 2008.

Currently, Amir is an active researcher in the VA Million Veteran Program (MVP), Human Tumor Atlas Network (HTAN), the Human BioMolecular Atlas Program (HuBMAP), Stanford Metabolic Health Center (MHC) and Integrated Personal Omics Profiling (iPOP).
