
Middle Eastern researchers have designed a secure machine learning algorithm that they believe can protect the privacy of patients' omic data.

Omics are relatively new, comprehensive ways to analyze complete genetic or molecular profiles, such as an organism's genome or proteome.

A deluge of such data has been generated over the past few decades by massive research projects using high-throughput sequencing platforms.

But this has led to concerns over individual privacy and the potential for data to leak, raising ethical problems such as genetic discrimination. Risks are posed not only by the data itself but also by the wide range of machine learning applications, particularly deep learning, now used in areas such as genomics, medical imaging, and healthcare.

Sensitive data typically exist in a distributed manner: data owners do not want to share raw data for privacy reasons, while aggregators want access to enough data to improve model utility.

“To balance the needs of both, we need [a machine learning] framework for distributed data that can balance utility and privacy-preserving capabilities: a secure and privacy-preserving [machine learning] (PPML) method,” the researchers wrote in the journal Science Advances.

Xin Gao, PhD, a computer science professor at King Abdullah University of Science and Technology in Saudi Arabia, and co-workers attempted to develop just such a method.

They introduced the concept of differential privacy as a way to safeguard privacy by sacrificing a certain amount of data utility, ensuring continuous protection even if the model is compromised by an attacker. The team then systematically studied the privacy problem of multi-omics analysis in three biological scenarios.
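In broad strokes, and not as a description of the authors' implementation, differential privacy is usually enforced by adding noise calibrated to how much any single patient's record can change a result. The sketch below illustrates the standard Gaussian mechanism on a toy summary statistic; the function name, data, and parameters are illustrative assumptions, not part of PPML-Omics.

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng=None):
    """Return an (epsilon, delta)-differentially private version of `value`.

    Noise scale follows the classic Gaussian mechanism:
    sigma = sqrt(2 * ln(1.25 / delta)) * sensitivity / epsilon.
    """
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / epsilon
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Toy example: privatize the mean expression of one gene across 500 patients.
# Values are clipped to [0, 1], so the mean's sensitivity is 1/n: removing or
# changing any single patient can shift the mean by at most that amount.
expression = np.clip(np.random.default_rng(0).random(500), 0.0, 1.0)
sensitivity = 1.0 / len(expression)
private_mean = gaussian_mechanism(expression.mean(), sensitivity,
                                  epsilon=1.0, delta=1e-5)
print(private_mean)
```

The trade-off the researchers describe is visible in the epsilon parameter: smaller values add more noise, giving stronger privacy at the cost of data utility.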

Gao et al. proposed PPML-Omics as a way of achieving an effective trade-off between model performance and privacy-preserving capabilities.

They designed a decentralized version of the differentially private federated learning (FL) algorithm; FL was proposed in 2017 as a data-private collaborative learning method.

Essentially, the gradients of locally trained federated machine learning models are obscured through differential privacy and decentralized randomization (DR) mechanisms before being aggregated at a single, untrusted party.
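A rough sketch of what one such training round could look like, under simplifying assumptions: gradient clipping plus Gaussian noise stands in for the differential privacy step, and a shuffle of client updates stands in for decentralized randomization. This is not the authors' code; all names, data, and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, data, clip=1.0, sigma=0.5):
    """One client's step: compute a gradient, clip it, add Gaussian noise (DP)."""
    X, y = data
    grad = X.T @ (X @ weights - y) / len(y)             # gradient of squared loss
    grad = grad / max(1.0, np.linalg.norm(grad) / clip)  # clip to bound sensitivity
    return grad + rng.normal(0.0, sigma * clip, size=grad.shape)

def decentralized_round(weights, client_data, lr=0.1):
    """Clients noise their updates locally; the updates are then shuffled so the
    untrusted aggregator only sees a pool of updates it cannot attribute."""
    updates = [local_update(weights, d) for d in client_data]
    rng.shuffle(updates)                               # stand-in for decentralized randomization
    return weights - lr * np.mean(updates, axis=0)     # aggregator averages anonymized updates

# Toy run: three clients holding private regression data (illustrative only).
clients = [(rng.normal(size=(50, 4)), rng.normal(size=50)) for _ in range(3)]
w = np.zeros(4)
for _ in range(20):
    w = decentralized_round(w, clients)
print(w)
```

The point of the two mechanisms is that no raw data, and no un-noised, attributable gradient, ever reaches the aggregating party.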

The researchers applied PPML-Omics to analyze real biological data, and protect its privacy, in three challenging and representative omic data analysis tasks, each solved with a different deep learning model.

They demonstrated how to address privacy concerns in cancer classification using bulk RNA sequencing data from The Cancer Genome Atlas (TCGA), clustering of single-cell RNA sequencing data, and the integration of spatial gene expression and tumor morphology with spatial transcriptomics.

In addition, through privacy attack experiments, they examined in depth the privacy breaches present in all three tasks and demonstrated that patients' privacy could be protected by PPML-Omics.

“In each of these applications, we showed that PPML-Omics was able to outperform methods of comparison, demonstrating the versatility of the method in simultaneously balancing the privacy-preserving capability and utility in omic data analysis,” the authors noted.

“Last, we proved the privacy-preserving capability of PPML-Omics theoretically, suggesting the first mathematically guaranteed method with robust and generalizable empirical performance in the application of protecting patients’ privacy in omic data.”
