
Researchers have created a new statistical framework that allows machine-learning predictions to be used in scientific analyses while preserving statistical validity, demonstrating its value in areas such as predicting gene expression.

The approach, called prediction-powered inference, allows the power and scale of machine-learning systems to be deployed while retaining confidence in the resulting scientific conclusions, study findings in the journal Science suggest.

Machine learning has progressed rapidly in the past decade, enabling scientific phenomena to be predicted far more cheaply and quickly than with gold-standard scientific techniques.

AI-powered systems have been used to predict 3D structures for vast catalogs of protein sequences and are now being used in proteomic studies. They are also being applied in other scientific areas ranging from microclimate modeling to cancer prognosis.

“Predictions are not perfect, however, and this may lead to incorrect conclusions,” Tijana Zrnic, PhD, from Stanford University, and co-workers noted. “Moreover, as predictions beget other predictions, the cumulative effect can amplify the imperfections.”

The result is that treating predictions as though they were gold-standard data can compromise the validity of the standard statistical approaches used to calculate confidence intervals and P values.
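
To see the problem concretely, consider a toy simulation (our illustration, not taken from the paper): when predictions carry even a modest systematic bias, a confidence interval computed by treating them as real measurements misses the truth far more often than its nominal error rate implies.

```python
# Toy simulation (hypothetical numbers): a naive 95% confidence interval
# built from biased predictions alone should miss the true mean only 5%
# of the time, but here it misses almost always.
import numpy as np

rng = np.random.default_rng(0)
true_mean, bias, n_unlabeled, trials = 10.0, 0.3, 2000, 1000

misses = 0
for _ in range(trials):
    y = rng.normal(true_mean, 1.0, n_unlabeled)          # unobserved outcomes
    preds = y + bias + rng.normal(0, 0.5, n_unlabeled)   # imperfect model output
    halfwidth = 1.96 * preds.std(ddof=1) / np.sqrt(n_unlabeled)
    if abs(preds.mean() - true_mean) > halfwidth:
        misses += 1

print(f"naive 95% CI missed the truth in {misses / trials:.0%} of trials")
```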

Zrnic et al. therefore developed a way of combining the best of both worlds, in which predictions from high-throughput machine-learning systems could be used alongside gold-standard data.

In this way, abundant but not always trustworthy predictions could be combined with trusted but scarce labeled data while guaranteeing the statistical validity of the conclusions.
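
For estimating a mean, the recipe can be sketched in a few lines (a minimal implementation in our own code, with our own variable names): average the predictions over the abundant unlabeled data, then correct that average with a "rectifier," the average prediction error measured on the gold-standard sample.

```python
# Minimal sketch of prediction-powered inference for a mean, assuming a
# normal approximation. Predictions on unlabeled data set the center;
# gold-standard labels correct the model's bias and widen the interval
# accordingly.
import numpy as np
from scipy import stats

def ppi_mean_ci(y_labeled, preds_labeled, preds_unlabeled, alpha=0.05):
    """Prediction-powered point estimate and (1 - alpha) CI for E[Y]."""
    n, N = len(y_labeled), len(preds_unlabeled)
    rectifier = preds_labeled - y_labeled        # model error on labeled data
    estimate = preds_unlabeled.mean() - rectifier.mean()
    # Uncertainty from both samples enters the standard error.
    se = np.sqrt(preds_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return estimate, (estimate - z * se, estimate + z * se)

# Example: 100 gold-standard labels plus 2,000 biased predictions.
rng = np.random.default_rng(1)
y_lab = rng.normal(10.0, 1.0, 100)
f_lab = y_lab + 0.3 + rng.normal(0, 0.5, 100)
f_unlab = rng.normal(10.0, 1.0, 2000) + 0.3 + rng.normal(0, 0.5, 2000)
print(ppi_mean_ci(y_lab, f_lab, f_unlab))        # interval covers 10 despite the bias
```

Because the correction is estimated from real labels, the interval remains valid even when the model is biased; the predictions serve only to shrink its width.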

The team demonstrated the value of their approach on a broad range of datasets, including in one case relating protein structure with post-translational modifications (PTMs).

Specifically, they examined whether various types of PTMs occurred more frequently in intrinsically disordered regions (IDRs) of proteins.

Specifically, to estimate the odds ratio between disorder and PTMs, the team combined AlphaFold-based IDR predictions for a dataset of hundreds of thousands of protein sequence residues with a far smaller set of gold-standard IDR labels, yielding a statistically valid confidence interval for the true odds ratio.

To reject the null hypothesis that the odds ratio was no greater than one, prediction-powered inference required just 316 labeled observations, compared with 799 for the classical approach.
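
For context, the classical side of that comparison is a standard one-sided test on the log odds ratio from a 2×2 table; the counts below are hypothetical, chosen only to illustrate the mechanics.

```python
# One-sided z-test of H0: odds ratio <= 1, on a 2x2 table of
# (disordered vs. ordered region) x (PTM vs. no PTM). Counts are made up
# for illustration; they are not the study's data.
import numpy as np
from scipy import stats

a, b = 120, 80   # disordered residues: with PTM, without PTM
c, d = 70, 130   # ordered residues:    with PTM, without PTM

log_or = np.log((a * d) / (b * c))
se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # Woolf's standard error
p_one_sided = stats.norm.sf(log_or / se)       # small p rejects H0
print(f"OR = {np.exp(log_or):.2f}, one-sided p = {p_one_sided:.2g}")
```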

The researchers also examined how promoter sequences affect gene expression. In this case, they focused on estimating three quantiles of the gene expression levels induced by native yeast promoters.

The prediction-powered confidence intervals for all three quantiles were much smaller than the classical intervals, the researchers reported. They also evaluated how many labeled examples prediction-powered inference and classical inference each required to reject the null hypothesis that the median gene expression level is at most five.

Prediction-powered inference required 764 examples while classical inference required 900, the team reported.
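
The quantile case follows the same template as the mean. The sketch below (our simplification of the paper's general recipe, with hypothetical variable names) keeps a candidate value t in the confidence interval whenever a rectified estimate of P(Y ≤ t) is statistically consistent with the target level, 0.5 for the median.

```python
# Sketch of a prediction-powered confidence interval for a quantile:
# scan candidate values t, form a rectified estimate of P(Y <= t) from
# predictions plus labels, and keep t when that estimate is within a
# normal-approximation margin of the target level q.
import numpy as np
from scipy import stats

def ppi_quantile_ci(y_lab, f_lab, f_unlab, q=0.5, alpha=0.05):
    n, N = len(y_lab), len(f_unlab)
    z = stats.norm.ppf(1 - alpha / 2)
    keep = []
    for t in np.unique(np.concatenate([y_lab, f_unlab])):  # candidate values
        ind = (f_unlab <= t).astype(float)                 # predicted CDF terms
        rect = (y_lab <= t).astype(float) - (f_lab <= t).astype(float)
        cdf_pp = ind.mean() + rect.mean()                  # rectified CDF at t
        se = np.sqrt(ind.var(ddof=1) / N + rect.var(ddof=1) / n)
        if abs(cdf_pp - q) <= z * se:
            keep.append(t)
    return (min(keep), max(keep)) if keep else None
```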

Other areas where prediction-powered inference proved useful included galaxy classification, Amazonian deforestation, and the relationship of income with both age and health insurance coverage.

The team concluded: “Given the growing number of settings with excellent predictive models and abundant unlabeled data, there is increasing potential for prediction-powered inference to benefit scientific research.”
