AI Detects Disease Clues from DNA Methylation Data

Artificial intelligence (AI) holds immense potential to transform healthcare by unraveling intricate patterns within complex biological data, laying the foundation for personalized medical solutions. Researchers at Linköping University in Sweden have pioneered an AI-driven method to analyze epigenetic data, with wide-ranging applications in medicine and biology. Their advanced models not only accurately estimate individuals’ chronological age but also discern their smoking history.

Many factors determine which genes are switched on or off at any given moment, including lifestyle choices such as smoking, dietary habits, and exposure to environmental pollutants. This regulation of gene activity, known as epigenetics, can be likened to a set of switches that control the expression of specific genes without altering the genetic code itself.

In a previous post, we discussed how AI is being used to enhance gene editing for epigenetic therapies. We have also discussed how researchers are developing deep-learning programs that predict single-cell transcription factor binding from extensive genomic and epigenetic data.

In this study, Linköping University researchers used DNA methylation data from over 75,000 human samples to train a range of neural network models. These models hold promise for precision medicine, in which treatments and preventive strategies are tailored to each individual. What sets the models apart is that they are “autoencoders,” a type of neural network that organizes the data on its own and identifies patterns of interrelation within very large datasets.

DNA methylation (DNAm) modifications are reliable markers for assessing long-term environmental effects. They’re well-suited for large-scale studies due to their stability and variability. Researchers use DNAm to estimate chronological age accurately and gauge the impact of factors like smoking on gene expression and lung function. DNAm also plays a significant role in autoimmune diseases such as systemic lupus erythematosus, influencing disease activity and susceptibility.

To validate their models, the Linköping researchers conducted a comprehensive comparison with existing models. These existing models primarily rely on well-established epigenetic markers known to correlate with specific health conditions. For instance, some models can distinguish between current, former, and non-smokers based on enduring epigenetic traces left by tobacco consumption. Others can estimate an individual’s chronological age or categorize individuals as healthy or afflicted with a particular disease based on their epigenetic markers.

The researchers tested their autoencoder models on three tasks: age estimation, smoking history determination, and the diagnosis of systemic lupus erythematosus (SLE). The Linköping models matched the performance of existing models and, in some cases, outperformed them.

David Martínez, a PhD student at Linköping University, emphasized, “Our models not only enable us to classify individuals based on their epigenetic data. We found that our models can identify previously known epigenetic markers used in other models, but also new markers associated with the condition we’re examining. One example of this is that our model for smoking identifies markers associated with respiratory diseases, such as lung cancer, and DNA damage.”

The primary objective of autoencoder models is to compress highly intricate biological data into a representation that captures the most pertinent characteristics and patterns within the data. Remarkably, the researchers allowed the data to guide the AI’s learning process, avoiding any preconceived notions based on existing biological knowledge.
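To make the idea of compression concrete, here is a minimal sketch of an autoencoder in Python with NumPy. It is not the researchers' NCAE model: the synthetic data, the function names (`make_synthetic_betas`, `train_autoencoder`), the single hidden layer, and all sizes and learning rates are assumptions chosen for illustration. The network squeezes 50 simulated methylation values per sample into 4 numbers and learns by trying to reconstruct its own input, with no labels involved.

```python
import numpy as np

def make_synthetic_betas(n_samples=200, n_sites=50, n_factors=3, seed=0):
    """Simulate methylation-like values in (0, 1) driven by a few hidden factors.
    Purely synthetic stand-in data, not real DNA methylation measurements."""
    rng = np.random.default_rng(seed)
    factors = rng.normal(size=(n_samples, n_factors))
    loadings = rng.normal(size=(n_factors, n_sites))
    noise = 0.3 * rng.normal(size=(n_samples, n_sites))
    # Logistic squash keeps every value strictly between 0 and 1.
    return 1.0 / (1.0 + np.exp(-(factors @ loadings + noise)))

def train_autoencoder(X, latent_dim=4, lr=0.05, epochs=400, seed=0):
    """Train a tanh encoder and linear decoder with full-batch gradient descent.
    Returns an encode() function and the reconstruction loss per epoch."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean                      # center each site
    n, d = Xc.shape
    W_enc = rng.normal(scale=0.1, size=(d, latent_dim))
    b_enc = np.zeros(latent_dim)
    W_dec = rng.normal(scale=0.1, size=(latent_dim, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(Xc @ W_enc + b_enc)          # compress to latent_dim numbers
        err = Z @ W_dec + b_dec - Xc             # reconstruction error
        losses.append(0.5 * float(np.mean(np.sum(err ** 2, axis=1))))
        # Backpropagate the squared reconstruction error.
        g_out = err / n
        g_pre = (g_out @ W_dec.T) * (1.0 - Z ** 2)
        W_dec -= lr * (Z.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (Xc.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)

    def encode(X_new):
        """Map samples to their compressed representation."""
        return np.tanh((X_new - mean) @ W_enc + b_enc)

    return encode, losses

X = make_synthetic_betas()
encode, losses = train_autoencoder(X)
Z = encode(X)
print(X.shape, "->", Z.shape)
print(f"reconstruction loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The compressed matrix `Z` is the kind of representation the paragraph above describes: a small set of learned features per sample that later analyses can inspect or reuse.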

Mika Gustafsson, professor of translational bioinformatics at Linköping University and the leader of this groundbreaking study, explained, “We didn’t steer the model and had no hypotheses based on existing biological knowledge, but let the data speak for itself. When subsequently looking at what was happening in the autoencoder, we saw that data self-organized in a way similar to how it works in the body.”

Moving forward, the researchers intend to utilize the most critical insights obtained from the autoencoder models to construct models capable of classifying a broad spectrum of environment-related, individual-specific factors. This innovative approach addresses situations where there are insufficient training data to support more complex AI models.
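One way to picture this reuse, sketched below under stated assumptions, is to learn a compact representation from a large pool of unlabeled samples and then train a very small classifier on only a handful of labeled ones. The article does not describe the team's actual procedure, so everything here is hypothetical: the data are synthetic, principal component analysis stands in for a trained autoencoder's encoder, and the names `fit_pca_encoder`, `fit_logistic`, and `predict` are invented for the example.

```python
import numpy as np

def fit_pca_encoder(X_unlabeled, latent_dim=3):
    """Stand-in for a pretrained encoder (assumption: PCA, not an autoencoder).
    Learns a projection from unlabeled samples only."""
    mean = X_unlabeled.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_unlabeled - mean, full_matrices=False)
    components = Vt[:latent_dim]
    return lambda X: (X - mean) @ components.T

def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Plain logistic regression trained by gradient descent."""
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        grad = p - y
        w -= lr * (Z.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(Z, w, b):
    """Return 0/1 class predictions."""
    return (Z @ w + b > 0).astype(int)

# Synthetic data: one hidden factor shifts 100 sites and defines a binary label.
rng = np.random.default_rng(1)
n_samples, n_sites = 1000, 100
factor = rng.normal(size=n_samples)
direction = rng.choice([-1.0, 1.0], size=n_sites)
X = 0.5 + 0.1 * np.outer(factor, direction) + 0.05 * rng.normal(size=(n_samples, n_sites))
X = np.clip(X, 0.0, 1.0)               # keep values in the methylation-like range
y = (factor > 0).astype(int)

encode = fit_pca_encoder(X)            # uses all samples, but no labels
Z_train, y_train = encode(X[:40]), y[:40]   # only 40 labeled samples
Z_test, y_test = encode(X[40:]), y[40:]

w, b = fit_logistic(Z_train, y_train)
accuracy = float(np.mean(predict(Z_test, w, b) == y_test))
print(f"held-out accuracy with 40 labeled samples: {accuracy:.2f}")
```

The design point is that the classifier has only four parameters to fit, because the heavy lifting of summarizing 100 sites was done without labels. That is why such an approach can remain workable when labeled examples are scarce.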

While certain AI systems are often seen as “black boxes,” delivering answers without revealing their underlying logic, Gustafsson and his team are committed to developing interpretable AI models. Their vision is to allow researchers to delve beneath the surface of the “black box” and gain a deeper understanding of the biology underpinning various health conditions.

As Gustafsson puts it, “We want to be able to understand what the model shows us about the biology behind disease and other conditions. Then we’ll see not only whether someone is ill or not, but, by interpreting data, we’ll also have a chance to learn why.” In this light, the AI-driven models emerging from Linköping University mark a promising step toward personalized medicine and scientific discovery.

Source: David Martínez-Enguita, et al. NCAE: data-driven representations using a deep network-coherent DNA methylation autoencoder identify robust disease and risk factor signatures. Briefings in Bioinformatics, September 21, 2023.

Reference: Karin Söderlund Leifler. A step towards AI-based precision medicine. Linköping University News. October 11, 2023.
