A new technique has been developed that is expected to advance our knowledge of numerous underlying biological processes, including those implicated in complex diseases like cancer. Using machine learning, a form of artificial intelligence, scientists can predict gene regulation at the cellular level, something that until now has been nearly impossible to do accurately.
An important goal in epigenetic research is to identify regions in the genome that are vulnerable to molecular factors that can alter gene expression without modifying the underlying DNA code. These factors include mechanisms like DNA methylation, histone modifications, non-coding RNA expression, and chromatin structural changes. Because these mechanisms regulate many vital biological processes, such as those controlling cell division and differentiation, understanding them holds immense promise for prospective medical applications.
Over the years, researchers have improved their understanding of epigenetic modifications and the computational approaches needed to identify these changes and where they take place along the genome. However, extracting and analyzing the data has been time-consuming and expensive. In addition, most techniques are designed to assess genome-wide binding profiles within a cell population rather than at the single-cell level. Sparsity and noise in the datasets have made it challenging to study transcription factor (TF) binding within an individual cell.
However, scientists at the University of California, Irvine, have overcome these challenges by drawing upon their expertise across multiple departments. In a recent study published in Science Advances, a team of collaborators developed a deep-learning framework based on artificial neural networks to predict TF binding at the single-cell level. Called single-cell factor analysis network, or scFAN for short, this pipeline consists of a “pre-trained model,” trained on large amounts of genomic and epigenetic data, that can predict TF binding at the cellular level. scFAN incorporates DNA sequence and ChIP-sequencing (ChIP-seq) data, single-cell ATAC-Seq (scATAC-Seq) data aggregated across similar cells, and mappability data.
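To make the idea of combining these data types more concrete, here is a minimal sketch in Python of how a DNA sequence, an accessibility signal, and a mappability track could be stacked into a single input for a neural network. It is an illustration only: the function names (one_hot, build_input), the channel order, and the toy values are assumptions made for this example, and the actual scFAN input format may differ.

```python
from typing import List, Sequence

BASES = "ACGT"  # channel order is an assumption made for this sketch


def one_hot(seq: str) -> List[List[float]]:
    """Return one row per base (A, C, G, T) with 1.0 where that base occurs.

    Unknown bases such as N are encoded as all zeros.
    """
    seq = seq.upper()
    return [[1.0 if ch == base else 0.0 for ch in seq] for base in BASES]


def build_input(
    seq: str,
    accessibility: Sequence[float],
    mappability: Sequence[float],
) -> List[List[float]]:
    """Stack sequence, accessibility, and mappability into a 6 x L matrix.

    Rows 0-3: one-hot DNA sequence (hypothetical layout).
    Row 4:    per-position accessibility signal (e.g. aggregated scATAC-Seq).
    Row 5:    per-position mappability score.
    """
    if not (len(seq) == len(accessibility) == len(mappability)):
        raise ValueError("all tracks must cover the same number of positions")
    rows = one_hot(seq)
    rows.append([float(x) for x in accessibility])
    rows.append([float(x) for x in mappability])
    return rows


# Toy example with made-up values, five positions long.
example = build_input("ACGTN", [0, 2, 5, 1, 0], [1.0, 1.0, 0.5, 1.0, 0.0])
for row in example:
    print(row)
```

Each column of the resulting matrix describes one genomic position, so a model reading it sees both which base is present and how accessible and mappable that position is.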
“The breakthrough was in realizing that we could leverage deep learning and massive datasets of tissue-level TF binding profiles to understand how TFs regulate target genes in individual cells through specific signals,” said co-author Xiaohui Xie, UCI professor of computer science.
While almost every cell in the body has the same genomic sequence, not every tissue is the same. The variation among cells, or different phenotypes, arises from a specific subset of instructions largely controlled by transcriptional regulatory pathways. Transcription factors (TFs) and other proteins orchestrate this vital process and help determine whether a gene is turned on or off by binding to nearby DNA or RNA. In general, TFs allow cells to perform their operations in proper sequence, bringing together various sources of information to decide how and when a gene is expressed.
Co-senior author and UCI Chancellor’s Professor of Mathematics, Qing Nie, believes that having the ability to predict whether TFs bind to DNA in a specific cell or cell type and at what interval “provides a new way to tease out small populations of cells that could be critical to understanding and treating diseases.”
“At the bulk level, we found that scFAN can predict TF binding motifs more accurately than other deep learning models,” wrote the researchers. “At the single-cell level, scFAN robustly identifies cellular identities, even in cells that are genetically similar.” They also believe that using scFAN allows for more accurate identification of distinct cell types at the chromatin accessibility level, and it can deal with batch effects across multiple samples.
It is becoming more evident that deep-learning technologies like scFAN can greatly improve our understanding of previously unknown domains, especially when it comes to epigenetic research. But these techniques are still limited by the quality and quantity of data available. The current study had some shortcomings in the area of data compilation and used just three cell lines for its model. Still, this tool’s potential is highly promising, especially since more TF-related data can be incorporated into the scFAN model to improve its predictions.
Qing Nie mentioned that scientists could use this new deep-learning method to identify key signals in small cell populations that are notoriously difficult to quantify or target in treatment, such as cancer stem cells. He also added, “This interdisciplinary project is a prime example of how researchers with different areas of expertise can work together to solve complex biological questions through machine-learning techniques.”
Source: Fu L. et al. (2020). Predicting transcription factor binding in single cells through deep learning. Science Advances. 6(51).
Reference: Bell B. (2021). UCI Researchers use deep learning to identify gene regulation at single-cell level. University of California – Irvine.