Algorithms in Bioinformatics


Multiple transformer-based language models for accurate DNA methylation prediction

A MuLan-Methyl web server is provided here.

MuLan-Methyl is a new deep learning framework for predicting DNA methylation sites, based on five popular transformer-based language models. The framework identifies methylation sites for three types of DNA methylation: N6-adenine (6mA), N4-cytosine (4mC), and 5-hydroxymethylcytosine (5hmC). Each of the five language models is adapted to the task using the "pre-train and fine-tune" paradigm. Pre-training is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning then aims at predicting the DNA methylation status for each methylation type. The five fine-tuned models are then used to collectively predict the DNA methylation status. MuLan-Methyl performs very well on a benchmark dataset. Moreover, the model captures characteristic differences between species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to this domain of application and that joint utilisation of different language models improves model performance.
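
The collective prediction step can be illustrated with a minimal sketch. It assumes that each of the five fine-tuned models outputs a probability that a candidate site is methylated, and that the probabilities are combined by simple averaging with a 0.5 decision threshold. The summary above does not specify the combination rule, so the averaging, the threshold, the function names, and the example values are all illustrative assumptions rather than the published implementation.

```python
from statistics import mean


def ensemble_probability(model_probs):
    """Average the per-model methylation probabilities for one DNA fragment.

    Assumption: plain averaging stands in for "collective" prediction.
    """
    if not model_probs:
        raise ValueError("need at least one model probability")
    return mean(model_probs)


def predict_methylated(model_probs, threshold=0.5):
    """Call a site methylated if the averaged probability reaches the threshold.

    The 0.5 threshold is a hypothetical default, not taken from the paper.
    """
    return ensemble_probability(model_probs) >= threshold


# Hypothetical outputs of five fine-tuned language models for one candidate site.
probs = [0.9, 0.8, 0.7, 0.4, 0.7]
print(ensemble_probability(probs))
print(predict_methylated(probs))
```

In this sketch a single dissenting model (0.4) is outvoted by the other four, which is the usual motivation for combining several models rather than relying on one.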

Wenhuan Zeng, Anupam Gautam, Daniel H. Huson, MuLan-Methyl - Multiple Transformer-based Language Models for Accurate DNA Methylation Prediction, GigaScience, Volume 12, 2023, giad054
