Algorithms in Bioinformatics

Master Theses

Here is a listing of possible topics for projects. Further suggestions are welcome.



  • Phylogenetic analysis of phosphate-accumulation genes for short-read and long-read assembled genomes for waste-water sludge - AVAILABLE - Co-advisor: Dr. Rohan Williams (NUS)
  • Using profile-HMMs to identify protein families/domains on raw error-prone long-reads - AVAILABLE
    • Profile-HMMs are used to find proteins, protein families, or shared protein domains in sequence data. However, they are designed to work with error-free data and perform poorly when applied to error-prone long reads or long-read assemblies. The aim of this thesis is to develop HMM-based methods that can specifically handle insertion and deletion errors in error-prone long reads.
  • Identifying the antibiotic resistance potential of healthy human hosts - AVAILABLE
    • Various approaches have been used to study the antibiotic resistance potential of microbes associated with different disease states, but very few studies have examined it in healthy human hosts. We will compare the available tools to find the best-suited approach for the task, and then explore the resistance potential of the microbiota residing in healthy humans.
  • Metagenome-wide association studies (MWAS) of plant growth-promoting traits (PGPT)  - AVAILABLE
    • This project is to investigate the distribution of PGPTs (genes/proteins) in various microbiomes. Taxonomic and functional diversity will be studied regarding their significant differences to plant-associated metagenomes and their taxonomic affiliations, using e.g. DIAMOND and MEGAN.
  • Sexes in the rock pools: molecular basis and evolution of male versus female differentiation in brown algae - AVAILABLE - Co-advisor: Dr. Susana Coelho, Director MPI for Developmental Biology
    • The brown algae are a eukaryotic supergroup that has been evolving independently of animals and land plants for more than a billion years. During that time, they acquired multicellularity to become the third most developmentally complex lineage on the planet, rivalling land plants in terms of body size and complexity. The Coelho lab has recently identified several major developmental regulators and dissected the chromosomal basis of sex determination in this group (e.g. Ahmed et al., 2014; Cock et al., 2010; Coelho et al., 2018), providing a solid foundation for the future development of brown algal developmental biology and comparative molecular biology. Research currently focuses on the origin, evolution and regulation of sexual systems diversity and on the molecular and evolutionary mechanisms that underlie the complex developmental patterns and reproductive features in the brown algae. RNAseq datasets for males and females across several species of brown algae are available, and the master project involves producing more data from cultivated algae, then analyzing and interpreting the datasets using bioinformatic and molecular evolution approaches.
  • In silico analysis of the evolution and function of ABC transporters in bacterial secondary metabolite gene clusters  - AVAILABLE - Co-advisor: Prof. Nadine Ziemert (Biology Tübingen)
  • RNAseq and Iso-seq data analysis - Paul Epperlein - Co-advisor: Dr. Susana Coelho, Director MPI for Developmental Biology
  • Prediction of genes in genomes with frequent (~10%) translational frameshifting (i.e. Euplotes, with a fragmented genome like Oxytricha) - Killian Maidhof - Co-advisor: Dr. Estienne Swart, MPI for Developmental Biology
  • Assembly of a transcriptome-like genome from PacBio HiFi reads - AVAILABLE - Co-advisor: Dr. Estienne Swart, MPI for Developmental Biology
    • Stylonychia lemnae and other spirotrich ciliates (a clade of microbial eukaryotes) have an unusual, highly fragmented somatic genome architecture composed of “nanochromosomes”: telomere-to-telomere DNA molecules that typically encode one gene each. Analogous to alternative mRNA isoforms, there are occasionally also alternative nanochromosome (DNA) isoforms, e.g. a one-gene form and a two-gene form sharing a common gene. Previous Stylonychia genome assemblies with approximately 16000 telomere-to-telomere contigs were produced using Illumina sequencing. However, many molecules from such assemblies are still incomplete, i.e. possess one or no telomeres. To address this problem, we obtained deep Pacific Biosciences HiFi sequencing coverage for Stylonychia lemnae, which fully captures many telomere-to-telomere sequences. Unfortunately, conventional long-read genome assemblers do not assemble this data well, leading to low recovery of the expected common eukaryotic orthologs. Thus, there is a need to develop a better assembly approach. Previously, for a related ciliate species, a clustering approach (VSEARCH) selecting a representative centroid was used to retrieve full-length nanochromosome isoforms from earlier, less accurate PacBio sequencing (Lindblad et al. 2019). A hybrid assembly combining Illumina DNA-seq and long-read error correction of the HiFi sequences with DNA-seq was used to improve upon existing assemblies (Lindblad et al. 2019). With more accurate HiFi reads, a cleaner approach that predominantly or exclusively relies upon the long reads can be envisaged. One way to achieve this may be to build upon long-read clustering software designed for transcriptomes, e.g. isONclust (Sahlin & Medvedev 2020), tailoring it to the properties of the nanochromosome genome architecture. Processing that identifies and accommodates possible alternative isoforms and telomere addition sites before multiple sequence alignment, together with a consensus-building algorithm, e.g. SPOA, will also need to be developed to produce the final contigs.
      References: 1. Aeschlimann et al. 2014; 2. Lindblad et al. 2019; 3. Sahlin & Medvedev 2020; 4. Vaser (GitHub)
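
As a rough illustration of the "telomere-to-telomere" criterion used in the nanochromosome project above, the following Python sketch sorts reads into complete, partial, and telomere-less molecules before any clustering step. It is only a sketch under stated assumptions: the repeat unit CCCCAAAA at the 5' end (with its reverse complement at the 3' end), the 40 bp search window, and the function names are illustrative choices, not part of the project description, and exact string matching ignores sequencing errors.

```python
# Hypothetical helper names; repeat unit and thresholds are assumptions.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def telomere_ends(read: str, repeat_5p: str = "CCCCAAAA",
                  window: int = 40, min_repeats: int = 2):
    """Report whether telomeric repeats occur near the 5' and 3' ends.

    A read is scanned only within `window` bases of each end. The 3' probe
    is the reverse complement of the 5' probe, so the test gives the same
    answer for a read and for its reverse complement.
    """
    read = read.upper()
    probe_5p = repeat_5p * min_repeats
    probe_3p = reverse_complement(probe_5p)
    return probe_5p in read[:window], probe_3p in read[-window:]


def classify(read: str, **kwargs) -> str:
    """Label a read as 'complete', 'partial', or 'none' by telomere count."""
    has_5p, has_3p = telomere_ends(read, **kwargs)
    if has_5p and has_3p:
        return "complete"
    if has_5p or has_3p:
        return "partial"
    return "none"
```

Reads labelled "complete" would be natural candidates for cluster representatives, while "partial" reads could still contribute to a consensus; how to treat alternative isoforms and telomere addition sites is left open here, as in the project description.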



  • Targeted functional annotation of bacterial genomes using DIAMOND and MEGAN in a phylogenomics approach  - done
  • Use assembly graphs in contig binning - develop methods and implement in Java  - done
  • Use of GTDB and AnnoTree for protein-alignment-based microbiome analysis - done
  • qiime2megan - develop a set of tools that allows one to import qiime2 data into MEGAN and vice versa, export MEGAN analyses into qiime2 (Python or Java) - done
  • Prediction of genes in genomes with ambiguous genetic codes (where “stop” codons can be sense or stop, depending on the context) - done


  • Visualization and analysis of autocatalytic networks (implementation using JavaFX) - done
  • Protein k-mer methods for microbiome analysis - done
  • Identifying the Role of ALPs in Methanobacteriaceae - done
  • Inclusion of Environmental Data in Machine Learning Models for Genomic Prediction in Rice - done
  • Database analysis of function - done
  • Computational analysis of metagenome data from caprylate-producing bioreactors - done
  • Interpretability of Machine Learning Models for Genomic Selection in Maize - done
  • Effect of reference genome choice on variant calling - done
  • Assessment of assembly strategies for bioreactor metagenomics - done
  • Improved metagenomic contig binning using haplotagging data - done


  • Analysis of twin study microbiome samples - done
  • Performance of DIAMOND+MEGAN on CAMI data - done
  • Machine learning algorithms applied to protein sequences - done


  • JavaFX implementation of haplotype networks  (SplitsTree5 - part II) - done
  • Evolution and horizontal gene transfer of regulatory elements in bacterial secondary metabolite gene clusters  - done
  • Exploring the use of TPR "Trough to Peak Ratio" analysis to determine which bacteria are growing and which are stagnant in the human gut during a course of antibiotics - done
  • Design and implementation of a full-featured Time Series Analysis tool - done


  • JavaFX implementation of Phylogenetic network drawing and GUI (SplitsTree5 - part I) - done
  • Pathogen identification - done
  • Correspondence between KEGG and InterPro in metagenome analysis - done


  • Assembly of ancient mtDNA genomes - done
  • Real-time monitoring of resistance evolution - done
  • SamSifter - A toolbox for metagenomic analysis - done
  • Fingerprinting of microbial genomes - done


  • Haplotype profile sharing in Arabidopsis thaliana - done
  • Visualization of very large numbers of metagenome samples - done



  • Functional analysis of trinucleotide repeats in plants - done
  • Sequencing and assembly strategies for a new plant genome - done



  • Annotation of bacterial genomes - done
  • Reference-guided protein assembly - done
  • Naive Bayesian classifier for metagenomics - done


  • Pathway evaluation in (meta) transcriptomics - done
  • Analysis of 16S data - done
  • Finding confidence interval for multiple metagenome comparison networks - done
  • Assembly and annotation of the Guppy transcriptome - done
  • Correlating taxonomy and gene function with environmental parameters - done
  • Short-Read aligners in Metagenomics - done



  • Faster BLAST analysis of metagenomic data - done
  • New methods for the comparison of phylogenetic trees and networks - done
  • Hybridization networks - done
  • TE Discovery by Next Gen Sequencing - done

  • Simulation of 3rd generation sequencing technologies - done

  • Analysis of human gut data - done
  • Analysis of 16S rRNA - done
  • Database-supported analysis of metagenomic data - done
  • Functional and pathway analysis of metagenomic data - done

  • Efficient data mining techniques for two-locus association mapping - done
  • Finding Patterns in Intervals - done