Current Research Projects
With few exceptions, each cell in an organism contains a complete set of chromosomes. However, at any given time point only a subset of all genes is expressed. According to the central dogma of molecular biology, the first step in gene expression is the transcription of DNA into mRNA at a specific time point and in a specific cell type. mRNAs of protein-coding genes are subsequently translated into protein, while non-coding RNAs are mainly used for regulatory processes. Gene expression is a highly complex and precisely regulated process that allows a cell to react dynamically to changes in the environment or to its own changed needs. Its regulation serves as a mechanism to control which genes of a cell are expressed (a so-called ‘on/off’ switch) as well as the level of expression (a so-called ‘volume control’).
Understanding the basic principles of gene expression and, moreover, of gene regulation is still one of the open problems in biology.
We develop and apply algorithms and tools for the analysis and visualization of large-scale expression data.
Mayday is a powerful workbench for visualization, analysis and storage of microarray data or any kind of abundance data. It offers a plethora of plugins, such as more than 20 different interactive visualization methods, a large collection of hierarchical and partitioning clustering methods, a variety of statistical tests and a powerful meta information processing platform. Furthermore, several recently developed plugins allow Mayday to process further types of data, such as NGS, GWAS or eQTL data.
- Mayday SeaSight
The Mayday SeaSight extension of Mayday allows the integration of data from different platforms, such as microarray or next-generation sequencing technologies. It provides a graphical user interface offering a flexible and fully controllable approach to combine background correction, normalization and expression value computation from heterogeneous data. The resulting expression and/or abundance matrix is directly imported into Mayday and can be further analysed by Mayday’s wealth of methods.
Computational Paleogenetics: Ancient genomics
The possibility of retrieving ancient DNA (aDNA) from old specimens, and in some cases even from extinct species, provides new means for research on evolution. Although the analyses are generally similar to modern DNA protocols, several specific characteristics of aDNA render them more cumbersome. Among others, the very limited amount of endogenous DNA, short DNA fragment lengths and the damage patterns caused by DNA degradation hamper a successful analysis of aDNA. In addition to these obstacles, as in modern DNA analysis, there is a lack of automated analysis pipelines that integrate methods specifically tailored to aDNA.
In 2013 our group started work on EAGER, short for efficient ancient genome reconstruction, to offer researchers an easy-to-use pipeline to process aDNA with state-of-the-art methods.
Since then, methods and tools have been added as they became available, and the pipeline is now widely adopted in the field. The pipeline offers methods to preprocess, map and subsequently genotype ancient samples, and additionally provides a graphical user interface for easier user interaction. Furthermore, we provide new approaches, e.g. a Docker container of the whole pipeline, to enable researchers to install and run the pipeline conveniently with integrated updates on their own infrastructure without the need for complex software configuration and installation instructions. Comprehensive documentation that covers most aspects of the pipeline usage, with user videos on typical use cases, provides users with additional help for running the software.
In parallel to our efforts to make EAGER a useful tool, we have also worked on novel methods that complement the pipeline.
One requirement of the EAGER pipeline is a closely related reference genome. Mapping approaches are not able to identify genomic rearrangements or the deletion of whole genes. Furthermore, most currently published reference genomes are based on modern individuals, and it is unknown how similar they are to the respective ancient individuals. Thus, if possible, de novo assembly should be conducted. To overcome inherent problems with ancient DNA, such as varying read lengths after preprocessing, we developed MADAM. MADAM uses a two-layer assembly approach: multiple assemblies are generated in the first layer and then combined in the second layer. This approach improves assemblies of ancient DNA with regard to contig length and other assembly statistics, which in turn can make it possible to identify larger genomic rearrangements or deletions.
Pan-genomes offer a framework to assess the genomic diversity of a given collection of genomes; moreover, they help to consolidate gene predictions and annotations. As large-scale genome projects grant access to thousands of individual genomes, the need for efficient algorithms and data structures becomes more and more pressing.
This project focuses on the research and development of three aspects: pan-genome computation, evaluation and interactive visualization.
With PanGee we present a multiple-alignment-based pan-genome computation approach. The main feature of PanGee is the use of our SuperGenome data structure, which enables fast and efficient coordinate mapping for the detection of orthologous genome annotations in a multiple alignment. Currently, we are working on the bottleneck of this approach, the time-intensive computation of the multiple genome alignment. Here, our aim is to develop our data structure further so that hundreds of genomes can be aligned efficiently.
The analysis of a pan-genome addresses the genetic diversity of the respective organism. A crucial factor for this analysis is the number of available genomes. Here, the terms open and closed pan-genome have been coined, based on the organism’s capacity to acquire exogenous DNA. In open pan-genomes, the number of pan-genes, i.e. orthologous groups, within the pan-genome keeps increasing as more individuals are added, while for closed pan-genomes this number eventually converges. With our tool PanMetrics we efficiently compute the growth behaviour of core, dispensable and strain-specific genes with an increasing number of observed individuals.
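The growth behaviour described here can be illustrated with a small sketch. It is not the PanMetrics implementation; the function name, the input format (one set of orthologous-group identifiers per genome) and the toy data are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple


def pan_genome_growth(
    genomes: Dict[str, Set[str]], order: List[str]
) -> List[Tuple[int, int, int, int, int]]:
    """Track pan-genome composition while genomes are added in `order`.

    Returns one tuple (n, pan, core, dispensable, strain_specific) per step.
    `genomes` maps a genome name to its set of orthologous-group IDs
    (a hypothetical input format chosen for this sketch).
    """
    counts: Dict[str, int] = {}  # group ID -> number of genomes containing it
    rows = []
    for n, name in enumerate(order, start=1):
        for group in genomes[name]:
            counts[group] = counts.get(group, 0) + 1
        pan = len(counts)
        # Core groups occur in every genome seen so far.
        core = sum(1 for c in counts.values() if c == n)
        # Strain-specific groups occur in exactly one genome (undefined for n == 1).
        specific = sum(1 for c in counts.values() if c == 1) if n > 1 else 0
        rows.append((n, pan, core, pan - core - specific, specific))
    return rows


toy = {
    "A": {"g1", "g2", "g3"},
    "B": {"g1", "g2", "g4"},
    "C": {"g1", "g5"},
}
growth = pan_genome_growth(toy, ["A", "B", "C"])
```

In an open pan-genome the second column keeps growing as genomes are added, whereas in a closed one it levels off. Because the curve depends on the order in which genomes are added, a real analysis would typically average over many random orderings.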
With PanVis we present an interactive visualization tool for pan-genomes. The aim is to connect the different topics of this project to gain a deeper understanding of the data. It contains our previously published tool Pan-Tetris, which visualizes the presence and absence of orthologous genes in the individual genomes of the pan-genome and supports manual curation of the data. In addition, PanVis extends the utility of Pan-Tetris by connecting it with a visualization of the SuperGenome data structure. This enables the exploration of the underlying multiple alignment. PanVis will soon also support the creation of publication-quality graphics, including visualizations of the PanMetrics results mentioned above.
Visual Analytics of life science data
SuperGenome, GWAS, eQTL, expression data.
With GenomeRing we present two complementary approaches for the quick and comprehensive visualization of all important genomic variations together with various supplemental data. Our SuperGenome concept allows for the computation of a common coordinate system for all genomes in a multiple alignment. This enables the consistent placement of genome annotations in the presence of insertions, deletions, and rearrangements. The SuperGenome concept is utilized by our GenomeRing visualization which, based on the SuperGenome, creates an interactive plot of the multiple genome alignment in a circular layout.
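The idea of a common coordinate system can be sketched for the simplest case, a single collinear alignment block in which alignment columns serve as the shared coordinates. This is a simplified illustration, not the SuperGenome implementation: it handles insertions and deletions only, ignores rearrangements, and all names below are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Maps = Tuple[Dict[str, List[int]], Dict[str, List[Optional[int]]]]


def build_maps(alignment: Dict[str, str], gap: str = "-") -> Maps:
    """Build per-genome maps between genome positions and alignment columns.

    to_super[name][pos]   -> alignment column of 0-based genome position pos
    from_super[name][col] -> genome position at column col, or None for a gap
    """
    to_super: Dict[str, List[int]] = {}
    from_super: Dict[str, List[Optional[int]]] = {}
    for name, row in alignment.items():
        forward: List[int] = []
        backward: List[Optional[int]] = []
        for col, char in enumerate(row):
            if char == gap:
                backward.append(None)
            else:
                backward.append(len(forward))
                forward.append(col)
        to_super[name] = forward
        from_super[name] = backward
    return to_super, from_super


def transfer(pos: int, src: str, dst: str, maps: Maps) -> Optional[int]:
    """Map a position in genome `src` to genome `dst` via the common coordinates."""
    to_super, from_super = maps
    return from_super[dst][to_super[src][pos]]


aln = {"g1": "ACG-TA", "g2": "A-GCTA"}
maps = build_maps(aln)
same_base = transfer(3, "g1", "g2", maps)  # T in g1 aligned to T in g2
deleted = transfer(1, "g1", "g2", maps)    # C in g1 faces a gap in g2
```

With such maps, an annotation given in one genome's coordinates can be placed consistently in every other genome, or reported as absent where the target genome has a gap.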
With iHAT we present a visual analytics system for genome-wide association studies (GWAS). iHAT supports the visualization of multiple sequence data, associated metadata and hierarchical clustering. It offers a methodology for the visual assessment of SNPs using interactive hierarchical aggregation techniques combined with methods known from traditional sequence browsers and cluster heat maps to help users detect correlations between samples and associated metadata. iHAT is integrated into our visual GWAS and eQTL analytics tool Reveal.
Reveal is a visual analytics approach for the exploration, visualization and analysis of expression quantitative trait loci (eQTL) data. It offers several graph-based as well as matrix-based visualizations of associations between SNPs and gene expression. Although Reveal was designed for eQTL data, it also allows the exploration of GWAS data through the integration of the iHAT visualization system and several other traditional visualizations, such as Manhattan plots, and statistical methods as well as SNP effect prediction methods.
SpRay supports the visualization of high-dimensional data, such as microarray data, using parallel coordinates and information visualization methods like feature plot, scatterplot matrix, table view, table lens and link-constraint matrix. SpRay is developed as a scalable visual analytics framework. It combines novel visual exploration methods, such as loading maps visualizing the result of a PCA, and interaction methods with advanced statistical computing to extract the relevant information from potentially huge datasets generated by high-throughput methods. Different solutions for interactive data analysis are offered: semi-supervised clustering algorithms can help to detect patterns in the data, different linear and non-linear dimension reduction algorithms can be used to identify signals, and visual clutter can be reduced with motion as mapping techniques. SpRay was developed within “VAExpress”, a project of the SPP1335 “Scalable Visual Analytics” funded by the DFG, in collaboration with Dirk Bartz from the University of Leipzig, who tragically died in 2010.
Small non-coding RNAs in Bacteria
Classification of ncRNAs, transcriptional features, interaction, antisense RNA, transcription start site prediction.
With nocoRNAc we provide a program for the prediction and characterization of ncRNA transcripts in bacteria, which is able to operate solely on the genomic sequence of the target organism. To accomplish this, the genome is annotated with transcriptional features such as promoter regions and transcription terminators. Candidate regions are then further analysed with respect to structural conservation and characterized according to their location relative to protein-coding genes (for example to characterize antisense genes).
RNA deep-sequencing technologies (RNA-seq) are able to provide detailed insights into the transcriptomic structures of eukaryotic and prokaryotic organisms. This includes the genome-wide detection of transcription start sites (TSS).
Manual annotation of TSS is laborious and time-consuming and becomes infeasible when comparing different species.
We therefore developed TSSpredator, a software tool for the automated detection and classification of TSS from RNA-seq data.
For the comparison of different organisms we designed the SuperGenome approach, which generates a common coordinate system for the compared genomes, allowing the comparative annotation of TSS across several species.
Current Cooperation Projects
The genomic landscape of syphilis
Collaboration with Natasha Arora (University of Zürich), Verena Schünemann, Sascha Knauf (DPZ Göttingen), Sebastian Calvignac (RKI Berlin), Johannes Krause (MPI Jena)
Syphilis and yaws are two diseases caused by the bacteria Treponema pallidum pallidum and Treponema pallidum pertenue, respectively. However, they cannot be distinguished using serological tests. The disease-causing bacteria are very hard to culture, so only a few genomes from clinical samples have been deciphered so far.
In this collaboration project, DNA capture and next generation sequencing methods have been used to generate genomic data directly from clinical samples of infected individuals. These genomes have been compared to other, known genomes. The goal of this collaboration project is to understand the genomic adaptation of these bacteria to antibiotics. Phylogenetic analyses show that a new globally dominant cluster diversified in the mid-20th century.
In the future we will continue to analyze more samples from geographic locations all over the world as well as develop further methods to study the genetic details of each clade.
In the monkey syphilis project, together with researchers from the DPZ in Göttingen, the RKI in Berlin and David Smajs (Masaryk University, Brno, Czech Republic), we have started to investigate the origin of yaws.
Comparative transcriptomics of multiple bacterial strains
Cynthia Sharma (Research Center for Infectious Diseases, University of Würzburg)
Helicobacter pylori and Campylobacter jejuni represent two major pathogens of the human gastrointestinal tract. The aim of this collaboration project is to perform comparative transcriptomic analyses of different bacterial strains grown under various conditions to elucidate the regulation of mechanisms involved in pathogenicity. In this context, we apply our automated TSS prediction approach in combination with the SuperGenome for the cross-genome prediction of TSS from RNA-seq data. Based on this, follow-up analyses such as promoter sequence analyses and SNP detection are performed, which can help to explain differences in the transcriptional architectures and thereby in pathogenicity.
Publication within this project:
Dugar G, Herbig A, Förstner KU, Heidrich N, Reinhardt R, Nieselt K, Sharma CM.
High-resolution transcriptome maps reveal strain-specific regulatory features of multiple Campylobacter jejuni isolates. PLoS Genet 9(5):e1003495.
Johannes Krause (Institute for Archaeological Sciences: Paleogenetics, University of Tübingen)
Deciphering the genetic information from human remains has become possible thanks to novel techniques of high-throughput DNA sequencing and targeted DNA enrichment. In the group of Johannes Krause these techniques have been established and are applied to study ancient DNA with a focus on pathogens, to obtain insights into the evolution of historical diseases such as plague or leprosy.
Within the leprosy project we were also able to perform a de novo assembly of DNA extracted from a 1100-year-old human body, achieving an almost complete reconstruction of the genome of the leprosy strain present in the body.
Furthermore, we have contributed to the design of a microarray interrogating more than 100 pathogens, which will be used for screening and enrichment of DNA extracted from hominid remains whose causes of death are unknown.
In addition, we are investigating further samples to trace the spread and origin of leprosy in Europe.
Publications within this project:
Bos KI, Stevens P, Nieselt K, Poinar HN, DeWitte SN, Krause J.
Yersinia pestis: New Evidence for an Old Infection. PLOS ONE 2012, 7(11):e49803.
Schünemann VJ, Singh P, Mendum TA, Krause-Kyora B, Jäger G, Bos KI, Herbig A,…, Nieselt K, Krause J.
Genome-Wide Comparison of Medieval and Modern Mycobacterium leprae. Science 2013, 341(6142):179-183.
The transcriptome landscape of Streptomyces coelicolor
Streptomyces coelicolor is a model organism of the antibiotic-producing genus Streptomyces.
The aim of the SysMO STREAM consortium is to model gene regulation in this bacterium on various levels – including transcriptomics, proteomics and metabolomics.
During the course of the project, omics data of unprecedented detail have been generated. For example, more than 300 microarray experiments were conducted, allowing for the generation and analysis of 10 high-resolution time-series expression data sets combining various cultivation conditions and several mutant strains. These data, complemented by proteomics and metabolomics analyses, provide new insights into the gene regulatory landscape of this important model organism.
- A whole-genome GeneChip for expression profiling in S. coelicolor
Together with the ChipDesign group of Affymetrix, we designed a microarray that interrogates all known transcripts of the chromosome as well as both plasmids of Streptomyces coelicolor.
- Expression profiling
At specified time points during the growth phase of S. coelicolor, whole-genome expression profiling was conducted. The samples were provided by our project partner in the department of Biotechnology at SINTEF Materials and Chemistry, Trondheim, Norway, who have established protocols for highly reproducible fermentation conditions resulting in synchronized growth in parallel fermentations.
- Prediction and characterization of non-coding RNAs
For the prediction and characterization of non-coding RNAs in prokaryotic genomes we developed the software tool nocoRNAc, which incorporates various methods for the detection of transcriptional features, structural analysis and RNA-RNA interaction prediction.
- Integration with other omics data
Systems Biology studies aim at elucidating the whole set of processes within a given cell or organism, and their interdependence. Within the SysMO STREAM project, transcriptomic analyses were complemented with analyses of the proteome (project partner in Aberdeen) and the metabolome (project partner in Trondheim).
Publication within this project:
Symons S, Zipplies C, Battke F, Nieselt K.
Integrative Systems Biology Visualization with MAYDAY. Journal of Integrative Bioinformatics 2010, 7(3):115. doi:10.2390/biecoll-jib-2010-115.
Transcriptomics of stem cells
Prof. Wilhelm Aicher (Center for Regenerative Medicine, Univ. of Tübingen) and Dr. Melanie Hart (Department of Urology, University Hospital of Tübingen)
In this project, stimuli and pathways of myogenic differentiation of human mesenchymal stromal cells are investigated by analyzing the transcriptome of messenger RNAs.
Currently we are using whole-genome microarrays (Affymetrix GeneChips HGU133plus2.0), interrogating about 39,000 annotated human transcripts, to compare expression between mesenchymal stromal cells derived from bone marrow and those derived from term placenta. Here we are interested, for example, in identifying differentially expressed genes involved in the regulation of bone metabolism. For this we use our software Mayday and IPA from Ingenuity Systems.
Publication within this project:
Ulrich C, Rolauffs B, Abele H, Bonin M, Nieselt K, Hart M, Aicher W.
Low osteogenic differentiation potential of placenta-derived mesenchymal stromal cells correlates with low expression of the transcription factors Runx2 and Twist2. Stem Cells and Development 2013, 22(21):1-14.
The origin of genetic coding
Peter Wills (Department of Physics, University of Auckland, New Zealand)
The emergence of the genetic code is still one of the big puzzles of biology. It preceded the emergence of all organisms and it is, more or less, universal.
The mechanism by which the genetic code emerged is still not well understood. Genetic coding is a regular mapping from the set of tri-nucleotide codons onto the 20 standard amino acids. The mapping is mediated by the aminoacyl-tRNA synthetase (AARS) enzymes, which catalyze the assignment of a particular amino acid to its set of cognate codons. Some indication of the path of evolution of the current system of genetic coding from simpler systems is being sought in the structure of the AARSs. We are interested in the genealogy of the AARSs, which predates even the archaea/bacteria/eukarya split.
The overall aim of our research project is to use protein structure prediction of AARS enzymes to understand the evolution of genetic coding, in particular to decipher the path of the decomposition of the genetic code.
The goal is to reconstruct the AARS phylogeny according to the decomposition hypothesis that Peter Wills and I have published in a Journal of Theoretical Biology (JTB) paper. However, there is a general problem when reconstructing the evolutionary history of the AARSs in order to resolve the putative process of the evolution of the genetic code. All methods to reconstruct phylogenies use one common model for the substitution of amino acids along the evolutionary tree. This is of course fully appropriate for most applications, since the species or certain classes of proteins are historically resolved much later than the final selection of code assignments. However, when reconstructing the evolutionary history of the AARSs themselves, going backwards in time, the sequences of two AARSs coalesce at each branch-point.
Prior to a branch-point between two AARSs with differentiated specificities, two amino acids cannot be functionally differentiated. Thus we suggest that, rather than choosing one common substitution matrix for the whole phylogenetic tree, different substitution matrices should be used, appropriate to each epoch of the tree, where an epoch is a new branching point during the specification process. Together with David Bryant from the University of Otago, Dunedin, New Zealand, and Remco Bouckaert from the University of Auckland, New Zealand, we are currently implementing and applying an adapted version of Felsenstein’s maximum likelihood method for the computation of the AARS evolution.
Given a probabilistic model of evolution, Felsenstein has developed an efficient algorithm that computes the likelihood of obtaining the given sequences evolving on a given phylogenetic tree topology. We use a first order Markov process to model changes in amino acid sequences of proteins. However, rather than using the same probabilistic model along the tree, and in particular one common rate matrix for all 20 amino acids, at each branching point in the phylogenetic tree of the AARSs a substitution matrix is computed reflecting directly the differentiation process of two AARSs for that node. We thus aim to compute a phylogenetic tree of the AARSs that maximizes the likelihood of obtaining the AARS sequences under the probabilistic model of evolution of amino acids that does not make a prior assumption about the mapping from codons to amino acids, but represents a stepwise differentiation process.
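The likelihood computation referred to here is commonly known as Felsenstein's pruning algorithm. The following minimal sketch shows the recursion for a single alignment site, with the transition matrix attached to each branch so that different branches may in principle use different matrices. It is not the adapted method under development: the tree encoding, the function names and the equal-rates substitution model are assumptions chosen to keep the example short.

```python
import math
from typing import List, Sequence, Union

Matrix = List[List[float]]
# A leaf is the index of its observed state; an internal node is a list of
# (child, transition_matrix) pairs. This encoding is a choice of this sketch.
Node = Union[int, list]


def equal_rates_matrix(k: int, t: float) -> Matrix:
    """Transition probabilities of an equal-rates k-state model for branch length t."""
    decay = math.exp(-k * t / (k - 1))
    same = 1.0 / k + (k - 1) / k * decay
    diff = 1.0 / k - decay / k
    return [[same if i == j else diff for j in range(k)] for i in range(k)]


def partial_likelihoods(node: Node, k: int) -> List[float]:
    """Likelihood of the data below `node`, conditional on each state at `node`."""
    if isinstance(node, int):
        return [1.0 if state == node else 0.0 for state in range(k)]
    result = [1.0] * k
    for child, matrix in node:
        below = partial_likelihoods(child, k)
        for s in range(k):
            result[s] *= sum(matrix[s][c] * below[c] for c in range(k))
    return result


def site_likelihood(root: Node, root_freqs: Sequence[float]) -> float:
    """Sum the root's conditional likelihoods weighted by the root state frequencies."""
    k = len(root_freqs)
    return sum(f * l for f, l in zip(root_freqs, partial_likelihoods(root, k)))


# Toy tree over 20 states (amino acids): ((3, 3), 7) with arbitrary branch lengths.
inner = [(3, equal_rates_matrix(20, 0.1)), (3, equal_rates_matrix(20, 0.1))]
root = [(inner, equal_rates_matrix(20, 0.05)), (7, equal_rates_matrix(20, 0.2))]
likelihood = site_likelihood(root, [1.0 / 20] * 20)
```

In the standard setting every branch matrix derives from one common rate matrix. Since the recursion only ever reads the matrix stored on a branch, supplying matrices that differ between epochs of the tree requires no change to the recursion itself.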
Publications within this project:
Markowitz S, Drummond A, Nieselt K, Wills PR.
Simulation model of Prebiotic Evolution of Genetic Coding. ALIFEX – Tenth International Conference on the Simulation and Synthesis of Living Systems.
Nieselt-Struwe K, Wills PR.
The Emergence of Genetic Coding in Physical Systems. Journal of Theoretical Biology 1997, 187:1-14.
Other publications of Peter Wills achieved during his research visits in Tübingen:
Frameshifted Prion Proteins as Pathological Agents: Quantitative Considerations. Journal of Theoretical Biology 2013, 325:52–61.
Alternative prion proteins. FASEB J. 2012, 26:3100-3101.
Wills PR, Williams DLF, Trussell D, Mann R.
Harnessing our very life. Artificial Life 2011, 19(3-4):451-469.
Pathogenomics of Staphylococci
Friedrich Götz (Microbial Genetics, University of Tübingen), Ralph Bertram (Microbial Genetics, University of Tübingen), Jörg Bernhardt (Microbial Physiology and Molecular Biology, University of Greifswald)
Staphylococcus is a genus of Gram-positive bacteria of which several species are pathogenic to humans, most prominently Staphylococcus aureus. To elucidate mechanisms of pathogenicity and drug resistance, we aim to compare gene content among different strains. This includes protein-encoding genes but also non-coding RNAs, for example in the context of toxin-antitoxin systems.
Also within the focus of these studies is the detection and analysis of single nucleotide polymorphisms and insertions/deletions and their effect on gene function.
Together with Jörg Bernhardt we are currently setting up a common gene naming system for Staphylococcus aureus based on our SuperGenome approach.
A toxin-antitoxin system (TA system) is a pair of genes in which the product of one gene functions as a poison and the function of the other gene is to suppress this toxicity. Usually the antitoxin is less stable than the toxin. If the whole system is lost, e.g., if it is located on a plasmid that is not transferred to the daughter cell, the more stable toxin becomes functional, which potentially kills the cell or sets it to a dormant state.
In type I TA systems the antitoxin is an antisense RNA that binds to the mRNA of the toxic protein, inhibiting its translation.
In a collaboration project together with the group of Ralph Bertram we computationally identified several non-coding RNAs in Staphylococcus equorum which putatively act as antisense RNAs in a type I TA system. With further in silico analyses we assessed their structural conservation as well as their RNA-RNA interaction potential with their target mRNAs.
Publication within the project:
Schuster CF, Park JH, Prax M, Herbig A, Nieselt K, Rosenstein R, Inouye M, Bertram R.
Characterization of a mazEF toxin-antitoxin homologue from Staphylococcus equorum. J Bacteriol 2013, 195(1):115-25.
Steffen Hüttner (HB Technologies), Michael Bonin (Microarray Facility Tübingen)
State-of-the-art RNA-seq protocols allow gene expression profiling of known genes, annotation of unknown transcripts, differential splicing analysis, variant calling and estimation of allele-specific expression. The NGS technologies used for this produce tens of millions of reads, which in turn require substantial computing resources for subsequent analyses. One bottleneck is the mapping step, which requires not only powerful compute resources but also a reference genome. PASSAGE, short for ‘Parallel Sequencing Systems for the Analysis of Gene Expression’, combines a newly developed experimental protocol with computational methods.
PASSAGE extends the idea of SAGE by sequencing reads originating only from well-defined genomic positions. This is achieved by using a specialized library preparation protocol, for which full-length cDNAs are synthesized and digested with RsaI.
We have developed an efficient algorithm that rapidly clusters reads from a common genomic locus and estimates expression levels for the corresponding transcripts in time linear in the number of read sequences. It does not need a reference genome, which makes PASSAGE an ideal system for high-throughput gene expression studies in non-model organisms.
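To illustrate how reference-free clustering can run in linear time, here is a deliberately simplified sketch. It is not the published PASSAGE algorithm: it assumes that reads from the same locus begin at the same restriction site and therefore share an error-free prefix, and the function names and the `key_len` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def cluster_reads(reads: Iterable[str], key_len: int = 12) -> Dict[str, List[str]]:
    """Group reads by their first `key_len` bases using a hash map.

    Each read is hashed once, so the expected running time is linear in the
    number of reads. Reads shorter than `key_len` are skipped.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        if len(read) >= key_len:
            clusters[read[:key_len]].append(read)
    return dict(clusters)


def counts_per_million(clusters: Dict[str, List[str]]) -> Dict[str, float]:
    """Estimate relative expression as cluster size scaled to one million reads."""
    total = sum(len(members) for members in clusters.values())
    return {key: 1e6 * len(members) / total for key, members in clusters.items()}


reads = ["ACGTAA", "ACGTCC", "TTTTGG", "AC"]
clusters = cluster_reads(reads, key_len=4)
expression = counts_per_million(clusters)
```

A practical method would additionally have to tolerate sequencing errors within the key, which exact prefix matching does not.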
PASSAGE is supported by the “Zentrales Innovationsprogramm Mittelstand (ZIM)” (AIF) to establish a full-service technology platform together with HB Technologies (Dr. Steffen Hüttner) and MFT (Dr. Michael Bonin).
Publication within this project:
Battke F, Körner S, Hüttner S, Nieselt K.
Efficient sequence clustering for RNA-seq data without a reference genome. German Conference Bioinformatics 2010. Lecture Notes in Informatics. Proceedings of the German Conference on Bioinformatics 2010, Vol P-173, 21-30.
A social network for collaboration projects
Collact is a social network focusing on online collaboration, where users create projects and manage them from anywhere, at any time. Collact helps people to get in contact with their colleagues, project partners, employees and students and to run projects together. It offers a clean, simple and user-friendly interface and useful tools such as an integrated QR-code generator, a BibTeX importer and a Twitter topic analyzer (Collact.me).