TüBa-D/W | Universität Tübingen

TüBa-D/W release 0

Introduction

TüBa-D/W is a large treebank of modern written German that follows common annotation standards and is freely available under a permissive license. The treebank is based on Wikipedia text and consists of 36.1 million sentences (615 million tokens) in CoNLL-X format.

Annotations

The following annotation layers are provided:

Part-of-speech tags: STTS
Lemmas: TüBa-D/Z
Morphology: TIGER
Dependency structure: TüBa-D/Z
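
As an illustration of how these layers might be read from a CoNLL-X file, here is a minimal Python sketch. It assumes the standard ten tab-separated CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences. The names Token and read_conllx are hypothetical, and the two sample sentences with their tags and relation labels are made up for the example rather than taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    """One token line; field names follow the CoNLL-X column names."""
    id: int
    form: str
    lemma: str               # lemma layer
    cpostag: str             # coarse part-of-speech tag
    postag: str              # fine part-of-speech tag
    feats: str               # morphology layer, "_" when empty
    head: Optional[int]      # ID of the head token, 0 for the root
    deprel: str              # dependency relation to the head


def _to_int(value: str) -> Optional[int]:
    """Return the integer value, or None for "_" and other non-numbers."""
    return int(value) if value.isdigit() else None


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token objects per sentence."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Pad short lines so that missing trailing columns become "_".
        cols = (line.split("\t") + ["_"] * 10)[:10]
        sentence.append(Token(
            id=int(cols[0]), form=cols[1], lemma=cols[2],
            cpostag=cols[3], postag=cols[4], feats=cols[5],
            head=_to_int(cols[6]), deprel=cols[7],
        ))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Made-up sample input, for illustration only.
SAMPLE = (
    "1\tDas\tdas\tPDS\tPDS\t_\t2\tSUBJ\t_\t_\n"
    "2\tist\tsein\tVAFIN\tVAFIN\t_\t0\tROOT\t_\t_\n"
    "3\tgut\tgut\tADJD\tADJD\t_\t2\tPRED\t_\t_\n"
    "4\t.\t.\t$.\t$.\t_\t_\t_\t_\t_\n"
    "\n"
    "1\tJa\tja\tPTKANT\tPTKANT\t_\t0\tROOT\t_\t_\n"
    "2\t.\t.\t$.\t$.\t_\t_\t_\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_conllx(SAMPLE.splitlines()):
        for tok in sent:
            print(tok.id, tok.form, tok.lemma, tok.postag, tok.head, tok.deprel)
        print()
```

Because read_conllx accepts any iterable of lines and yields sentences one at a time, an open file object can be passed to it directly, so a corpus of this size does not need to be loaded into memory at once.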

License/citation

The TüBa-D/W is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

The original Wikipedia text was preprocessed to convert it to plain text.

If you use any part of the TüBa-D/W in your work, please cite:

Daniël de Kok. TüBa-D/W: a large dependency treebank for German. Proceedings of the 13th International Workshop on Treebanks and Linguistic Theories, Tübingen, Germany, 2014.

If you use the morphology annotations, please cite:

Helmut Schmid and Florian Laws. Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging. COLING 2008, Manchester, Great Britain.

Acknowledgements

The construction of TüBa-D/W was made possible by the CLARIN-D project. The annotations were produced with WebLicht as a Service, using the following services:

OpenNLP tokenizer and sentence splitter (SfS)
RFTagger Morphology (IMS)
OpenNLP Part-of-Speech tagger (SfS)
MaltParser with morphology (SfS)
SepVerb lemmatizer (SfS)

We would like to thank the Institut für Maschinelle Sprachverarbeitung (IMS), Stuttgart for providing the RFTagger morphology service.

Availability

The treebank can be downloaded in CoNLL-X format from the repository of the CLARIN-D center Tübingen:

http://hdl.handle.net/11022/0000-0000-2D62-0