Seminar für Sprachwissenschaft

Corpora

Stuttgart-Tübingen Tagset - STTS

The Stuttgart-Tübingen Tagset (STTS) consists of 54 part-of-speech tags for annotating German text corpora with word class information. It has become a standard for part-of-speech annotation of German.

TüBa-D/Z

Tübingen Treebank of Written German - TüBa-D/Z

The Tübingen Treebank of Written German (TüBa-D/Z) is a syntactically annotated newspaper corpus based on data from the daily newspaper "die tageszeitung". The syntactic annotation was performed manually.

TüPP-D/Z

Tübingen's Partially Parsed Corpus of Written German - TüPP-D/Z

TüPP-D/Z is a collection of articles from the daily newspaper "die tageszeitung" that have been automatically annotated with clause structure, topological fields, and chunks, in addition to lower-level annotation including parts of speech and morphological ambiguity classes.

The TüPP-D/Z data of the current release is taken from the 1999 HTML distribution (scientific edition) of the "tageszeitung", which includes newspaper articles from September 2, 1986 up to May 7, 1999 and which amounts to more than 200 million word tokens of text.

WebCAGe

Web-Harvested Corpus Annotated with GermaNet Senses - WebCAGe

WebCAGe (short for: Web-Harvested Corpus Annotated with GermaNet Senses) is a domain-independent web-harvested corpus that has been semi-automatically annotated with senses from the German wordnet GermaNet. To ensure quality, all automatic annotations have been manually verified.

Index Thomisticus Treebank

The Index Thomisticus Treebank is a syntactically annotated corpus of works by Thomas Aquinas. It is a dependency treebank of Latin texts containing 170,030 tokens in a total of 9,497 syntactically parsed and tagged sentences from three of Thomas Aquinas' works.

TüBa-D/W

TüBa-D/W is a large treebank of modern written German that follows common annotation standards and is freely available under a permissive license. The treebank is based on Wikipedia text and consists of 36.1 million sentences (615 million tokens) in CoNLL-X format.
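
For readers unfamiliar with CoNLL-X, the following is a minimal Python sketch of reading such a file. It assumes the general CoNLL-X layout (one token per line, ten tab-separated columns, blank lines between sentences) and is not taken from the TüBa-D/W distribution. The function name read_conllx, the Token type, and the sample sentence are hypothetical; the part-of-speech tags in the sample are STTS tags, while the dependency labels are placeholders and not necessarily those used in the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    """One token line; the PHEAD and PDEPREL columns (9 and 10) are not kept."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: Optional[int]  # None if the column holds the "_" placeholder
    deprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token objects per sentence (hypothetical helper)."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        head = None if cols[6] == "_" else int(cols[6])
        sentence.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3],
                  cols[4], cols[5], head, cols[7])
        )
    if sentence:
        # Final sentence when the input does not end with a blank line.
        yield sentence


# Made-up sample sentence; dependency labels are illustrative placeholders.
SAMPLE_ROWS = [
    ["1", "Das", "das", "PDS", "PDS", "_", "2", "SUBJ", "_", "_"],
    ["2", "ist", "sein", "VAFIN", "VAFIN", "_", "0", "ROOT", "_", "_"],
    ["3", "gut", "gut", "ADJD", "ADJD", "_", "2", "PRED", "_", "_"],
    ["4", ".", ".", "$.", "$.", "_", "2", "PUNCT", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n\n"

sentences = list(read_conllx(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token.id, token.form, token.postag, token.head, token.deprel)
```

Because the function is a generator, a file with millions of sentences can be processed one sentence at a time by passing an open file object instead of a list of lines.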