Seminar für Sprachwissenschaft

The corpus TüPP-D/Z

The creation of TüPP-D/Z was funded by the DEREKO project and the Kompetenzzentrum für Text- und Informationstechnologie (KIT), and received additional support from the A1 project of the Sonderforschungsbereich 441.

TüPP-D/Z is a collection of articles from the taz newspaper that have been automatically annotated with clause structure, topological fields, and chunks, in addition to lower-level annotation including parts of speech and morphological ambiguity classes. All texts are processed automatically, starting with segmentation into paragraphs, sentences, and word form tokens. Word forms carry information about some regular types of named entities, including dates, telephone numbers, and number/unit combinations.

The current release of TüPP-D/Z is based on the 1999 HTML distribution (scientific edition) of the taz, which contains newspaper articles from September 2, 1986 to May 7, 1999 and comprises 11,512,293 sentences (204,425,497 tokens).

A more in-depth description of the linguistic annotation can be found in the partial parsing stylebook, and the XML encoding of the annotation is documented in the markup guide.

TüPP-D/Z is distributed in XML format. It comes with converters that help you produce, for example, a bracketed vertical format.

How to Obtain a License for TüPP-D/Z:

The raw text of 'die tageszeitung' used in the corpus is copyrighted by contrapress media GmbH, Berlin. Licenses are granted on a case-by-case basis at the discretion of the copyright holder and may include charges or restrictions on the use of the data. Please contact tuebadz-info for more information.


Marie Hinrichs

Eberhard Karls Universität Tübingen
Department of Computational Linguistics
Wilhelmstr. 19
D-72074 Tübingen

Tel.: +49 - (0)7071 - 29 78490
Fax: +49 - (0)7071 - 29 5214