Seminar für Sprachwissenschaft

TüBa-D/Z Release 11.0 (06/2018) [Final Release]

The TüBa-D/Z treebank is a syntactically annotated German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz). The treebank comprises 3,816 articles (104,787 sentences; 1,959,474 tokens). The annotations were performed manually.

This final release is dedicated to and in memory of Dr. Heike Telljohann. The high quality of the treebank is largely owed to her commitment to the project, diligence, and attention to detail over many years.

What's new in Release 11.0?

In Release 11.0, an additional 172 articles (9,192 sentences; 171,673 tokens) have been annotated. 
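The figures above imply the size of the treebank before this release and an average sentence length. A minimal sketch of that arithmetic follows; the dictionary names are illustrative, and the derived values are computed from the numbers quoted on this page rather than taken from an official statistic.

```python
# Totals for Release 11.0 and the material added in this release,
# both copied from the figures quoted on this page.
release_11 = {"articles": 3816, "sentences": 104787, "tokens": 1959474}
added_in_11 = {"articles": 172, "sentences": 9192, "tokens": 171673}

# Implied size before the additions (derived, not quoted from the source).
before = {key: release_11[key] - added_in_11[key] for key in release_11}

# Average sentence length in tokens across the whole treebank.
avg_sentence_length = release_11["tokens"] / release_11["sentences"]

print(before)
print(round(avg_sentence_length, 2))
```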

In addition to the previously released formats, this release also contains the treebank in an automatically converted CoNLL-U (v2) format. This is the final release, although we would still like to correct the CoNLL-U trees manually if possible.
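For readers new to CoNLL-U, the sketch below shows one way to read such a file using only the Python standard library. It relies on the general CoNLL-U v2 layout (ten tab-separated columns per token, comment lines starting with "#", a blank line between sentences). The function name and the sample sentence are invented for illustration and are not taken from the treebank; the release files may also contain details this sketch does not handle.

```python
from typing import Dict, Iterator, List

# The ten column names defined by the CoNLL-U v2 format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.split("\n"):
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comment; ignored in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Made-up sample sentence (hypothetical annotation, not from TüBa-D/Z).
SAMPLE = "\n".join([
    "# text = Er schläft .",
    "\t".join(["1", "Er", "er", "PRON", "PPER", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "schläft", "schlafen", "VERB", "VVFIN", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "$.", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = list(read_conllu(SAMPLE))
for token in sentences[0]:
    print(token["id"], token["form"], token["xpos"], token["head"], token["deprel"])
```

Building the sample rows with "\t".join keeps the tab separators explicit, which avoids problems with editors that silently convert tabs to spaces.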

The Stylebook has been updated. 

Also included (since Release 9.1) are 17,910 manual annotations of a selected set of lemmas (30 nouns, 79 verbs) with their corresponding senses in the German wordnet GermaNet with the goal of providing a gold standard for word sense disambiguation. See the word sense annotation page for more information. 

View and Search:

Browse and search the TüBa-D/Z treebank using the TüNDRA treebank search web application. An institutional login or a CLARIN account is required.

 

Annotation layers:

The annotation comprises information on

  • inflectional morphology
  • lemmas
  • syntactic constituency
  • grammatical functions
  • (complex) named entities incl. semantic classification (organisation, person, location, geo-political entity, and other)
  • anaphora and coreference relations
  • GermaNet word senses
  • dependency relations (automatically created)
  • chunk annotation (automatically created)

 

The syntactic annotation is based on assumptions which are uncontroversial within major syntactic theories. The annotation scheme distinguishes four levels of syntactic constituency: 

  • the lexical level
  • the phrasal level
  • the level of topological fields
  • the clausal level

 

The primary ordering principle of a clause is the inventory of topological fields, which characterize the word order regularities among different clause types of German, and which are widely accepted by descriptive linguists of German. In addition to constituent structure, annotated trees contain edge labels between nodes. These edge labels encode grammatical functions (as relations between phrases) and the distinction between heads and non-heads (as phrase-internal relations).

 

The annotation scheme is surface-oriented in that it relies on a context-free backbone and uses neither crossing branches nor traces. Instead, it describes long-distance relations by specific functional labels.
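To make the idea of a context-free backbone with labelled edges more concrete, here is a minimal sketch of such a tree as a Python data structure. The class, the helper functions, and the example sentence are hypothetical, and the node and edge labels are meant only as illustrations of the four levels (clause, topological field, phrase, word); the stylebook remains the reference for the actual label inventory. The check shown treats a tree as free of crossing branches when every node covers a contiguous run of tokens in surface order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                      # clause, field, phrase category, or POS tag
    edge: str = "-"                 # label on the edge to the parent node
    children: List["Node"] = field(default_factory=list)
    position: Optional[int] = None  # token index, set on leaves only
    word: Optional[str] = None      # surface form, set on leaves only


def leaf(tag: str, word: str, position: int, edge: str = "HD") -> Node:
    """Create a lexical-level node for a single token."""
    return Node(label=tag, edge=edge, position=position, word=word)


def token_positions(node: Node) -> List[int]:
    """Collect token indices under a node, in child order."""
    if node.position is not None:
        return [node.position]
    positions: List[int] = []
    for child in node.children:
        positions.extend(token_positions(child))
    return positions


def is_context_free(node: Node) -> bool:
    """True if every node covers a contiguous token span in surface order."""
    positions = token_positions(node)
    if positions and positions != list(range(positions[0], positions[0] + len(positions))):
        return False
    return all(is_context_free(child) for child in node.children)


# Illustrative tree for the invented sentence "Er liest Bücher".
tree = Node("SIMPX", children=[
    Node("VF", children=[Node("NX", edge="ON", children=[leaf("PPER", "Er", 0)])]),
    Node("LK", children=[Node("VXFIN", edge="HD", children=[leaf("VVFIN", "liest", 1)])]),
    Node("MF", children=[Node("NX", edge="OA", children=[leaf("NN", "Bücher", 2)])]),
])

# A deliberately ill-formed tree: one constituent skips over token 1.
crossing = Node("X", children=[
    Node("Y", children=[leaf("A", "a", 0), leaf("B", "b", 2)]),
    leaf("C", "c", 1),
])

print(is_context_free(tree))
print(is_context_free(crossing))
```

The sketch does not model the functional labels used for long-distance relations; it only illustrates why a tree without crossing branches can be stored as plain nested constituents.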

 

Sentences in the treebank are enriched with pronominal anaphora (comprising anaphoric and cataphoric relations) as well as with coreference relations referring to nominal and pronominal antecedents. The linking relations were annotated in PALinkA with markables which were automatically extracted from TüBa-D/Z:

  • coreferential relations: 54,382
  • anaphoric relations: 50,721
  • cataphoric relations: 1,582
  • expletives: 7,976
  • bound relations: 2,603
  • split antecedents: 344
  • instances: 289
  • inherent reflexives: 9,138

 

For selected discourse connectives, the instances occurring in the treebank have been annotated with the discourse relation(s) conveyed by the connective instance. Portions of the treebank have been sense-annotated for the connectives nachdem (298 instances), während (531 instances), sobald (28 instances), seitdem (13 instances), als (169 instances), aber (161 instances), and bevor (119 instances). For annotation guidelines see Simon et al. (2011).

 

Another annotation layer contains structural information as well as implicit discourse relations for a subcorpus of 41 annotated newspaper articles (21,817 tokens) with 1,458 (explicit and implicit) discourse relations. For the annotation schema and numbers on agreement see Gastel et al. (2011).

An extensive description of the complete syntactic annotation scheme can be found in the stylebook.

Part-of-speech tags follow the "Stuttgart-Tübingen-TagSet" (STTS).

 

The annotation guidelines for anaphora and coreference relations can be found in the following manual: tuebadz-coreference-manual-2007.pdf.

The annotation guidelines for discourse connectives can be found in the following manual: tuebadz-Konnektorenhandbuch_A3_v1.1.pdf.

Data Formats:

Please see the Release README for a summary of the formats available and the annotations included in each format. 

Funding for the treebank TüBa-D/Z has come from a variety of sources:

 

How to Obtain a License for TüBa-D/Z:

For academic research, the license is provided free of charge. For all other uses please contact Erhard Hinrichs for further details.

Please note that we do not issue licenses to individuals.
Students who are interested in using TüBa-D/Z for a research project or a thesis project should ask their advisors to obtain a license for their academic institution. The license agreement must be signed by a duly authorized person.

For an academic research license, follow these steps:

  1. Print the license agreement for TüBa-D/Z (PDF).
  2. Fill out the license agreement and send it back by post, by fax, or as a scan to tuebadz-info. Please include a short description of the intended academic research use.
  3. After processing the license, we will send you a password for the download webpage.
  4. Download the treebank.

 

Contact:

Marie Hinrichs

Eberhard Karls University of Tübingen
Department of Computational Linguistics
Wilhelmstr. 19
D-72074 Tübingen, Germany

Fax: +49 - (0)7071 - 29 5214