The annotation comprises information on
- inflectional morphology
- syntactic constituency
- grammatical functions
- (complex) named entities incl. semantic classification (organisation, person, location, geo-political entity, and other)
- anaphora and coreference relations
- GermaNet word senses
- dependency relations (automatically created)
- chunk annotation (automatically created)
The syntactic annotation is based on assumptions which are uncontroversial within major syntactic theories. The annotation scheme distinguishes four levels of syntactic constituency:
- the lexical level
- the phrasal level
- the level of topological fields
- the clausal level
The primary ordering principle of a clause is the inventory of topological fields, which characterize the word order regularities among different clause types of German, and which are widely accepted by descriptive linguists of German. In addition to constituent structure, annotated trees contain edge labels between nodes. These edge labels encode grammatical functions (as relations between phrases) and the distinction between heads and non-heads (as phrase-internal relations).
The annotation scheme is surface-oriented in that it relies on a context-free backbone and uses neither crossing branches nor traces. Instead, it describes long-distance relations by specific functional labels.
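As a rough illustration of the scheme described above, the following sketch models a tree with the four levels (clause, topological field, phrase, lexical item) and edge labels on every node. The class `Node`, the helper `grammatical_functions`, and the example sentence are hypothetical; the category and function labels (SIMPX, VF, LK, MF, NX, VXFIN, ON, OA, HD) are used for illustration only and should be checked against the stylebook. It does not reproduce any release file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One tree node; `edge` labels the edge from this node up to its parent."""
    cat: str                         # category: clause, field, phrase, or POS tag
    edge: str = "-"                  # grammatical function or head marker; "-" = unlabeled
    children: List["Node"] = field(default_factory=list)
    token: Optional[str] = None      # filled only at the lexical level

    def leaves(self) -> List[str]:
        """Tokens in depth-first order; without crossing branches this is surface order."""
        if self.token is not None:
            return [self.token]
        return [tok for child in self.children for tok in child.leaves()]

    def brackets(self) -> str:
        """Render the tree as a labeled bracketing, CAT:EDGE for every node."""
        label = f"{self.cat}:{self.edge}"
        if self.token is not None:
            return f"({label} {self.token})"
        return f"({label} " + " ".join(c.brackets() for c in self.children) + ")"


def grammatical_functions(node: Node) -> List[Tuple[str, str]]:
    """Collect (edge label, covered text) for phrases that are neither heads nor unlabeled."""
    found = []
    for child in node.children:
        if child.token is None and child.edge not in ("-", "HD"):
            found.append((child.edge, " ".join(child.leaves())))
        found.extend(grammatical_functions(child))
    return found


# Illustrative clause "Peter liest ein Buch" (labels are assumptions, see lead-in).
sentence = Node("SIMPX", children=[
    Node("VF", children=[                      # initial field
        Node("NX", "ON", [Node("NE", "HD", token="Peter")]),
    ]),
    Node("LK", children=[                      # left bracket with the finite verb
        Node("VXFIN", "HD", [Node("VVFIN", "HD", token="liest")]),
    ]),
    Node("MF", children=[                      # middle field
        Node("NX", "OA", [
            Node("ART", "-", token="ein"),
            Node("NN", "HD", token="Buch"),
        ]),
    ]),
])

print(sentence.brackets())
print(grammatical_functions(sentence))
```

Because the tree is a plain context-free structure, a depth-first walk returns the tokens in sentence order, and grammatical functions can be read directly off the edge labels without resolving traces.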
Sentences in the treebank are enriched with pronominal anaphora (comprising anaphoric and cataphoric relations) as well as with coreference relations referring to nominal and pronominal antecedents. The linking relations were annotated in PALinkA, using markables that were automatically extracted from TüBa-D/Z:
- coreferential relations: 54,382
- anaphoric relations: 50,721
- cataphoric relations: 1,582
- expletives: 7,976
- bound relations: 2,603
- split antecedents: 344
- instances: 289
- inherent reflexives: 9,138
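For orientation, the counts above can be totalled and expressed as shares. The snippet below is only a small arithmetic sketch: the figures are copied from the list, the dictionary keys are shorthand chosen here, and it assumes the eight categories are disjoint, which this page does not state explicitly.

```python
# Relation counts copied from the list above; key names are shorthand.
relation_counts = {
    "coreferential": 54_382,
    "anaphoric": 50_721,
    "cataphoric": 1_582,
    "expletive": 7_976,
    "bound": 2_603,
    "split antecedent": 344,
    "instance": 289,
    "inherent reflexive": 9_138,
}

# Assumption: categories do not overlap, so the sum is the number of annotated relations.
total = sum(relation_counts.values())
shares = {name: count / total for name, count in relation_counts.items()}

# Print the categories from most to least frequent with their share of the total.
for name, count in sorted(relation_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20}{count:>8,}{shares[name]:>8.1%}")
print(f"{'total':<20}{total:>8,}")
```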
For selected discourse connectives, the instances occurring in the treebank have been annotated with the discourse relation(s) conveyed by the connective instance. Portions of the treebank have been sense-annotated for the connectives nachdem (298 instances), während (531 instances), sobald (28 instances), seitdem (13 instances), als (169 instances), aber (161 instances), and bevor (119 instances). For annotation guidelines see Simon et al. (2011).
Another annotation layer contains structural information as well as implicit discourse relations for a subcorpus of 41 annotated newspaper articles (21,817 tokens) with 1,458 (explicit and implicit) discourse relations. For the annotation schema and agreement figures, see Gastel et al. (2011).
An extensive description of the complete syntactic annotation scheme can be found in the stylebooks:
- Stylebook 2003
- Stylebook 2005
- Stylebook 2006
- Stylebook 2009
- Stylebook 2012
- Stylebook 2015
- Stylebook 2017
Part-of-speech tags are annotated with the "Stuttgart-Tübingen-TagSet" (STTS).
The annotation guidelines of anaphora and coreference relations can be found in the following manual: tuebadz-coreference-manual-2007.pdf.
The annotation guidelines of discourse connectives can be found in the following manual: tuebadz-Konnektorenhandbuch_A3_v1.1.pdf.
Please see the Release README for a summary of the formats available and the annotations included in each format.
Funding for the treebank TüBa-D/Z has come from a variety of sources:
- the Competence Center for Text- and Information Technology (Kompetenzzentrum für Text- und Informationstechnologie – KIT) grant by the Ministry of Science, Research and the Arts Baden-Württemberg (funding since 2000);
- the collaborative research center (Sonderforschungsbereich) grant SFB 441 – Linguistic Data Structures, project A1 – Representation and Automatic Acquisition of Linguistic Data funded by the German Research Council (Deutsche Forschungsgemeinschaft – DFG);
- the collaborative research center (Sonderforschungsbereich) grant SFB 833 – The construction of meaning - the dynamics and adaptivity of linguistic structures, project A3 – Disambiguating Discourse Connectives using Corpus-induced Semantic Relations funded by the German Research Council (Deutsche Forschungsgemeinschaft – DFG);
- the ESFRI research infrastructure project grants D-SPIN and CLARIN-D funded by the Federal Ministry of Education and Research (BMBF) (funding since 2008).
How to Obtain a License for TüBa-D/Z:
For academic research, the license is provided free of charge. For all other uses please contact Erhard Hinrichs for further details.
Please note that we do not give licenses to individuals.
Students who are interested in using TüBa-D/Z for a research project or a thesis project should contact their advisors to obtain a license for their academic institutions. The license agreement has to be signed by a duly authorized person.
For an academic research license, follow these steps:
- Print the License agreement for TüBa-D/Z (PDF).
- Fill out the license agreement and send it back via post, fax or scan to tuebadz-info. Please give a short description of the intended academic research use.
- After processing the license, we will send you a password for the download webpage.
- Download the treebank.
Eberhard Karls University of Tübingen
Department of Computational Linguistics
D-72074 Tübingen, Germany
Fax: +49 - (0)7071 - 29 5214