Seminar für Sprachwissenschaft

WebCAGe

Web-Harvested Corpus Annotated with GermaNet Senses

WebCAGe (short for: Web-Harvested Corpus Annotated with GermaNet Senses) is a sense-annotated corpus for German, annotated with senses from the German wordnet GermaNet. The corpus is domain-independent. It was automatically harvested with the help of the German Wiktionary, an online dictionary of German, using the Wiktionary dump of February 1, 2011. To ensure quality, all automatic annotations have been manually verified.

WebCAGe was constructed in two steps:

  1. Mapping lexical units of GermaNet to senses in Wiktionary. For more information about this mapping, see: http://www.sfs.uni-tuebingen.de/GermaNet/wiktionary.shtml
  2. Extracting the example sentences for each Wiktionary sense, together with references to source documents, using the mapping created in step 1. (For further information on the semi-automatic corpus creation, please see the paper referenced below.)
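The two steps can be pictured as a simple join between the mapping and the Wiktionary examples. The following Python sketch is illustrative only: the lexical unit IDs, the sense keys, the example sentences, and the function name `harvest_examples` are all invented for this illustration and do not reflect the actual WebCAGe data format or implementation.

```python
# Hypothetical data: none of these IDs or sentences are taken from WebCAGe.
# Step 1 result: GermaNet lexical unit ID -> Wiktionary sense key (word, sense number).
lexunit_to_wiktionary = {
    "l1001": ("Bank", 1),
    "l1002": ("Bank", 2),
}

# Wiktionary example sentences, keyed by the same (word, sense number) key.
wiktionary_examples = {
    ("Bank", 1): ["Er sitzt auf der Bank im Park."],
    ("Bank", 2): ["Sie bringt das Geld zur Bank."],
}


def harvest_examples(mapping, examples):
    """Step 2: attach a GermaNet lexical unit ID to each example sentence."""
    annotated = []
    for lexunit_id, sense_key in mapping.items():
        # A sense without example sentences simply contributes nothing.
        for sentence in examples.get(sense_key, []):
            annotated.append(
                {"lexunit": lexunit_id, "word": sense_key[0], "sentence": sentence}
            )
    return annotated


corpus = harvest_examples(lexunit_to_wiktionary, wiktionary_examples)
for entry in corpus:
    print(entry["lexunit"], entry["word"], entry["sentence"])
```

In the real corpus, each automatically produced annotation of this kind was additionally verified by hand.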

 

The corpus consists of the following four major components:

  • Example sentences from Wiktionary: approximately 1-3 example sentences per word sense.
  • Wikipedia articles: annotated Wikipedia articles.
  • Gutenberg texts: snippets of the Gutenberg source documents. These files contain all tagged target word occurrences, each with a context window of five sentences before and five sentences after the sentence containing the tagged target word.
  • External webpages: texts obtained from German newspapers and other German websites that are referenced in Wiktionary example sentences.
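The context window used for the Gutenberg snippets can be illustrated with a short Python sketch. It assumes the text is already split into a list of sentences and that the position of the sentence containing the target word is known; the function name `context_window` and the sample data are hypothetical and not part of WebCAGe.

```python
def context_window(sentences, target_index, size=5):
    """Return the target sentence plus up to `size` sentences on each side."""
    # Clamp the window at the start and end of the document.
    start = max(0, target_index - size)
    end = min(len(sentences), target_index + size + 1)
    return sentences[start:end]


# Hypothetical document of 20 placeholder sentences.
sentences = [f"Sentence {i}." for i in range(20)]

# Target word occurs in sentence 10: the snippet spans sentences 5 to 15.
snippet = context_window(sentences, target_index=10)
print(len(snippet), snippet[0], snippet[-1])
```

Near the beginning or end of a document the window is simply shorter, since fewer than five sentences are available on one side.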

For a more detailed explanation of how WebCAGe has been constructed and the significance of the four components, please see the EACL paper referenced below.

Download

WebCAGe consists of four major components; two of them are freely available for download here: 

 

 

Reference

If you use WebCAGe in the context of scientific or research work, please cite the following paper:

Verena Henrich, Erhard Hinrichs, and Tatiana Vodolazova: WebCAGe -- A Web-Harvested Corpus Annotated with GermaNet Senses. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), Avignon, France, April 2012, pp. 387-396.

[Download paper: http://aclweb.org/anthology/E/E12/E12-1039.pdf]
 

Contact

Eberhard Karls University of Tübingen
Department of Computational Linguistics
Wilhelmstr. 19
D-72074 Tübingen, Germany
Fax: +49 - 7071 - 29 5214