Seminar für Sprachwissenschaft

Research Data Repository

The data resources of TALAR (Tübingen Archive for Language Resources) at the University of Tübingen comprise corpora of spoken language and written texts, which are annotated at various linguistic levels such as morphology, syntax and semantics. In addition, there are lexical resources that are closely linked to other lexical and textual resources from the Text+ consortium. The corpora and lexical resources are indispensable for data-driven research in both theoretical and computational linguistics. The annotations cover various syntactic frameworks and adhere to the encoding standards widely used in the community as well as those of the International Organization for Standardization (ISO). These resources are housed in the TALAR data repository, which has been certified with the CoreTrustSeal (CTS) and has developed standardized protocols for data ingest from external data resources.

TALAR hosts a collection of widely used syntactically annotated corpora, the so-called TüBa treebanks for German, English and Japanese. In addition, TALAR contains a large number of externally developed treebanks that are part of Universal Dependencies. All linguistically annotated corpora of UniTÜ can be searched and visualized with the web application TüNDRA (Tübingen Annotated Data Retrieval Application) and are also accessible via the CLARIN and Text+ Federated Content Search. In addition to the linguistically annotated corpora, UniTÜ offers data services in the form of vector space word representations and associated software tools. Furthermore, UniTÜ offers software services for the incremental annotation of external text corpora via the virtual research environment WebLicht. Among other things, WebLicht enables the automatic enrichment of text corpora with named-entity recognition based on deep learning, and can thus be used to enrich unstructured data automatically and subsequently link it with authority data and linked open data.

The lexical resources are closely linked to other lexical and textual resources and are interoperable with them. The valency dictionary of German verbs was derived from large text corpora and is therefore linked to corpus data. GermaNet is a lexical database of word meanings for contemporary German nouns, verbs and adjectives, which is directly linked to wordnets of more than fifty languages via an interlingual index. In addition to other wordnets, GermaNet is linked to other digital resources such as Wikipedia and Wiktionary. Taken together, these linked resources provide a solid basis for assessing lexical similarity and dissimilarity. These two concepts are essential for psycho- and neurolinguistic research as well as for topic modeling and semantic search in a broad spectrum of disciplines ranging from computer science to literary studies. Examples include author identification and genre classification as well as semantic search over dictionary data or large metadata collections. Beyond the academic sector, GermaNet is also in great demand for industrial applications. Furthermore, the linking of word senses via semantic relations provides an ideal starting point for converting wordnet data into linked open data formats and for easy integration into knowledge graphs. Accordingly, mapping GermaNet to linked open data and knowledge graphs will add significant value to Text+ and provide a direct data bridge to other NFDI consortia.