Seminar für Sprachwissenschaft

Index Thomisticus Treebank

The Index Thomisticus was begun in 1949 by Father Roberto Busa (1913-2011) and is one of the first large searchable digital corpora. It was a ground-breaking early project in computational linguistics and is credited with producing the first machine-generated concordance, among other firsts in digital corpus development.

The Index contains the 118 works of the Opera Omnia (complete works) of Thomas Aquinas in digital form as well as 61 texts by other authors related to Aquinas or attributed to him, for a total of around 11 million morphologically tagged and lemmatised tokens. It has been available on CD-ROM since 1989, and on the web since 2005 at the Corpus Thomisticum website.

In the early 1970s, Father Busa started to plan a project aimed at both the morphosyntactic disambiguation and syntactic annotation of the Index Thomisticus. Today, both these tasks are being undertaken by the Index Thomisticus Treebank project (IT-Treebank) hosted at the CIRCSE research centre of the Università Cattolica del Sacro Cuore, under the direction of Marco Passarotti.

The Seminar für Sprachwissenschaft now also hosts the IT-Treebank and is in the process of making it available in multiple formats and usable with WebLicht.

Treebank Contents

Currently, the IT-Treebank contains 170,030 lemmatised and morphologically analysed Latin tokens in a total of 9,497 syntactically parsed and tagged sentences, excerpted from three of the works in the Index Thomisticus:

 

  • Scriptum super Sententiis Magistri Petri Lombardi
    (also known as Scriptum super libros Sententiarum and Scriptum super Sententiis)
  • Summa contra Gentiles
  • Summa Theologiæ

     

The IT-Treebank's approach to syntactic markup is based primarily on that of the Prague Dependency Treebank (PDT). It is a dependency treebank, which means that the text has been annotated with labelled connections between words or tokens. Dependency grammars are popular for richly inflected languages with relatively free word order, such as Latin.
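As a rough illustration of what "labelled connections between words" means, the following Python sketch represents one sentence as a list of tokens, each pointing to its governing token. The sentence is invented for this example and is not taken from the treebank; the labels Sb, Pred and Pnom are PDT-style analytical functions used here only for illustration, and the names Token, root and dependents are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int   # 1-based position in the sentence
    form: str    # the word as it appears in the text
    head: int    # index of the governing token; 0 marks the sentence root
    deprel: str  # label of the connection to the head


# Invented example sentence ("God is good"), not drawn from the treebank.
sentence = [
    Token(1, "Deus", 2, "Sb"),     # subject of "est"
    Token(2, "est", 0, "Pred"),    # main predicate, root of the tree
    Token(3, "bonus", 2, "Pnom"),  # nominal predicate attached to the copula
]


def root(tokens):
    """Return the token whose head is the artificial root (0)."""
    return next(t for t in tokens if t.head == 0)


def dependents(tokens, head_index):
    """Return all tokens directly governed by the token at head_index."""
    return [t for t in tokens if t.head == head_index]


if __name__ == "__main__":
    r = root(sentence)
    for t in dependents(sentence, r.index):
        print(f"{t.form} --{t.deprel}--> {r.form}")
```

Each token stores exactly one head, so the connections form a tree over the words of the sentence.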

The IT-Treebank tagset and annotation follow the PDT Annotation Guidelines.

The IT-Treebank is currently distributed only in the CoNLL format; versions in CSTS-SGML (Czech Sentence Tree Structure), PML-XML (Prague Markup Language) and TCF (Text Corpus Format) are to follow soon.
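For readers who want to process the CoNLL distribution, here is a minimal Python sketch of a reader. It assumes the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences; the exact column layout of the IT-Treebank files should be checked against the distribution itself. The function name read_conll and the sample rows are hypothetical and were invented for this example.

```python
def read_conll(lines):
    """Yield sentences as lists of token dicts from CoNLL-X style lines.

    Assumes ten tab-separated columns per token line and a blank line
    between sentences (an assumption, not confirmed for this treebank).
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "head": int(cols[6]),  # 0 marks the sentence root
            "deprel": cols[7],
        })
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Invented sample rows, not taken from the treebank.
SAMPLE = "\n".join([
    "\t".join(["1", "Deus", "deus", "_", "_", "_", "2", "Sb", "_", "_"]),
    "\t".join(["2", "est", "sum", "_", "_", "_", "0", "Pred", "_", "_"]),
    "\t".join(["3", "bonus", "bonus", "_", "_", "_", "2", "Pnom", "_", "_"]),
    "",
    "\t".join(["1", "Homo", "homo", "_", "_", "_", "2", "Sb", "_", "_"]),
    "\t".join(["2", "currit", "curro", "_", "_", "_", "0", "Pred", "_", "_"]),
])

sentences = list(read_conll(SAMPLE.splitlines()))

if __name__ == "__main__":
    for sent in sentences:
        print(" ".join(tok["form"] for tok in sent))
```

To read a real file, an open file object can be passed in place of the sample lines, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to a downloaded copy of the treebank.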

License

The Index Thomisticus Treebank is the work of the Centro Interdisciplinare di Ricerche per la Computerizzazione dei Segni dell’Espressione (CIRCSE) of Università Cattolica del Sacro Cuore. It is available for use under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Download 

Contact

Scott Martens

Eberhard Karls University of Tübingen
Department of Computational Linguistics
Wilhelmstr. 19
D-72074 Tübingen, Germany
Tel: +49-7071-29-73969
Fax: +49-7071-29-75214