Seminar für Sprachwissenschaft

Preservation Policy

This policy describes the Tübingen CLARIN-D repository with respect to its goal of preserving and distributing language-related digital data to researchers in the Humanities and Social Sciences. The repository is operated by the CLARIN-D Center in Tübingen, located at the General and Computational Linguistics Department ("Seminar für Sprachwissenschaft", SfS) of the University of Tübingen, Germany.

This preservation policy is based on the DANS preservation policy version 1.2, July 18, 2017, with grateful acknowledgement.

Introduction and Background

CLARIN is an acronym for "Common Language Resources and Technology Infrastructure". It is a research infrastructure that was initiated from the vision that all digital language resources and tools from all over Europe and beyond are accessible through a single sign-on online environment for the support of researchers in the humanities and social sciences. The CLARIN infrastructure is fully operational in many countries, and a large number of participating centres are offering access services to data, tools and expertise.

In 2012, nine CLARIN member countries created CLARIN-ERIC (European Research Infrastructure Consortium), which is an international legal entity that governs and coordinates CLARIN activities. CLARIN-ERIC members are governments or intergovernmental organisations which pay an annual fee to support the development and maintenance of the CLARIN research infrastructure.

Germany is one of the founding members of CLARIN-ERIC and contributes to CLARIN-ERIC via CLARIN-D. CLARIN-D is an acronym for "Common Language Resources and Technology Infrastructure Deutschland".

The CLARIN-D Resource Center Tübingen is one of currently eight German CLARIN-D Resource and Service Centers, which together form a web- and center-based research infrastructure for the humanities and social sciences. The aim of CLARIN-D and its service centers is to provide language data, tools and services in an integrated, interoperable and scalable infrastructure for researchers in the humanities, social sciences and related disciplines. The research infrastructure is rolled out in close collaboration with expert scholars in the humanities and social sciences, to ensure that it meets the needs of users in a systematic and easily accessible way. The CLARIN-D Resource Center Tübingen is part of the CLARIN-D consortium funded by the German Federal Ministry for Education and Research.

The CLARIN-D Center in Tübingen is a certified Type-B CLARIN Center and adheres to all standards and best practice recommendations of CLARIN. The repository adheres to the recommended practices of the OAIS Reference Model and, as part of the University of Tübingen, it follows the Guidelines for Safeguarding Good Scientific Practice set forth by the university.

Mission

The mission of the Tübingen CLARIN-D Repository is to ensure the availability and long-term preservation of resources in the field of Humanities and Social Sciences, to preserve the knowledge gained in research, to aid the transfer of knowledge into new contexts, and to integrate new methods and resources into university curricula.

This mission is supported by the infrastructure of the University of Tübingen and by the integration of the repository into the national and international CLARIN infrastructures. As part of the CLARIN-D infrastructure, it shares the CLARIN-D mission to provide linguistic data, tools and services in an integrated, interoperable and scalable infrastructure for the Humanities and Social Sciences, and is committed to playing an active role in the development of CLARIN's repository infrastructure.

The CLARIN-D center in Tübingen supports data from the Humanities and Social Sciences with a clear emphasis on language-related material, for disciplines in which language analysis is either the object of research or a research method. This covers data especially from Linguistics, Psycholinguistics, Corpus Linguistics, Syntax, Semantics, Lexicography, etc. but also includes other areas such as literary studies, political sciences, history, etc.

For an overview of the mission and goals of the CLARIN research infrastructure, see the following publication by Erhard Hinrichs (national coordinator of CLARIN-D) and Stephen Krauwer (former executive director of CLARIN-ERIC):

Hinrichs, E.; Krauwer, S. (2014a): The CLARIN Research Infrastructure: Resources and Tools for E-Humanities Scholars. In: N. Calzolari et al. (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). 1525–1531. ELRA, Reykjavík, Iceland.

Scope and Objectives of the Policy

This policy summarizes methods and practices used by the repository to preserve and promote reuse of its datasets. The preservation policy follows best practice guidelines and standards as set forth by CLARIN, OAIS, and the Core Trust Seal (formerly Data Seal of Approval), as well as the guidelines for good academic practice established by the University of Tübingen.

The objective of the Tübingen CLARIN-D repository is to preserve and foster reuse of the data it contains. The responsibilities of the repository cover:

  • Ensuring that datasets are stored in formats that are widely used and accepted by the user community, which may change over time
  • Ensuring the quality and integrity of the datasets
  • Ensuring that the distribution of datasets is carried out in accordance with their licenses
  • Ensuring that datasets are accompanied by complete and accurate metadata, fostering discoverability
  • Ensuring both physical and digital security of the datasets

Requirements and Legal Framework

Repository Requirements

To ensure that the goals of the repository are met, the repository defines a set of requirements which must be followed:

  • Data is deposited using strict ingest procedures
  • Data must be accompanied by sufficient and accurate metadata, according to the metadata standards set forth by CLARIN
  • Data must be stored in formats that are widely used and accepted by the user community, which may change over time
  • Metadata is harvestable using the OAI-PMH protocol
  • Data is assigned a persistent identifier (PID), which can be used to view the metadata, download the data, or otherwise refer to the data

Legal Framework

Neither the CLARIN-D resource center nor the repository run by it is a legal entity on its own. This also holds for the General and Computational Linguistics Department ("Seminar für Sprachwissenschaft", SfS) where they are located. All are part of the University of Tübingen, which is a legal entity: specifically, like all public German universities, a "Körperschaft des öffentlichen Rechts", an institution governed under public law. Hence, the university as an institution is the contractual party in the agreement between the depositor of the data and the repository (referred to as the depositor's agreement).

The depositor's agreement states the rights and obligations of both parties. It confirms that the depositor owns all necessary rights required to deposit the data and that they are in compliance with all relevant national and international legal regulations. Data providers grant the repository permission to distribute the data in accordance with the access model chosen (public, academic, or individual), while retaining all intellectual property rights to their data.

An End User License Agreement (EULA) is an agreement between the depositor and the user. Depositors are required to specify the EULA (e.g. Creative Commons) to be applied to the dataset, and the license information is included in the metadata for the dataset. For some resources (e.g. those with individual access), the user may need to sign such an agreement with the depositor before the repository can give access to the resource. In this case, the individual is provided with a username and password, which allow them to download the dataset.

The repository will not ingest data that has unclear ownership, unresolved rights issues, or other legal issues. If any legal issues arise after the ingest process, access to the digital object will be blocked (including data download and metadata harvesting) until the issue is resolved. 

Roles and Responsibilities

All repository staff assist in implementing this preservation policy as appropriate to their roles and responsibilities.

Depositors do not directly interact with the repository interface, but submit datasets to a data manager.

Data managers create the necessary metadata in collaboration with the depositor, make sure that the data formats follow the long-term preservation requirements, inspect the data with regard to quality and possible legal issues, act as a liaison between the depositor and the university administration in signing the depositor's agreement, and perform the actual ingest of the dataset into the repository.

The repository's technical staff have a deep understanding of the system's components and structure as well as the required processes. Technical staff are responsible for maintaining the accessibility and functionality of the repository, including all hardware and software components. 

Content Coverage

This repository accepts data from the humanities and social sciences with a clear emphasis on language-related material, especially from Linguistics, Psycholinguistics, Corpus Linguistics, Syntax, Semantics, Lexicography, etc. but also other areas such as literary studies, political sciences, history, etc.

Depositors are encouraged to use formats listed in the CLARIN standard recommendations when possible. The list of accepted data formats may be extended to include new, widely used formats in the field. If a data format is removed from the list of acceptable formats in the future, every effort will be made to convert affected datasets into an acceptable format.

Quality checks of data are performed by local experts in the field before it is deposited. In the case where a dataset cannot be reviewed locally, the repository may request assistance from another CLARIN center, seek advice from external advisors, or encourage the depositor to deposit the data at a CLARIN center which is better suited to ingest the dataset.

Currently, the repository does not preserve software that has been used to produce datasets, although depositors are encouraged to document the dataset, including software used in its creation and software that can be used to read or further process the data. 

Implementation of the Preservation Strategy

This section is structured around the main functional concepts of the Open Archival Information System (OAIS) reference model for digital preservation environments. With the use of the Fedora-Commons system and the defined workflow supported by the repository's interface, the repository is compliant with this model. Provisions for the main functional entities described in OAIS are summarized as follows:

Pre-Ingest

Although the pre-ingest function is not officially part of the OAIS model, it has been the experience of well-established archives that including this step in the repository workflow helps to ensure the usability and accessibility of datasets through the improved quality of metadata and documentation.

In the pre-ingest phase, depositors are informed about legal issues and preferred data formats. They are also informed about the archiving process and what materials and/or information they will need to provide (e.g. depositor's agreement and metadata). Local experts inspect data samples provided by the depositor to ascertain their quality and appropriateness for ingestion. The repository may determine that a different CLARIN center would be better suited to evaluate and preserve the dataset. In this case, the repository will refer the depositor to another center.

A depositor’s agreement is signed by both parties which asserts that the depositor owns all necessary rights required to deposit the data, that they are in compliance with all relevant national and international legal regulations, and that they grant the repository permission to distribute the data in accordance with the chosen end user license agreement (EULA) and access model (public, academic, or individual).

Metadata creation may also be part of the pre-ingest phase, but will be finished at the latest in the ingest phase. Depositors, together with the data manager, gather all supplementary data (such as publications) and create extensive CMDI metadata to describe the dataset. 

Ingest

In this phase the data manager acquires the data (SIP in the OAIS model) from the depositor and ensures that the depositor's agreement has been signed, that the metadata is sufficient and accurate, and that the data formats are accepted under the repository policy. The data is subsequently uploaded to the repository following a documented, automatic workflow. Metadata, recorded in a CMDI format (ISO-CD 24622-1) appropriate for the type of resource, is also uploaded as part of the ingest process. The ingestion workflow also requires additional system metadata, namely the type and the access control settings of the newly created resource.
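The checks a data manager performs before upload can be pictured as a simple checklist. The following Python sketch is purely illustrative and is not the repository's actual workflow: the class `SubmissionPackage`, the function `ingest_problems`, and the format allow-list are hypothetical names introduced here, and judging whether metadata is "sufficient and accurate" remains a human task that the sketch reduces to a presence check.

```python
from dataclasses import dataclass, field

# Hypothetical allow-list; the real list follows the CLARIN standard recommendations.
ACCEPTED_FORMATS = {"xml", "txt", "wav", "pdf"}


@dataclass
class SubmissionPackage:
    """Illustrative stand-in for a SIP handed over by a depositor."""
    agreement_signed: bool
    cmdi_metadata: str                      # CMDI record as an XML string
    files: list[str] = field(default_factory=list)


def ingest_problems(sip: SubmissionPackage) -> list[str]:
    """Return the reasons a package is not ready for ingest (empty if ready)."""
    problems = []
    if not sip.agreement_signed:
        problems.append("depositor's agreement not signed")
    if not sip.cmdi_metadata.strip():
        problems.append("metadata missing")
    for name in sip.files:
        # Take the text after the last dot as the extension, case-insensitively.
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if ext not in ACCEPTED_FORMATS:
            problems.append(f"format not accepted: {name}")
    return problems
```

A package would only proceed to upload when `ingest_problems` returns an empty list; otherwise the listed issues go back to the depositor.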

An important part of the creation of the new digital resource is the allocation of a PID, which is used for reliable identification of and access to the resource. This unique PID also becomes a part of the digital object's metadata. The ingest process is concluded by forwarding the PID back to the depositor for future reference.

Both the pre-ingest and the initial stages of the ingest processes are carried out in close collaboration with the depositor.

Archival Storage

The repository uses the standard Fedora Commons archival layer, which implements error checking functions and simplifies data recovery. This layer checks the integrity of the stored data through the use of checksums maintained as part of the system metadata. The built-in version control system ensures that the data is never overwritten but instead new versions are created on every update operation.
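The idea behind checksum-based integrity checking can be shown in a few lines of Python. This is a minimal sketch of the general technique, not Fedora Commons' internal implementation; the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of `data` under the given hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def is_intact(data: bytes, recorded: str, algorithm: str = "sha256") -> bool:
    """Compare a freshly computed digest with the one recorded at ingest."""
    return checksum(data, algorithm) == recorded


# The digest is stored once, at ingest time, alongside the system metadata.
original = b"example corpus content"
recorded_at_ingest = checksum(original)

# Later checks recompute the digest and compare it with the stored value.
unchanged_ok = is_intact(original, recorded_at_ingest)
corrupted_ok = is_intact(original + b"\x00", recorded_at_ingest)
```

Here `unchanged_ok` is true and `corrupted_ok` is false: even a single added byte changes the digest, which is what allows silent corruption to be detected and the affected file to be restored from backup.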

The repository runs on a server housed, managed and maintained by the University of Tübingen's data center (ZDV), which is also responsible for making daily backups of the data and system configurations to a remote location. An additional backup mechanism is also in place, making weekly backups to a different remote location. In case of disaster, recovery will first be attempted through the ZDV backups, and then through the documented recovery procedures of the alternative backup strategy.

The repository status and availability of resources are continually monitored within the CLARIN-D infrastructure. In case of any failure, the repository staff is notified immediately. 

Data Management

The standard Fedora Commons tools, in combination with a customized administrator interface, are used for data management. They can be used to access information about the state of the digital objects and any associated system metadata (e.g. creation dates, access rights, identity of the ingester, etc.), and to perform most standard management tasks. The customized administrator interface also reports any inconsistency in the metadata model, according to preprogrammed rules. The repository includes an RDF store of system metadata and also an audit trail which can be used to inspect past activities.

New versions of a resource can be ingested either: 

  • as an additional data stream in the resource's data object, in which case the PID of the resource will refer to all the versions, with a part identifier distinguishing between versions;
  • or as a completely new data object with a new PID.

External discovery of resources is supported by freely disseminating the metadata via the OAI-PMH protocol, which also supports selective harvesting. Both the OAI-PMH supplied metadata and the Fedora Commons tools are used to identify and fix discoverability issues.
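Selective harvesting in OAI-PMH works by adding optional `from`, `until` and `set` arguments to a `ListRecords` request. The sketch below only builds such a request URL and makes no network call. The base URL is a placeholder, and the metadata prefix under which the repository exposes CMDI records is an assumption; the protocol itself only guarantees that `oai_dc` is available.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the repository's actual OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(metadata_prefix, from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request, optionally narrowed by date or set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # lower bound on the record datestamp
    if until_date:
        params["until"] = until_date    # upper bound on the record datestamp
    if set_spec:
        params["set"] = set_spec        # restrict to one set of records
    return f"{BASE_URL}?{urlencode(params)}"


# Harvest only records changed since the start of 2020 ("cmdi" is an assumed prefix).
url = list_records_url("cmdi", from_date="2020-01-01")
```

A harvester would fetch this URL and follow any resumption token returned in the response until the list is complete.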

The policy of the repository is to never delete datasets, but in extreme cases the repository may be legally required to do so.

Administration

All necessary day-to-day administrative tasks are performed by the repository staff, such as:

  • submission procedures
  • system engineering
  • system monitoring
  • review and update of policies and procedures
  • depositor and end-user support
  • etc.

Preservation Planning

Through regular participation in CLARIN-D activities, the repository is informed of new developments in repository technologies. When new platforms become available they are evaluated in terms of: technology; use of open formats; long-term preservation capabilities; and compatibility with established workflows. Test migrations to these new platforms are also conducted.

The open format used by Fedora Commons for its internal serialization enables the long-term accessibility of the data.

The repository requires the use of specific file formats as recommended by CLARIN. The preferred file formats may change over time; when they do, the repository will make every effort to migrate datasets to the new formats, while keeping the originals intact for reproducibility purposes.

Access

The digital objects are available for read access via their PID for authorized users, based on the AAI infrastructure of the CLARIN Service Provider Federation and local user management. The PIDs are recorded in the metadata, which can be harvested via OAI-PMH. One of the more important metadata harvesters is the CLARIN VLO (Virtual Language Observatory), which provides faceted search facilities for all CLARIN repositories. End users can access the data over HTTP, through the PID displayed by the search engines.

The metadata of a resource is always public. However, the datasets are subject to the access rights as set by the depositor in the ingest phase, which can be one of: 

  • public: unrestricted access 
  • academic: access granted to users authenticated by their academic institution
  • individual: after having signed a written license agreement, individuals are provided with access credentials
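The three access levels listed above amount to a small decision rule, sketched here in Python. The names `AccessLevel` and `may_download` and the two boolean flags are hypothetical; in the actual repository these decisions are made by the AAI infrastructure and the local user management, not by code of this shape.

```python
from enum import Enum


class AccessLevel(Enum):
    PUBLIC = "public"
    ACADEMIC = "academic"
    INDIVIDUAL = "individual"


def may_download(level, *, academic_login=False, signed_license=False):
    """Decide whether a user may download a dataset with the given access level."""
    if level is AccessLevel.PUBLIC:
        return True                 # unrestricted access
    if level is AccessLevel.ACADEMIC:
        return academic_login       # authenticated by an academic institution
    return signed_license           # individual: written license agreement required


def may_view_metadata(level):
    """Metadata is always public, regardless of the dataset's access level."""
    return True
```

For example, `may_download(AccessLevel.ACADEMIC, academic_login=True)` grants access, while an academic login alone does not open a dataset with individual access.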

Integrity and Security

An audit trail is automatically maintained by the repository for all operations on a dataset. The repository verifies the integrity of the stored data through the use of checksums maintained as part of the system metadata. The built-in version control system ensures that the data is never overwritten but instead new versions are created on every update operation.
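The combination of an audit trail with never-overwritten versions can be illustrated with an append-only data structure. This is a conceptual sketch only: the class `VersionedObject` and its fields are invented for illustration and do not reflect how Fedora Commons stores versions or audit records internally.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class VersionedObject:
    """Illustrative append-only object: updates add versions, never replace them."""
    pid: str
    versions: list[bytes] = field(default_factory=list)
    audit_trail: list[tuple[str, str, int]] = field(default_factory=list)

    def update(self, content: bytes, actor: str) -> int:
        """Store `content` as a new version and log who did it; return its index."""
        self.versions.append(content)
        index = len(self.versions) - 1
        timestamp = datetime.now(timezone.utc).isoformat()
        self.audit_trail.append((timestamp, actor, index))
        return index

    def latest(self) -> bytes:
        """Return the most recent version; earlier ones remain in `versions`."""
        return self.versions[-1]


obj = VersionedObject(pid="hypothetical-pid")
first = obj.update(b"first release", actor="data-manager")
second = obj.update(b"corrected release", actor="data-manager")
```

After both updates the first release is still retrievable as `obj.versions[0]`, and `obj.audit_trail` holds one timestamped entry per operation.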

The repository is committed to taking all necessary precautions to ensure the safety and security of the data it preserves. The servers are managed and maintained by the University of Tübingen's data center (ZDV), which takes all necessary precautions to ensure the security and operability of hardware and system software. This includes operating system updates and firewall management, as well as the physical security of the servers.

Sustainability Plans and Funding

CLARIN-D, and therefore the repository, is funded by the Bundesministerium für Bildung und Forschung (BMBF) with project-based funding for terms of four years. Additionally, the repository is funded by the University of Tübingen in conjunction with the Seminar für Sprachwissenschaft (SfS).

All CLARIN centers commit to ensuring the long-term availability and preservation of, and access to, datasets submitted to their repositories, as set out in their mission statements. CLARIN centers are set up as a distributed network, where each centre institution is a hub of the digital humanities and brings its own financial resources into CLARIN-D, which ensures continued availability. In the case of a withdrawal of funding, the repository's content would be transferred to another CLARIN centre. The legal aspects of relocating data to another institution are addressed by templates of license agreements provided in CLARIN.

The repository is currently in the process of developing its future strategy, allowing it to guarantee preservation periods to data depositors. Discussions with the BMBF, the state of Baden-Württemberg, and the University of Tübingen are ongoing.