The long-term preservation strategy of The Language Archive consists of two parts: data replication, which makes it more likely that the bit-streams will survive in the long run, and a restriction of the accepted archival formats, which makes future conversions more feasible should one of the formats become obsolete.
In order to prevent data loss in case of technical failures or force majeure, all data in The Language Archive is replicated to a number of different locations. The main archival copy resides at the MPI for Psycholinguistics in Nijmegen and is backed up twice on tape. The MPI uses a Hierarchical Storage Management (HSM) system from Versity Software Inc. (based on SAM-QFS) that automatically stores one copy on hard disk arrays and two copies on LTO-7 magnetic tape. This system keeps an MD5 checksum of each archived file in the inode filesystem metadata, which is used to verify the integrity of the bit-streams. An additional backup is stored at the Max Planck Computing and Data Facility (MPCDF, formerly known as RZG) in Garching near Munich, one of the data centers of the Max Planck Society. Another backup is stored at the Göttingen Society for Scientific Data Processing (GWDG) in Göttingen, which is another data center of the Max Planck Society. In total, this means that there are at least 5 copies of each file in 3 geographically distinct locations. The bit-stream preservation of the copies that reside at the data centers of the Max Planck Society is guaranteed for 50 years.
The data replication from MPI Nijmegen to MPCDF Garching is done with iRODS, and the replication from MPI Nijmegen to GWDG Göttingen is done directly from the Versity HSM software to an S3-compatible storage service.
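The checksum-based integrity check described above can be illustrated with a minimal Python sketch. It is not the archive's actual tooling: the function names `md5_of_file` and `is_intact` are hypothetical, and the sketch assumes the stored checksum is available as a hexadecimal string rather than being read from filesystem metadata.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large media files do not have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected_md5: str) -> bool:
    """Compare a freshly computed digest with a previously stored one.

    `expected_md5` is assumed to be a hex string recorded at ingest time.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

A replica whose digest no longer matches the stored value would be treated as damaged and restored from one of the other copies.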
Archival file formats
The Language Archive only archives a limited set of file formats. These formats are chosen according to the following criteria, which may sometimes conflict with one another:
- openness of the format and/or availability of full specifications
- established standards or de facto standards within the research domain
- assessment of the longevity of the format
- no lossy compression if feasible
- no binary formats if feasible
- textual data in XML formats and Unicode UTF-8 encoding if feasible
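As an illustration of the last criterion, the following Python sketch checks whether a byte string is valid UTF-8 and well-formed XML. It is only an assumption about how such a check could look, not the archive's ingest procedure; the function name `check_utf8_xml` is hypothetical, and the sketch tests well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_utf8_xml(data: bytes) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems: List[str] = []
    # Step 1: the raw bytes must decode as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append(f"not valid UTF-8: {err}")
        return problems
    # Step 2: the document must parse as well-formed XML.
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        problems.append(f"not well-formed XML: {err}")
    return problems
```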
The monitoring of these archival formats in terms of their usage and suitability is an ongoing process. Important sources of information consulted for this purpose include the Library of Congress Sustainability of Digital Formats site and the National Archives current format summary.
In addition to the list of accepted long-term preservation formats, the archive also accepts a number of formats for which it can only guarantee bit-stream preservation.
A list of currently accepted archival formats can be found here.
[last updated November 2021]