Accepted file types and formats

The tables below list the file types and formats accepted by The Language Archive at the MPI for Psycholinguistics. We distinguish two groups. For long-term preservation formats, we commit to preserving the content, including migrating files to newer formats if the original format becomes obsolete. For the remaining formats, we guarantee only bit-stream preservation, that is, preservation of the files themselves, for a minimum of 10 years, and we cannot guarantee migration to future formats.

The listed file extensions are obligatory for deposited files; files will be rejected if their extensions differ from what is specified in the tables.
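
Depositors may want to check extensions before uploading. The following is a minimal sketch, not the archive's own validation: the names `LONG_TERM_EXTENSIONS` and `has_accepted_extension` are hypothetical, the set is copied from the long-term preservation tables in this document, and case-sensitive matching is an assumption based on the tables listing `.Rmd` and `.rmd` separately.

```python
from pathlib import Path

# Extensions copied from the long-term preservation tables in this document.
LONG_TERM_EXTENSIONS = {
    ".wav", ".mpg", ".mpeg", ".mp4", ".jpg", ".png", ".tiff", ".svg",
    ".eaf", ".pfsx", ".cha", ".tbt", ".TextGrid", ".txt", ".html", ".pdf",
    ".odt", ".tbx", ".cut", ".typ", ".lng", ".set", ".xml", ".xsd", ".kml",
    ".ods", ".odp", ".csv", ".R", ".Rmd", ".rmd",
}


def has_accepted_extension(filename, accepted=LONG_TERM_EXTENSIONS):
    """Return True if the file's final extension is in the accepted set.

    Assumption: matching is case-sensitive.
    """
    return Path(filename).suffix in accepted
```

Under these assumptions, `has_accepted_extension("session01.eaf")` is true, while `has_accepted_extension("session01.wave")` is false.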

Long-term Preservation Formats

Media resources

| Type | Format | Assigned MIME | Standard MIME | Extension | Remarks |
| --- | --- | --- | --- | --- | --- |
| Audio | WAV | audio/x-wav | audio/x-wav | .wav | 16/24 bit, 44.1/48 kHz, uncompressed PCM, 1 or 2 tracks. |
| Video | MPEG1 | video/x-mpeg1 | video/mpeg | .mpg | Video stream bitrate min. 300 kbit/s, resolution min. 352x240. Max. 2 audio tracks. |
| Video | MPEG2 | video/x-mpeg2 | video/mpeg | .mpeg | MPEG2 Program Stream, video stream bitrate min. 3 Mbit/s, resolution min. 640x480. Max. 2 audio tracks. |
| Video | MPEG4 | video/mp4 | video/mp4 | .mp4 | Video codec: H.264, audio codec: AAC. Video stream bitrate min. 300 kbit/s, resolution min. 352x240. Max. 2 audio tracks. |
| Image | JPEG | image/jpeg | image/jpeg | .jpg | |
| Image | PNG | image/png | image/png | .png | |
| Image | TIFF | image/tiff | image/tiff | .tiff | |
| Image | SVG | image/svg+xml | image/svg+xml | .svg | |
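
The WAV requirements in the Remarks column can be checked before deposit. The following is a minimal sketch using Python's standard `wave` module. The function name `wav_problems` is hypothetical, and reading "tracks" as audio channels is an assumption rather than something this page states.

```python
import wave

# Limits taken from the Remarks column for WAV.
ALLOWED_SAMPLE_WIDTHS = {2, 3}         # bytes per sample, i.e. 16 or 24 bit
ALLOWED_SAMPLE_RATES = {44100, 48000}  # Hz
ALLOWED_CHANNELS = {1, 2}              # assumption: "tracks" means channels


def wav_problems(path):
    """Return a list of problems found; an empty list means none were found."""
    try:
        with wave.open(path, "rb") as wav:
            width = wav.getsampwidth()
            rate = wav.getframerate()
            channels = wav.getnchannels()
    except (wave.Error, EOFError) as exc:
        # The wave module only reads uncompressed PCM data.
        return [f"not readable as PCM WAV: {exc}"]

    problems = []
    if width not in ALLOWED_SAMPLE_WIDTHS:
        problems.append(f"sample width is {width * 8} bit")
    if rate not in ALLOWED_SAMPLE_RATES:
        problems.append(f"sample rate is {rate} Hz")
    if channels not in ALLOWED_CHANNELS:
        problems.append(f"{channels} channels")
    return problems
```

A file that the `wave` module cannot open is not necessarily invalid; some header variants may be rejected depending on the Python version, so a reported read failure is worth confirming with another tool.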

 

 

Textual resources

| Type | Format | Assigned MIME | Standard MIME | Extension | Remarks |
| --- | --- | --- | --- | --- | --- |
| Structured Annotation | EAF | text/x-eaf+xml | text/xml | .eaf | ELAN annotation file format. |
| Structured Annotation | PFSX | text/x-pfsx+xml | text/xml | .pfsx | ELAN settings file for a specific annotation file. Not strictly necessary to preserve, but can be archived along with the EAF for convenience. |
| Structured Annotation | CHAT | text/x-chat | text/plain | .cha | CHILDES/CLAN text format. Use UTF-8 whenever possible. |
| Structured Annotation | Toolbox Text | text/x-toolbox-text | text/plain | .tbt | Use UTF-8 whenever possible. |
| Structured Annotation | Praat TextGrid | text/praat-textgrid | text/praat-textgrid | .TextGrid | Praat TextGrid annotation file (only the plain text variant is accepted, not binary). Use UTF-8 character encoding, not UTF-16. |
| Unstructured Annotation | Plain text | text/plain | text/plain | .txt | ASCII or UTF-8 character encoding required. |
| Unstructured Annotation | HTML | text/html | text/html | .html | ASCII or UTF-8 character encoding required. |
| Unstructured Annotation | PDF | application/pdf | application/pdf | .pdf | Embed non-standard fonts. |
| Primary Text | Plain text | text/plain | text/plain | .txt | ASCII or UTF-8 character encoding required. |
| Primary Text | HTML | text/html | text/html | .html | ASCII or UTF-8 character encoding required. |
| Primary Text | PDF | application/pdf | application/pdf | .pdf | Embed non-standard fonts. |
| Primary Text | ODT | application/vnd.oasis.opendocument.text | application/vnd.oasis.opendocument.text | .odt | Open Document Text. |
| Lexicon | Toolbox Lexicon | text/x-toolbox-lexicon | text/plain | .tbx | Use UTF-8 whenever possible. |
| Lexicon | CHAT lexicon | text/x-cut | text/plain | .cut | Use UTF-8 whenever possible. |
| Lexicon | Plain text | text/plain | text/plain | .txt | ASCII or UTF-8 character encoding required. |
| Lexicon | HTML | text/html | text/html | .html | ASCII or UTF-8 character encoding required. |
| Other | Toolbox type | text/x-toolbox-type | text/plain | .typ | Toolbox type file. |
| Other | Toolbox language | text/x-toolbox-language | text/plain | .lng | Toolbox language file. |
| Other | Toolbox sort order | text/x-toolbox-sortorder | text/plain | .set | Toolbox sort order file. |
| Other | XML | text/xml | text/xml | .xml | Generic XML file. Provide an XML schema for non-standardised formats. |
| Other | Schema | text/xml | text/xml | .xsd | XML Schema file. |
| Other | KML | application/vnd.google-earth.kml+xml | application/vnd.google-earth.kml+xml | .kml | Google Earth KML GIS format. |
| Other | ODS | application/vnd.oasis.opendocument.spreadsheet | application/vnd.oasis.opendocument.spreadsheet | .ods | Open Document Spreadsheet. |
| Other | ODP | application/vnd.oasis.opendocument.presentation | application/vnd.oasis.opendocument.presentation | .odp | Open Document Presentation. |
| Other | CSV | text/csv | text/csv | .csv | Comma Separated Values file. ASCII or UTF-8 character encoding required. |
| Other | R script | text/x-R | text/x-R | .R | ASCII or UTF-8 character encoding required. Preserved as text; compatibility with future R versions cannot be guaranteed. |
| Other | R markdown | text/x-r-markdown | text/markdown | .Rmd, .rmd | ASCII or UTF-8 character encoding required. |
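
Many of the textual formats require ASCII or UTF-8, and Praat TextGrid files must not be UTF-16. As a minimal sketch of how a depositor might check this locally, the hypothetical helpers below test raw file bytes. The UTF-16 check is only a heuristic that looks for a byte order mark, so UTF-16 files without one will not be detected by it.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8.

    ASCII is a subset of UTF-8, so ASCII-only content also passes.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def looks_like_utf16(data: bytes) -> bool:
    """Heuristic: the data starts with a UTF-16 byte order mark."""
    return data.startswith((b"\xff\xfe", b"\xfe\xff"))
```

Typical use would be to read a file in binary mode, for example `data = open(path, "rb").read()`, and pass the result to both functions.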

 

Medium-term bit-stream preservation formats

| Type | Format | Assigned MIME | File Extension(s) | Comment |
| --- | --- | --- | --- | --- |
| Binary | NeuroScan image | application/x-neuroscan-img | .img | Needs accompanying .hdr file. |
| Binary | NeuroScan image header | application/x-neuroscan-img-hdr | .hdr | Needed for raw NeuroScan image data. |
| Binary | Brain Vision EEG | application/x-brainvision-data | .eeg, .seg | Needs accompanying .vhdr and .vmkr files to open. |
| Binary | SPSS data | application/spss-sav | .sav | |
| Binary | SPSS result view | application/x-spss-spv | .spv | |
| Binary | NeuroScan History file | application/x-ehst | .ehst | |
| Binary | NeuroScan ehtp file | application/x-ehtp | .ehtp | |
| Binary | MATLAB data file | application/x-matlab-data | .mat, .fig | |
| Binary | DICOM file | application/dicom | .IMA, .ima, .dcm | |
| Binary | BAM file | application/x-bam | .bam | Compressed Sequence Alignment Map, gzip compatible. |
| Binary | BAI file | application/x-bai | .bai | Index file for a BAM Sequence Alignment Map. |
| “Text” | Brain Vision Header File | text/x-brainvision-header | .vhdr | Needed for opening raw Brain Vision EEG data. |
| “Text” | Brain Vision Marker File | text/x-brainvision-marker | .vmkr | Needed for opening raw Brain Vision EEG data. |
| “Text” | NeuroScan History info file | text/x-neuroscan-hfinf | .hfinf | |
| “Text” | MATLAB script | text/x-matlab | .mat | |
| “Text” | Presentation Script | text/x-presentation-script | .pcl, .sce | Neurobehavioral Systems Presentation script. |
| “Text” | Presentation Experiment Settings | text/x-presentation-settings | .exp | Neurobehavioral Systems Presentation Experiment Settings. |
| “Text” | Praat Pitch | text/praat-pitch | .Praat, .praat | Praat Pitch data text file (only the plain text variant is accepted, not binary). |
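
Several binary formats in this table are only usable together with companion files. The sketch below shows one way to spot missing companions in a list of file names. The names `REQUIRED_COMPANIONS` and `missing_companions` are hypothetical, the extensions are copied from the table as written, and the assumption that a companion shares the data file's base name is not stated on this page.

```python
from pathlib import Path

# Companion extensions as listed in the Comment column of the table.
REQUIRED_COMPANIONS = {
    ".img": [".hdr"],
    ".eeg": [".vhdr", ".vmkr"],
    ".seg": [".vhdr", ".vmkr"],
}


def missing_companions(filenames):
    """Map each data file to the companion files that are absent.

    Assumption: a companion has the same base name as the data file.
    """
    present = {str(Path(name)) for name in filenames}
    missing = {}
    for name in filenames:
        path = Path(name)
        needed = REQUIRED_COMPANIONS.get(path.suffix, [])
        absent = [
            str(path.with_suffix(ext))
            for ext in needed
            if str(path.with_suffix(ext)) not in present
        ]
        if absent:
            missing[name] = absent
    return missing
```

For example, a bundle containing only `s1.eeg` and `s1.vhdr` would be reported as missing `s1.vmkr`.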

 

The archive accepts ZIP/GZIP files for certain types of collections, but only for the purpose of packaging large numbers of files that belong to one “bundle” and are in accepted formats as specified in the tables above. ZIP/GZIP files are generally not accepted for language corpora.
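
Because packaged files must themselves be in accepted formats, it can help to inspect a ZIP file before deposit. The following is a minimal sketch using Python's standard `zipfile` module. The function name `unaccepted_members` is hypothetical, and the caller supplies the set of accepted extensions taken from the tables in this document.

```python
import zipfile
from pathlib import PurePosixPath


def unaccepted_members(zip_path, accepted_extensions):
    """List files inside the ZIP whose extension is not in the accepted set."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if not info.is_dir()
            and PurePosixPath(info.filename).suffix not in accepted_extensions
        ]
```

An empty result means every packaged file has an extension from the supplied set; it does not check that the file contents match the format requirements.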