Private file information in public .eaf files

I am using ELAN with a small team of transcribers, each on different computers with different operating systems, who need to edit each other’s transcriptions.

Also, I would like to publish their .eaf files in a public repository as a corpus. However, I noticed that the .eaf files contain computer-specific file path information for the associated audio/video file in the header:

<MEDIA_DESCRIPTOR MEDIA_URL="file:///transcriber_name/path/to/video/file.mp4" MIME_TYPE="video/mp4" RELATIVE_MEDIA_URL="./file.mp4"/>

I want to remove the file path information from the .eaf files, since it could expose details of the transcribers’ computer setups, and the info is not useful to the public user anyway. However, I noticed that if I simply delete this line, or anonymize MEDIA_URL with a gibberish path, the next transcriber is prompted to locate the media file, and that line gets overwritten with another path anyway.

Is there a way I could effectively remove or anonymize this information? Or get ELAN to stop storing it? I’m grateful for any ideas! Thank you.

Hello,

There is currently no way (e.g. with a preference setting) to tell ELAN not to write the MEDIA_URL and/or the RELATIVE_MEDIA_URL. We can add that to the wish list (if it isn’t already registered).

Based on your description it sounds like removing the path (e.g. by setting it to the empty string or to the same value as the relative path) is something that can best be done after the transcribing phase and before publication in a public repository. Maybe this can be achieved with an XSLT script, or by using/modifying a Python library like pympi?

-Han
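
For anyone who wants to try the post-processing route suggested above, here is a minimal sketch in Python using only the standard library (xml.etree.ElementTree), not pympi or XSLT. The function name anonymize_media_paths and the file names in the usage note are made up for illustration. It assumes each MEDIA_DESCRIPTOR sits directly under HEADER, as in the example line from the question, and it sets MEDIA_URL to the value of RELATIVE_MEDIA_URL (or to the empty string if there is no relative path). I have not verified how every ELAN version treats such a file when it is opened again, so treat this as a starting point.

```python
import xml.etree.ElementTree as ET


def anonymize_media_paths(src_path, dst_path):
    """Copy an .eaf file, replacing each absolute MEDIA_URL.

    Hypothetical helper: MEDIA_URL is set to the RELATIVE_MEDIA_URL value,
    or to "" when no relative path is stored. Returns the number of
    MEDIA_DESCRIPTOR elements that were rewritten.
    """
    tree = ET.parse(src_path)
    header = tree.getroot().find("HEADER")

    changed = 0
    if header is not None:
        for descriptor in header.findall("MEDIA_DESCRIPTOR"):
            relative = descriptor.get("RELATIVE_MEDIA_URL", "")
            descriptor.set("MEDIA_URL", relative)
            changed += 1

    # Note: ElementTree's default parser drops XML comments, so any
    # comments in the original file will not appear in the copy.
    tree.write(dst_path, encoding="UTF-8", xml_declaration=True)
    return changed
```

Usage would be something like anonymize_media_paths("session01.eaf", "public/session01.eaf"), run on copies just before publication. Since ELAN rewrites the path the next time a transcriber opens and saves the file, the working copies are best left untouched. It is also worth checking that RELATIVE_MEDIA_URL itself does not contain folder names you would rather not publish.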
