Converting CLAN files to ELAN files

sophie-h-e · 4 November 2013 18:50

I am in the process of trying to figure out the best way to convert a large, transcribed corpus into a more up-to-date format and was wondering if anyone had any pointers…

The files are in the CLAN .cha (CHAT) format and are time-aligned, audio-linked files. The audio files were originally in .aiff and are now also available as .wav files.
So far I have tried a number of ways to automate this, without much success:

1. I imported the files into ELAN using the import function. This worked insofar as, once they are imported, I can see the files and play the corresponding audio. However, when I try to save them in the ELAN .eaf format I get the following error message:
“Unable to save this file The character “ is an invalid xml character”
If I ignore this and save the file anyway, it is empty when I re-open it.
2. I tried exporting the files as .TextGrid files to see whether they could then be read by Praat, and then possibly re-imported as TextGrids and saved in the ELAN format. Praat could not read the TextGrid, and although ELAN imported the file, it was incomplete (lots of the turns were missing) and it was not linked to the audio.
3. Finally, another option was to export the text files and then manually go back through and time-stamp them. This method does work but will be extremely time-consuming, so if possible I would like to use a more efficient method.

Any advice gratefully received!
Sophie Holmes-Elliott

hasloe · 6 November 2013 21:23

The advised way to convert to .eaf is to use the CLAN command chat2elan. That would be my first suggestion.

If that does not work, I think the only approach worth trying is your approach 1, importing the files into ELAN. The invalid xml character message in most cases indicates that something went wrong during import: the import function is fairly old and limited, and hasn’t been updated regularly. Chat files use “Bullet” characters to separate time information from annotation values, and in some cases these are not detected correctly.
We’ll try to update our importer such that the invalid characters are removed, so that the result can be saved as .eaf (regardless of import errors and mis-interpretation of the data).
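
In the meantime, here is a rough sketch of how the offending characters could be located or stripped in a pre-processing step before importing. This is not ELAN's importer code, only an illustration. It relies on the character ranges permitted by the XML 1.0 specification. The function names are made up for this example, and the sample line assumes the bullet is stored as the control character U+0015, which may not hold for every Chat variant.

```python
import re

# Matches any character that is NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (line number, column, code point) for each XML-invalid character.

    Hypothetical helper for diagnosis; line and column are 1-based.
    """
    hits = []
    # Split on "\n" only: str.splitlines() would also split on (and drop)
    # some of the control characters we are looking for.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in _INVALID_XML_CHARS.finditer(line):
            code_point = f"U+{ord(match.group()):04X}"
            hits.append((line_no, match.start() + 1, code_point))
    return hits


def strip_invalid_xml_chars(text):
    """Return a copy of text with all XML-invalid characters removed."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Assumed example line: the bullet is taken to be U+0015 here.
    sample = "*CHI:\thello .\x150_1200\x15\n"
    for line_no, column, code_point in find_invalid_xml_chars(sample):
        print(f"line {line_no}, column {column}: {code_point}")
    print(repr(strip_invalid_xml_chars(sample)))
```

Note that stripping only makes the text safe to write as XML. Any time codes that were enclosed by the removed characters stay behind as plain text in the line, so listing the hits first is probably the more useful step for checking where the import went wrong.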

-Han

system · 19 November 2013 11:36

This topic is also relevant for Trova annotation content search, for the Annex viewer, for CQL search and for mfsearch. The fast CHILDES CLAN Chat file reader used by those tools understands several Chat variants. It can do “sloppy” processing to still get most of the content from files with exotic syntax. However, I would be happy to receive information about the intended interpretation of syntax variants so that parsing can be improved in the future. If you want, you can send me some example files or documentation about your specific syntax.

Thanks! Eric Auer