Automatically detecting the internal language composition of an utterance in a bilingual speech corpus

Marco · 18 November 2023 11:20

Hello,

I am annotating a small bilingual speech corpus in which each token is assigned a language label. I’d like to obtain the composition by language of each utterance in my corpus, based on word tokens. (Each utterance is transcribed and annotated within a specific block on the relevant tier, based on speech pauses.) Is there a way to do this automatically in ELAN, or should I write a dedicated script for it?

Thank you in advance. Best,

Marco

hasloe · 23 November 2023 09:49

Hello Marco,

Just to be sure: assigning the language labels to word tokens is done manually (or is that the part that needs to be automated?) and obtaining the language composition of each utterance is the part your question is about?
If that is correct, it depends on what the composition by language should look like. Could it simply be, e.g., a comma-separated list of the language labels as they are assigned to the word tokens of each utterance? In that case one of the export functions could perhaps be used, but it would be good to understand what the expected result is.
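
To make the question about the expected result concrete, here is a minimal Python sketch of two possible outputs per utterance: the comma-separated label sequence, and label counts with proportions. It assumes the token labels have already been brought into the form of (utterance id, language label) pairs in corpus order; how that is obtained from ELAN is left open. The function name, the utterance ids and the labels "EN" and "IT" are all hypothetical and only serve as illustration.

```python
from collections import Counter


def language_composition(rows):
    """Summarise language labels per utterance.

    rows: iterable of (utterance_id, language_label) pairs, one per word
    token, in corpus order. This input shape is an assumption, not an
    ELAN format.
    """
    # Group the labels by utterance, keeping token order.
    labels_by_utt = {}
    for utt_id, label in rows:
        labels_by_utt.setdefault(utt_id, []).append(label)

    result = {}
    for utt_id, labels in labels_by_utt.items():
        counts = Counter(labels)
        total = len(labels)
        result[utt_id] = {
            # Option 1: the labels as a comma-separated list.
            "sequence": ",".join(labels),
            # Option 2: token counts and proportions per language.
            "counts": dict(counts),
            "proportions": {lang: n / total for lang, n in counts.items()},
        }
    return result


# Hypothetical example data: two utterances, labels "EN" and "IT".
rows = [
    ("u1", "EN"), ("u1", "EN"), ("u1", "IT"),
    ("u2", "IT"), ("u2", "IT"),
]
comp = language_composition(rows)
for utt_id, info in comp.items():
    print(utt_id, info["sequence"], info["counts"])
```

Which of the two forms (or something else entirely) is needed would determine whether an export is enough or a script like this is the better route.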

Best,
Han