As the title suggests, I have some eaf files in which splitting an existing annotation alters the TIME_VALUE attribute of other time slots. I’m attaching screenshots of the TIME_ORDER in the XML after splitting annotations.
Validating the eafs with ELAN doesn’t show any errors. It happens on Linux and Windows and in different versions of ELAN (5.8, 5.9, 6.2, 6.3), but it only affects certain files, so I think the cause is probably in the eaf itself. However, I can’t see any differences when visually comparing the XML of files where splitting is a problem with files where it isn’t.
It indeed seems there is something wrong in the eaf, but based on the example alone, it is difficult to tell what. If you can send one of those files to me, I can have a look and e.g. repeat the error in debug mode to see where things go wrong.
Concerning debug mode:
A user could only run the application in debug mode by downloading the source code and building and running it in a development environment that allows setting breakpoints. Another, more realistic, option would be for the user to set the level of granularity of logging messages. This is currently only possible in the preferences window for some media player frameworks, but not for the application itself. At the moment that wouldn’t have much effect, because in most of the code messages are logged without specifying a level of severity.
For the problem at hand it wouldn’t make a difference, as I will explain in the follow-up post.
Now concerning the problem occurring when splitting annotations:
Looking at the file you sent me made clear that the problem is caused by invalid time slot references of annotations on the (top level) transcription tier (name=orthographic_transcription). Please have a look at the following screenshot:
I highlighted the error: TIME_SLOT_REF2 of the first annotation is the same as TIME_SLOT_REF1 of the next annotation and this is not how ELAN expects it on a top level tier (this should only occur on Time Subdivision tiers). ELAN shouldn’t create and write such structures.
When a file is loaded in ELAN there is a check on invalid sharing of time slots between annotations on unrelated tiers, but not (yet) on invalid sharing on the same tier.
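The same-tier check described above can be approximated outside ELAN. The following is a minimal sketch, not ELAN's own validation: it assumes the usual EAF layout (TIER elements containing ALIGNABLE_ANNOTATION elements with TIME_SLOT_REF1 and TIME_SLOT_REF2 attributes) and treats a tier without a PARENT_REF attribute as top level. The function name `find_shared_time_slots` and the miniature sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up miniature EAF that reproduces the error: a1 ends on ts2
# and a2 starts on the very same time slot ts2.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3000"/>
  </TIME_ORDER>
  <TIER TIER_ID="orthographic_transcription" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def find_shared_time_slots(root):
    """Return {tier_id: {slot_id: [annotation_ids]}} for every time slot
    referenced more than once on a top-level tier."""
    problems = {}
    for tier in root.iter("TIER"):
        # Assumption: tiers with a PARENT_REF are dependent tiers and skipped.
        if tier.get("PARENT_REF") is not None:
            continue
        users = defaultdict(list)
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            for attr in ("TIME_SLOT_REF1", "TIME_SLOT_REF2"):
                users[ann.get(attr)].append(ann.get("ANNOTATION_ID"))
        shared = {slot: ids for slot, ids in users.items() if len(ids) > 1}
        if shared:
            problems[tier.get("TIER_ID")] = shared
    return problems


root = ET.fromstring(SAMPLE_EAF)
problems = find_shared_time_slots(root)
for tier_id, shared in problems.items():
    for slot_id, annotation_ids in shared.items():
        print(f"{tier_id}: {slot_id} is shared by {', '.join(annotation_ids)}")
```

For a real file, the root would come from `ET.parse(path).getroot()` instead of the sample string.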
In my previous post I forgot to mention the option File->Validate EAF File... which does a bit more than just XML validation. It doesn’t detect this particular error, but shows a few other errors related to the time slots, see next screenshot:
I have no idea how the structure in your file could have arisen, especially since the error is consistent for the whole tier.
If this problem isn’t on our list of issues yet, I’ll add it; it would be good if at least the Validate EAF File option would signal this as problematic.
Just for clarity: if I understand correctly, no time slots should have the same TIME_SLOT_REF value on the same tier, i.e., the two highlighted parts of the XML should be “ts2” and “ts3”, even if the values are (roughly) the same. In this case, they are exactly the same, but maybe that’s also a problem: does Elan consider this an overlap?
“ELAN shouldn’t create and write such structures.”
It didn’t. The example .eaf I sent you was generated by a script (with 4000+ audio files it would be a nightmare to do it manually). I’ve used this strategy before without any issue, but I apparently forgot some critical details about what Elan expects of its XML. If I understand the problem correctly, it will be a straightforward fix.
The annotations that generated Validation errors are meant to be deleted anyway. In most of our data, these annotations were deleted from the XML before Elan ever saw it, but (for reasons I don’t need to go into) a subset of files couldn’t be subject to that process. All our data already has associated .eaf files, but if I take this strategy again in the future, I’ll account for that.
Yes, that’s the current situation: no two annotations on the same (top level) tier should have the same TIME_SLOT_REF. TIME_SLOTs can have the same value; if the begin time of an annotation is exactly the same as the end time of the previous one, that is no problem and is not considered an overlap. That is how ELAN creates and stores annotations that have no gap in between (e.g. if you create a new annotation partially ‘on top of’ an existing annotation).
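For script-generated files, one way to get to that structure is to give every annotation its own time slots and duplicate the TIME_VALUE where two annotations touch. The sketch below is only an illustration under assumptions, not something ELAN provides: it assumes slot IDs of the form `tsN`, treats tiers without PARENT_REF as top level, and assumes no time-aligned child tiers rely on the shared slots. The function names `next_slot_number` and `unshare_time_slots` and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Made-up miniature EAF in which a1 and a2 share the time slot ts2.
SHARED_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3000"/>
  </TIME_ORDER>
  <TIER TIER_ID="orthographic_transcription" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def next_slot_number(time_order):
    """Return one more than the highest N among IDs of the form tsN."""
    highest = 0
    for slot in time_order:
        match = re.fullmatch(r"ts(\d+)", slot.get("TIME_SLOT_ID", ""))
        if match:
            highest = max(highest, int(match.group(1)))
    return highest + 1


def unshare_time_slots(root):
    """Give each repeated slot reference on a top-level tier its own
    TIME_SLOT with the same TIME_VALUE; return how many were created."""
    time_order = root.find("TIME_ORDER")
    by_id = {slot.get("TIME_SLOT_ID"): slot for slot in time_order}
    number = next_slot_number(time_order)
    created = 0
    for tier in root.iter("TIER"):
        if tier.get("PARENT_REF") is not None:
            continue  # assumption: dependent tiers are left untouched
        seen = set()
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            for attr in ("TIME_SLOT_REF1", "TIME_SLOT_REF2"):
                slot_id = ann.get(attr)
                if slot_id not in seen:
                    seen.add(slot_id)
                    continue
                # Slot already used on this tier: copy it under a new ID.
                original = by_id[slot_id]
                new_id = f"ts{number}"
                number += 1
                copy = ET.Element("TIME_SLOT", {"TIME_SLOT_ID": new_id})
                if original.get("TIME_VALUE") is not None:
                    copy.set("TIME_VALUE", original.get("TIME_VALUE"))
                # Keep the copy right after the original in TIME_ORDER.
                time_order.insert(list(time_order).index(original) + 1, copy)
                by_id[new_id] = copy
                ann.set(attr, new_id)
                seen.add(new_id)
                created += 1
    return created


fixed_root = ET.fromstring(SHARED_EAF)
created = unshare_time_slots(fixed_root)
print(f"created {created} new time slot(s)")
```

Placing the copy directly after the original keeps TIME_ORDER sorted by time. Whether ELAN is happy with the result is best confirmed by opening a rewritten copy of a file and splitting an annotation, with the originals kept as backup.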
There are some ideas about a command line interface for ELAN, but no concrete implementation plans yet. If at all, the first implementation will probably mainly deal with EAF creation, conversion, modification etc., not so much with e.g. running recognizers (but maybe that is not the part you need the ELAN command line for?).
In my case (which is, admittedly, probably not a typical use case of Elan), we generated eaf files, recognized utterance chunks, split audio (made copies of segments corresponding to chunks), sent audio to ASR, and populated the returned text into the eaf files, all with scripting. So the spoken language was already annotated and transcribed before the first time anyone opened an eaf with Elan; we use Elan itself “only” to correct ASR-generated transcriptions and for its excellent search functionality. Running the recognizers from the command line would certainly be an asset for this type of workflow, as would creating the eaf with Elan itself, thereby avoiding my TIME_SLOT_REF mistake.
The problem with TIME_SLOT_REFs only became apparent well into the project when someone on the team decided to split an annotation for the first time, and everything actually works fine if no edits are made to annotations’ start and end points.