I got different kappa values from EasyDIAg and ELAN, and I don't know how to read those results

Hi there.
I used both EasyDIAg (by Holle & Rein) and ELAN to calculate kappa, but I got different results.

The results from EasyDIAg said:

  1. Percentage of linked units:
    linked = 0.56 %

  2. Overall agreement indices (including no match):
    Raw agreement = 0.53
    kappa = 0
    kappa_max = NaN

  3. Overall agreement indices (excluding no match):
    Raw agreement = 0.96
    kappa = 0.91
    kappa_max = 0.94

And the results from ELAN's "Calculate inter-annotator reliability" function said:
Global kappa values and agreement matrix:
Overall results (incl. unlinked/unmatched annotations):
  kappa_ipf = 0.3420
  kappa_max = 0.9571
  raw agreement = 0.5349

Overall results (excl. unlinked/unmatched annotations):
  kappa (excl.) = 0.9061
  kappa_max (excl.) = 0.9374
  raw agreement (excl.) = 0.9583

Now I am confused about which result I should accept.
Also, my results seem strange: when the unmatched data are excluded, my kappa (excl.) is much higher than kappa_ipf. But I have read the paper by Holle & Rein and some earlier posts, which say that kappa_ipf is the most important value. What can I do, given that my kappa_ipf is low?
Thanks a lot!

Hi Caitlyn,

Yes, that difference is confusing and unfortunate. The results excluding “no match” are the same between the two systems (when rounded to two decimals) but the results including the “no match” are quite different (apart from the raw agreement).
When we (re-)implemented this kappa calculation in ELAN, we tried to stay as close as possible to the EasyDIAg implementation, and from our own testing and that of several research groups the results are usually comparable.
So, I don’t know how to explain the difference in results when including “no match” annotations. In case of such differences, I would tend to advise trusting and using the results from the original implementation. Looking at these results, I guess you must have quite a few unmatched annotations?
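To make the effect of the “no match” category concrete, here is a rough sketch (not the EasyDIAg or ELAN code) of Cohen’s kappa computed on a small, made-up confusion matrix, once with and once without a “no match” row and column. The counts below are invented purely for illustration:

```python
# Sketch only: plain Cohen's kappa on a confusion matrix, to show how a
# large "no match" category drags the overall kappa down even when the
# matched annotations agree well. Not the EasyDIAg/ELAN implementation.

def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (list of row lists)."""
    total = sum(sum(row) for row in matrix)
    p_obs = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    p_exp = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical counts: categories "m", "n", plus a "no match" row/column.
# Annotator A in rows, annotator B in columns.
full = [
    [20,  1,  9],   # m
    [ 1, 18,  8],   # n
    [10,  7,  0],   # no match (diagonal cell is 0 by definition)
]
matched_only = [row[:2] for row in full[:2]]  # drop the "no match" category

print(round(cohens_kappa(full), 2))          # → 0.25  (incl. no match)
print(round(cohens_kappa(matched_only), 2))  # → 0.9   (excl. no match)
```

The matched annotations alone agree very well, but the many unmatched units pull the combined kappa far down, which is the same pattern as in your two result blocks.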



Thanks a lot, Mr. Han! I really appreciate you answering my questions.

Yes, we do have unmatched places, but I think it’s related to our topic. Or rather, I guess it’s more about unmatched durations.

We are working on conversational repairs (problems). We need to locate the repair sources and classify them into different categories. However, it can be difficult to locate a repair source at exactly the same time points.

And I guess different durations might lead to different repair categories as well.

That is to say, for example, if annotator A chooses a longer duration, he might find more repair categories in that period, so his annotation would contain more than one kind of repair, like “m & n”, whereas B might have divided that long stretch into two shorter parts labelled “m” and “n”. Then “m & n” would be unmatched with “m” and “n”. I am not sure.

Could I then just analyse the number or percentage of repair categories instead? But I am also worried about the repair locations: some places may have been marked as a repair by annotator A but not by B. I am confused now.

Thanks a lot!

Yes, sometimes there can be a kind of amplifying effect: if annotators disagree more on the segmentation, they might also disagree more on the categories of the segments that do match (although your kappa values excluding unmatched annotations are quite good).

But if annotators disagree a lot on the start and/or duration of repairs, this can be an indication that “repair” itself is not defined clearly enough, or that the annotators might require more training (so that they agree more often on, e.g., whether a repair is one longer unit or should be segmented into two shorter units). A low “including unmatched” score then correctly reflects the disagreement between the annotators.
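For what it’s worth, the segmentation effect you describe can be sketched as a simple temporal-overlap check. This is only a rough sketch, not the actual EasyDIAg or ELAN linking code: the 60% threshold is EasyDIAg’s default, but the exact ratio definition used here (overlap relative to the combined extent) is an assumption, and the times are invented:

```python
# Sketch of a temporal-overlap linking check between two annotators'
# intervals. The 60% threshold follows EasyDIAg's default; the ratio
# definition (overlap / combined extent) is an assumption for illustration.

def overlap_ratio(a, b):
    """Overlap of intervals a and b relative to their combined extent."""
    (a0, a1), (b0, b1) = a, b
    overlap = max(0.0, min(a1, b1) - max(a0, b0))
    union = max(a1, b1) - min(a0, b0)
    return overlap / union if union > 0 else 0.0

def linked(a, b, threshold=0.6):
    return overlap_ratio(a, b) >= threshold

# Hypothetical times (seconds): annotator A uses one long "m & n" unit,
# annotator B splits the same stretch into "m" and "n".
a_unit = (10.0, 14.0)                  # A: "m & n"
b_m, b_n = (10.0, 12.0), (12.0, 14.0)  # B: "m", then "n"

print(linked(a_unit, b_m))  # → False (2s overlap / 4s extent = 0.5 < 0.6)
print(linked(a_unit, b_n))  # → False (also 0.5)
```

So in this made-up case, neither of B’s shorter units links to A’s long unit, and all three annotations end up as “no match”, which is exactly the pattern that lowers the “including unmatched” kappa.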
I’m afraid I cannot advise on what to analyse and what to report about the research.


Yes, I agree. Thanks for your advice; we will review our coding scheme and look for the problems.