Inter-annotator reliability (Cohen's kappa)

Hi,
I have two questions regarding the calculation of the inter-annotator reliability using Cohen’s kappa.

  1. Is it possible to calculate inter-annotator reliability with reference to only one single value of the controlled vocabulary? So far I have compared two tiers and didn’t get satisfying values, so I was wondering if it’s possible to check every value individually, to find out which one(s) is/are not reliable enough.

  2. I have noticed that if only two values of the controlled vocabulary are used, the inter-annotator reliability result is NaN (not a number). For example, I compare two tiers: on one there is a single long annotation with value “x”, and on the other there are many annotations with value “x” (covering 82% of the annotated video), which are now and then interrupted by other annotations with value “y”.
    Since one tier is 100% “x” and the other 82%, I would expect a strong inter-annotator reliability result, even if the segmentation differs between the two tiers. Instead, I get the NaN value. Could you please explain why this happens?

Thanks a lot in advance,
Marta

Hello Marta,

I always find it difficult to visualize and understand these scenarios without the data at hand, but I’ll try to answer your two questions.

  1. I guess the answer here should be “no”. The output does, in the end, contain global agreement tables per value, but these are still based on the comparison of all values found in the selected tiers (e.g. a table for value A compared to all other, non-A, values). That doesn’t seem to be what you want, am I right? It sounds like you would want to ignore all annotations except those with a specific value and then look at the results. What you would then get, I guess, is an idea of how many annotations with that value “match” (in terms of overlap) another annotation and how many are unmatched. The outcome could be quite similar to what you got so far (except maybe for annotation values that occur only a few times compared to most other values; see e.g. the Limitations section of the kappa Wikipedia page).
    Anyway, if you want to try that, you could use e.g. Tier > Copy Annotations from Tier to Tier to create two new tiers that both contain only the value you are interested in (that function has such an option), and then apply the calculation to those two tiers. I’m not sure this makes sense, though; in general I think one should be careful about looking for ‘workarounds’ when the reliability values are not satisfactory. Maybe there are good reasons to argue that this modified kappa is simply not a suitable measure for the type of data you have (see the Wikipedia article). A small sketch of this per-value idea follows right after point 2 below.

  2. Segmentation and overlap are crucial in the kappa implementation in ELAN. So, if one tier has only one long annotation “x” and the other tier has many, there is no agreement. Remember that in the second step of the inter-annotator reliability window you have to specify a minimal percentage of overlap; it is unlikely that any of the annotations on the second tier fulfills that requirement.
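
Coming back to point 1: maybe a small sketch makes that per-value idea a bit more concrete. This is not ELAN’s code, and ELAN does not sample the tiers at fixed intervals like this; the alignment into two equally long label sequences and the collapse into “value vs. other” are my own simplifications, just to show the role the marginal frequencies play.

```python
from collections import Counter
import math

def cohens_kappa(labels1, labels2):
    """Plain Cohen's kappa for two equally long label sequences."""
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    freq1, freq2 = Counter(labels1), Counter(labels2)
    # chance agreement, from the marginal frequencies of both raters
    expected = sum(freq1[c] * freq2[c] for c in freq1) / (n * n)
    if math.isclose(expected, 1.0):
        return float("nan")  # denominator is 0: kappa is undefined, shown as NaN
    return (observed - expected) / (1 - expected)

def one_vs_rest_kappa(labels1, labels2, value):
    """Collapse every label that is not `value` into 'other', then compute kappa."""
    collapse = lambda seq: [lab if lab == value else "other" for lab in seq]
    return cohens_kappa(collapse(labels1), collapse(labels2))

# invented, time-aligned samples from two raters
rater1 = ["x"] * 10
rater2 = ["x", "x", "y", "x", "x", "y", "x", "x", "x", "x"]
print(one_vs_rest_kappa(rater1, rater2, "x"))  # 0.0: observed 0.8, chance 0.8
```

So even with 80% raw agreement on the samples, the per-value kappa can come out as 0 when one rater uses only a single value; that is the kind of marginal effect the Limitations section of the Wikipedia article describes.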

The ELAN manual is quite brief on this topic but refers to a Holle & Rein paper and to the EasyDIAg manual. It would be useful to consult those, if you haven’t done so already (the function in ELAN is a partial re-implementation of EasyDIAg).
I hope this helps!

-Han

Hello Han,
thanks a lot for your quick answers!
I have one more question regarding point 2 (for which I am also attaching a screenshot). I understand that segmentation and overlap are crucial for the kappa calculation, and I have actually read the Holle & Rein paper as well. In the situation I am describing, the segmentation differs strongly from one tier to the other (since one tier has one single, long “x” segment, whereas the other has many alternating “x” and “y” segments), but overlap should not be reducing the kappa result, since all the “x” segments on the second tier overlap completely with the single “x” segment of the first tier. So I would not expect a very low kappa result. Or am I overlooking/misunderstanding something?

Thanks again and best regards,
Marta

Hi Marta, I don’t see the screenshot, but in this case I think your description is sufficient.
Figure 2 of the Holle & Rein BRM paper illustrates the matching process for the annotations created by two raters. It contains a few examples of annotations of one rater (R1) overlapping multiple annotations of the other rater (R2); in that case only one of the R2 annotations, the best match, is connected to the R1 annotation, and the other overlapping annotations are categorized as “unmatched” or “nil”.
In your case all, or all but one, of the annotations on the second tier are unmatched; the two raters disagree strongly on the segmentation. It is as if they had been looking at different events or phenomena.
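
A very small sketch may help to see what “best match” and “unmatched” mean here. This is my own, much simplified version of the matching step and not the actual EasyDIAg/ELAN algorithm: the overlap measure (overlap divided by the combined extent of the two annotations) and the greedy best-match selection are assumptions for illustration only, and the 0.6 threshold just stands for whatever minimal overlap percentage you entered in the dialog.

```python
def overlap_ratio(a, b):
    """Overlap of two (start, end) intervals relative to their combined extent.
    Illustration only; the exact overlap criterion in EasyDIAg/ELAN may differ."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    combined = max(a[1], b[1]) - min(a[0], b[0])
    return intersection / combined if combined > 0 else 0.0

def match_annotations(tier1, tier2, threshold=0.6):
    """Link each tier1 annotation to its best-overlapping, not yet used tier2
    annotation; pairs below the threshold and leftovers become unmatched/'nil'."""
    pairs, used = [], set()
    for ann1 in tier1:
        best_j, best_ratio = None, 0.0
        for j, ann2 in enumerate(tier2):
            if j in used:
                continue
            ratio = overlap_ratio(ann1, ann2)
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_j is not None and best_ratio >= threshold:
            used.add(best_j)
            pairs.append((ann1, tier2[best_j]))  # linked: categories get compared
        else:
            pairs.append((ann1, None))           # unmatched / 'nil'
    pairs.extend((None, ann2) for j, ann2 in enumerate(tier2) if j not in used)
    return pairs

# invented example: one long "x" vs. alternating short "x"/"y" segments
tier_r1 = [(0.0, 10.0, "x")]
tier_r2 = [(0.0, 2.0, "x"), (2.0, 3.0, "y"), (3.0, 6.0, "x"),
           (6.0, 7.0, "y"), (7.0, 10.0, "x")]
for pair in match_annotations(tier_r1, tier_r2):
    print(pair)
```

With these invented time values the best overlap ratio is only 0.3, below the 0.6 threshold, so every annotation ends up unmatched, which is roughly the situation described above.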

In a ‘traditional’ kappa calculation such situations cannot occur: the samples are given, and both raters categorize the same set of samples. In the ‘modified kappa’ procedure (of ELAN and EasyDIAg) the raters more or less have to identify the samples first and then also apply a category to each sample. If the segmentation created by one rater is very different from that produced by the other, there is no high agreement and therefore no high kappa value.
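For reference, as far as I understand the formula itself is the same in both cases; what differs is the agreement table it is applied to (the modified procedure adds an “unmatched”/“nil” category for annotations without a counterpart):

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_o = \frac{1}{n}\sum_i n_{ii},
\qquad
p_e = \frac{1}{n^2}\sum_i n_{i+}\, n_{+i}
```

Here n_ii are the diagonal cells of the table, n_i+ and n_+i the row and column totals, and n the total number of items (including the unmatched ones). If most annotations of one rater end up in the “nil” row or column, the observed agreement p_o drops and kappa drops with it; and if 1 - p_e happens to be exactly 0, the result is undefined, which is one way a NaN can appear.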
I’m not an expert in this field, by the way, and I’m probably not the best at explaining these things.

Your explanation helps me a lot!
Thank you very much for your time!
Best,
Marta