Hi,
I have two questions regarding the calculation of the inter-annotator reliability using Cohen’s kappa.
Is it possible to calculate inter-annotator reliability with reference to only one single value of the controlled vocabulary? So far I have compared two tiers and didn’t get satisfying values, so I was wondering whether it’s possible to check every value individually, to find out which one(s) is/are not reliable enough.
I have noticed that if only two values of the controlled vocabulary are used, the inter-annotator reliability result is NaN (not a number). For example, I compare two tiers: on one there is a single long annotation with value “x”, and on the other there are many annotations with value “x” (covering 82% of the annotated video), which are now and then interrupted by other annotations with value “y”.
Since one tier has 100% “x” and the other 82%, I would expect a strong inter-annotator reliability result, even if the segmentation differs between the two tiers. Instead, I get the NaN value. Could you please explain why this happens?
I always find it difficult to visualize and understand the described scenarios without the data at hand, but I’ll try to answer your two questions.
I guess the answer here should be “no”. The output at the end contains global agreement tables per value, but these are still based on the comparison of all values found in the selected tiers (e.g. a table for value A compared to all other, non-A, values). But this doesn’t seem to be what you want, am I right? It sounds like you would want to ignore all annotations except those with a specific value and then look at the results. What you would then get, I guess, is an idea of how many annotations with that value “match” (in terms of overlap) another annotation and how many are unmatched. The outcome could be quite similar to what you got so far (except maybe for those annotation values that occur only a few times compared to most other values; see e.g. the Limitations section of the kappa Wikipedia page).
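To make that a bit more concrete, here is a small Python sketch of what such a per-value collapse boils down to; the value pairs are made up and this is of course not ELAN’s actual code:

```python
from collections import Counter

def per_value_table(pairs, target):
    """Collapse matched (rater1_value, rater2_value) pairs into a 2x2 table
    for one value: 'target' versus everything else ('non-target')."""
    collapse = lambda v: target if v == target else "non-" + target
    return Counter((collapse(a), collapse(b)) for a, b in pairs)

# Hypothetical matched value pairs from two raters:
pairs = [("x", "x"), ("x", "y"), ("y", "y"), ("x", "x"), ("y", "x")]
print(per_value_table(pairs, "x"))
# Counter({('x', 'x'): 2, ('x', 'non-x'): 1, ('non-x', 'non-x'): 1, ('non-x', 'x'): 1})
```

Such a table still depends on all the other values, so looking at one value this way does not really escape the overall comparison.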
Anyway, if you want to try that, you could use Tier > Copy Annotations from Tier to Tier to create two new tiers that both contain only the value you are interested in (that function has such an option), and then apply the calculation to those two tiers. I’m not sure this makes sense; in general I think one should be careful about finding ‘workarounds’ when the reliability values are not satisfactory. Maybe there are good reasons to argue that this modified kappa is not a suitable measure for the type of data you have (see the Wikipedia article).
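If you prefer to script that filtering step outside ELAN, something along these lines should do it. This is a sketch using the pympi-ling Python library; the file name and tier names are placeholders, so please check the method names against the pympi documentation you have installed:

```python
# Sketch: create, for each rater, a new tier that contains only the "x"
# annotations, so the kappa calculation can then be run on those tiers.
# Uses the pympi-ling library; file and tier names are placeholders.
from pympi import Elan

eaf = Elan.Eaf("recording.eaf")

for src, dst in [("rater1", "rater1_x_only"), ("rater2", "rater2_x_only")]:
    eaf.add_tier(dst)
    for start, end, value in eaf.get_annotation_data_for_tier(src):
        if value == "x":                     # keep only the value of interest
            eaf.add_annotation(dst, start, end, value)

eaf.to_file("recording_x_only.eaf")
```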
Segmentation and overlap are crucial in the kappa implementation in ELAN. So, if one tier has only one long annotation “x” and the other tier has many, there is no agreement. Remember that in the second step of the inter-annotator reliability window you have to specify a minimal percentage of overlap; it’s unlikely that any of the annotations on the second tier fulfills that requirement.
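To give a feeling for the numbers: if the overlap percentage is computed as the overlap relative to the combined extent of the two annotations (treat that as an assumption; I’d have to check the exact definition used by ELAN/EasyDIAg), a short annotation inside one very long annotation scores very low:

```python
def overlap_ratio(a, b):
    """Overlap of two (start, end) intervals relative to their combined
    extent (intersection over union). Assumption: ELAN/EasyDIAg may define
    the minimal overlap percentage somewhat differently."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return intersection / union if union > 0 else 0.0

long_x = (0, 100_000)        # one long "x" annotation on tier 1 (ms)
short_x = (20_000, 25_000)   # one of many short "x" annotations on tier 2

print(overlap_ratio(long_x, short_x))  # 0.05, far below e.g. a 60% threshold
```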
Hello Han,
thanks a lot for your quick answers!
I would have one more question regarding point 2 (for which I am also attaching a screenshot). I understand that segmentation and overlap are crucial for the kappa calculation, and I have actually also read the Holle & Rein paper. In the situation I am describing, the segmentation differs strongly from one tier to the other (since one tier has one single, long “x” segment, whereas the other has many alternating “x” and “y” segments), but the overlap should not be reducing the kappa result, since all “x” segments contained in the second tier overlap completely with the single “x” segment of the first tier. So I would not expect a very low kappa result. Or am I overlooking/misunderstanding something?
Hi Marta, I don’t see the screenshot, but in this case I think your description is sufficient.
Figure 2 of the Holle & Rein BRM paper illustrates the matching process of the annotations created by two raters. It contains a few examples of annotations of one rater (R1) overlapping multiple annotations of the other rater (R2), in which case only one of the R2 annotations, the best match, is connected to the R1 annotation. The other overlapping annotations are categorized as “unmatched” or “nil”.
In your case all, or all but one, of the annotations of the second tier are unmatched; the two raters disagree strongly on the segmentation, as if they had been looking at different events or phenomena.
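Roughly, the matching works like in this little sketch (again not ELAN’s actual algorithm, just to illustrate the idea behind Figure 2, with made-up intervals):

```python
def best_matches(r1, r2):
    """Link each rater-1 annotation to the rater-2 annotation with the
    largest temporal overlap; whatever remains on either side is treated
    as unmatched ('nil'). Annotations are (start, end, value) tuples.
    Only an illustration of the idea behind Figure 2 of Holle & Rein."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    matched, used = [], set()
    for a in r1:
        candidates = [(overlap(a, b), i) for i, b in enumerate(r2)
                      if i not in used and overlap(a, b) > 0]
        if candidates:
            _, i = max(candidates)
            used.add(i)
            matched.append((a, r2[i]))
        else:
            matched.append((a, None))   # rater-1 annotation without a match
    unmatched_r2 = [b for i, b in enumerate(r2) if i not in used]
    return matched, unmatched_r2

# The scenario described above: one long "x" versus alternating "x"/"y".
r1 = [(0, 100_000, "x")]
r2 = [(0, 30_000, "x"), (30_000, 40_000, "y"), (40_000, 100_000, "x")]
pairs, nil = best_matches(r1, r2)
print(pairs)  # only one rater-2 annotation gets linked to the long one
print(nil)    # all the others end up as nil/unmatched
```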
In ‘traditional’ kappa calculation such situations cannot occur: the samples are given and both raters categorize the same set of samples. In this ‘modified kappa’ procedure (of ELAN and EasyDIAg) the raters more or less have to identify the samples first and then also apply a category to each sample. If the segmentation created by one rater is very different from that produced by the other, there is no high agreement and no high kappa value.
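As a very rough sketch of how that turns into a kappa value: the matched pairs fill the regular cells of the contingency table, the unmatched annotations go into a ‘nil’ row or column, and Cohen’s kappa is computed on that table (the real ELAN/EasyDIAg procedure may differ in its details):

```python
from collections import Counter

def modified_kappa(matched_pairs, unmatched_r1, unmatched_r2):
    """Cohen's kappa on a contingency table that includes a 'nil' category
    for unmatched annotations. A sketch of the idea only; the actual
    ELAN/EasyDIAg computation may differ in its details."""
    counts = Counter()
    for a, b in matched_pairs:
        counts[(a[2], b[2])] += 1            # (rater-1 value, rater-2 value)
    for a in unmatched_r1:
        counts[(a[2], "nil")] += 1
    for b in unmatched_r2:
        counts[("nil", b[2])] += 1

    labels = sorted({l for pair in counts for l in pair})
    total = sum(counts.values())
    observed = sum(counts.get((l, l), 0) for l in labels) / total
    expected = sum(
        (sum(c for (r, _), c in counts.items() if r == l) / total) *
        (sum(c for (_, s), c in counts.items() if s == l) / total)
        for l in labels
    )
    # Guard against a zero denominator (expected agreement of 1), which is
    # one way a kappa calculation can end up as NaN.
    return (observed - expected) / (1 - expected) if expected < 1 else float("nan")

# Continuing the example above: one matched pair, two nil annotations.
matched = [((0, 100_000, "x"), (40_000, 100_000, "x"))]
nil_r2 = [(0, 30_000, "x"), (30_000, 40_000, "y")]
print(modified_kappa(matched, [], nil_r2))   # ~0.14, a low kappa
```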
I’m not an expert in this field, by the way, and I’m probably not the best in explaining these things.