Calculating inter-rater reliability by calculating the ratio of overlap and total extent


I would love a bit more information on how the measures in the output of this analysis are derived and how one should interpret them. Is there a source that I have maybe overlooked?

Is there a cut off where the average overlap/extent ratio might be said to indicate good or very good inter-rater reliability in terms of segmentation? e.g., is the below considered high?

Average overlap/extent ratio: 0.8182
Overall average overlap/extent ratio: 0.8182

Thanks for your help,

Hi Nicky,

That method for calculating a segmentation agreement ratio is still there for internal use, but since it doesn't take chance agreement into account and is not an accepted measure in any field, it cannot be used in publications. It is also not possible to say in general whether a given ratio should be considered high. 0.8182 seems quite good, but it depends on the type of research and the type of events that are being annotated.


The overlap/extent ratio of 0.8182 means that the raters agree quite well on what they're judging. There's no strict cutoff point, but generally:

0.5 or lower means not much agreement,
0.5 to 0.75 is moderate,
0.75 to 0.9 is good,
Above 0.9 is excellent.

Your score of 0.8182 indicates a good level of agreement. It's not perfect, but it's enough to say the raters are mostly in sync with each other.
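For anyone curious how a ratio like this could be computed, here is a minimal sketch of one plausible interpretation: the summed pairwise overlap between two annotators' segments, divided by the total extent (the length of the union of all segments). This is an illustration only and is not claimed to be the tool's exact algorithm; the function name and example data are made up.

```python
def overlap_extent_ratio(segments_a, segments_b):
    """Hypothetical overlap/extent ratio for two annotators' (start, end)
    segments: total pairwise overlap divided by the length of the union."""
    # Sum the overlap of every pair of segments from the two annotators.
    overlap = 0.0
    for a_start, a_end in segments_a:
        for b_start, b_end in segments_b:
            overlap += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    # Total extent: merge all intervals and sum the merged lengths.
    events = sorted(segments_a + segments_b)
    extent = 0.0
    cur_start, cur_end = events[0]
    for start, end in events[1:]:
        if start > cur_end:              # gap: close the current merged interval
            extent += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                            # overlapping/adjacent: extend it
            cur_end = max(cur_end, end)
    extent += cur_end - cur_start
    return overlap / extent if extent else 0.0

# Made-up example: two raters segmenting the same 10-second stretch.
rater1 = [(0.0, 4.0), (5.0, 9.0)]
rater2 = [(0.5, 4.0), (5.0, 10.0)]
print(round(overlap_extent_ratio(rater1, rater2), 4))  # → 0.8333
```

As the reply above notes, a raw ratio like this does not correct for chance agreement, which is one reason it is not an accepted reliability measure on its own.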