
Beyond kappa: A review of interrater agreement measures. (English) Zbl 0929.62117

Summary: J. Cohen [Edu. Psych. Meas. 20, 37-46 (1960)] introduced the kappa coefficient to measure chance-corrected nominal scale agreement between two raters. Since then, numerous extensions and generalizations of this interrater agreement measure have been proposed in the literature. This paper reviews and critiques various approaches to the study of interrater agreement, for which the relevant data comprise either nominal or ordinal categorical ratings from multiple raters. It presents a comprehensive compilation of the main statistical approaches to this problem, descriptions and characterizations of the underlying models, and discussions of related statistical methodologies for estimation and confidence-interval construction. The emphasis is on various practical scenarios and designs that underlie the development of these measures, and the interrelationships between them.
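The coefficient Cohen introduced compares the observed proportion of agreement \(p_o\) with the agreement \(p_e\) expected by chance from the two raters' marginal distributions, via \(\kappa = (p_o - p_e)/(1 - p_e)\). As a minimal illustrative sketch (not part of the paper under review; the function name and data are hypothetical):

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected nominal-scale agreement between two raters (Cohen, 1960)."""
    n = len(ratings_a)
    # Observed proportion of agreement: fraction of subjects rated identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement under chance: product of the raters' marginal
    # frequencies for each category, summed over categories.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(count_a[k] * count_b.get(k, 0) for k in count_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Example: 4 subjects, raters disagree on one.
# p_o = 3/4, p_e = (2*1 + 2*3)/16 = 1/2, so kappa = 0.5.
kappa = cohen_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
```

Kappa equals 1 for perfect agreement and 0 when agreement is exactly at chance level; the extensions reviewed in the paper generalize this construction to weighted disagreements, multiple raters, and model-based formulations.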

MSC:

62P25 Applications of statistics to social sciences
62H20 Measures of association (correlation, canonical correlation, etc.)
62P15 Applications of statistics to psychology

References:

[1] Agresti, A model for agreement between ratings on an ordinal scale, Biometrics 44 pp 539– (1988) · Zbl 0707.62227
[2] Agresti, Modelling patterns of agreement and disagreement, Statist. Methods Med. Res. 1 pp 201– (1992)
[3] Agresti, Quasi-symmetric latent class models, with application to rater agreement, Biometrics 49 pp 131– (1993)
[4] Aickin, Maximum likelihood estimation of agreement in the constant predictive model, and its relation to Cohen’s kappa, Biometrics 46 pp 293– (1990) · Zbl 0715.62047
[5] Barlow, Measurement of interrater agreement with adjustment for covariates, Biometrics 52 pp 695– (1996) · Zbl 0875.62533
[6] Barlow, A comparison of methods for calculating a stratified kappa, Statist. Med. 10 pp 1465– (1991)
[7] Bartholomew, Latent variable models for ordered categorical data, J. Econometrics 22 pp 229– (1983)
[8] Bloch, 2 {\(\times\)} 2 kappa coefficients: Measures of agreement or association, Biometrics 45 pp 269– (1989) · Zbl 0715.62113
[9] Byrt, Bias, prevalence and kappa, J. Clin. Epidemiol. 46 pp 423– (1993)
[10] Chinchilli, A weighted concordance correlation coefficient for repeated measurement designs, Biometrics 52 pp 341– (1996) · Zbl 0876.62092
[11] Cicchetti, Comparison of the null distributions of weighted kappa and the C ordinal statistic, Appl. Psych. Meas. 1 pp 195– (1977)
[12] Cohen, A coefficient of agreement for nominal scales, Educ. and Psych. Meas. 20 pp 37– (1960)
[13] Cohen, Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit, Psych. Bull. 70 pp 213– (1968)
[14] Corey, The epidemiology of pregnancy complications and outcome in a Norwegian twin population, Obstetrics and Gynecol. 80 pp 989– (1992)
[15] Davies, Measuring agreement for multinomial data, Biometrics 38 pp 1047– (1982) · Zbl 0501.62045
[16] Donner, A goodness-of-fit approach to inference procedures for the kappa statistic: Confidence interval construction, significance-testing and sample size estimation, Statist. Med. 11 pp 1511– (1992)
[17] Donner, A hierarchical approach to inferences concerning interobserver agreement for multinomial data, Statist. Med. 16 pp 1097– (1997)
[18] Donner, The statistical analysis of kappa statistics in multiple samples, J. Clin. Epidemiol. 49 pp 1053– (1996)
[19] Donner, Testing homogeneity of kappa statistics, Biometrics 52 pp 176– (1996) · Zbl 0880.62110
[20] Dunn, Design and Analysis of Reliability Studies. (1989) · Zbl 0748.62060
[21] Everitt, Moments of the statistics kappa and weighted kappa, British J. Math. Statist. Psych. 21 pp 97– (1968) · doi:10.1111/j.2044-8317.1968.tb00400.x
[22] Feinstein, High agreement but low kappa I: The problems of two paradoxes, J. Clin. Epidemiol. 43 pp 543– (1990)
[23] Fleiss, Measuring nominal scale agreement among many raters, Psych. Bull. 76 pp 378– (1971)
[24] Fleiss, Statistical Methods for Rates and Proportions pp 144– (1973) · Zbl 0269.62006
[25] Fleiss, Inference about weighted kappa in the non-null case, Appl. Psych. Meas. 2 pp 113– (1978)
[26] Fleiss, The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability, Educ. and Psych. Meas. 33 pp 613– (1973)
[27] Fleiss, Jackknifing functions of multinomial frequencies, with an application to a measure of concordance, Amer. J. Epidemiol. 115 pp 841– (1982)
[28] Fleiss, Large sample standard errors of kappa and weighted kappa, Psych. Bull. 72 pp 323– (1969)
[29] Goodman, Simple models for the analysis of association in cross classifications having ordered categories, J. Amer. Statist. Assoc. 74 pp 537– (1979)
[30] Goodman, Measures of association for cross classifications, J. Amer. Statist. Assoc. 49 pp 732– (1954) · Zbl 0056.12801
[31] Graham, Modelling covariate effects in observer agreement studies: The case of nominal scale agreement, Statist. Med. 14 pp 299– (1995)
[32] Haberman, A stabilized Newton-Raphson algorithm for log-linear models for frequency tables derived by indirect observation, Sociol. Methodol. 18 pp 193– (1988)
[33] Hamdan, The equivalence of tetrachoric and maximum likelihood estimates of {\(\rho\)} in 2 {\(\times\)} 2 tables, Biometrika 57 pp 212– (1970) · Zbl 0193.16704
[34] Hutchinson, Kappa muddles together two sources of disagreement: Tetrachoric correlation is preferable, Res. Nursing and Health 16 pp 313– (1993)
[35] Irwig, Exposure-response relationship for a dichotomized response when the continuous underlying variable is not measured, Statist. Med. 7 pp 955– (1988)
[36] Johnson, Distributions in Statistics: Continuous Multivariate Distributions. pp 117– (1972) · Zbl 0248.62021
[37] Kendler, Familial influences on the clinical characteristics of major depression: A twin study, Acta Psychiatrica Scand. 86 pp 371– (1992)
[38] Kraemer, Extension of the kappa coefficient, Biometrics 36 pp 207– (1980) · Zbl 0463.62103
[39] Kraemer, How many raters? Toward the most reliable diagnostic consensus, Statist. Med. 11 pp 317– (1992)
[40] Kraemer, What is the "right" statistical measure of twin concordance (or diagnostic reliability and validity)?, Arch. Gen. Psychiatry 54 pp 1121– (1997) · doi:10.1001/archpsyc.1997.01830240081011
[41] Kvaerner, Distribution and heritability of recurrent ear infections, Ann. Otol. Rhinol. and Laryngol. 106 pp 624– (1997) · doi:10.1177/000348949710600802
[42] Landis, The measurement of observer agreement for categorical data, Biometrics 33 pp 159– (1977) · Zbl 0351.62039
[43] Landis, A one-way components of variance model for categorical data, Biometrics 33 pp 671– (1977)
[44] Liang, Longitudinal data analysis using generalized linear models, Biometrika 73 pp 13– (1986) · Zbl 0595.62110
[45] Light, Measures of response agreement for qualitative data: Some generalizations and alternatives, Psych. Bull. 76 pp 365– (1971)
[46] Lin, A concordance correlation coefficient to evaluate reproducibility, Biometrics 45 pp 255– (1989) · Zbl 0715.62114
[47] Maclure, Misinterpretation and misuse of the kappa statistic, Amer. J. Epidemiol. 126 pp 161– (1987) · doi:10.1093/aje/126.2.161
[48] O’Connell, General observer-agreement measures on individual subjects and groups of subjects, Biometrics 40 pp 973– (1984)
[49] Oden, Estimating kappa from binocular data, Statist. Med. 10 pp 1303– (1991)
[50] Pearson, Mathematical contributions to the theory of evolution VII: On the correlation of characters not quantitatively measurable, Philos. Trans. Roy. Soc. Ser. A 195 pp 1– (1901) · JFM 32.0238.01
[51] Posner, Measuring interrater reliability among multiple raters: An example of methods for nominal data, Statist. Med. 9 pp 1103– (1990)
[52] Qu, Latent variable models for clustered dichotomous data with multiple subclusters, Biometrics 48 pp 1095– (1992)
[53] Qu, Latent variable models for clustered ordinal data, Biometrics 51 pp 268– (1995) · Zbl 0825.62634
[54] Schouten, Estimating kappa from binocular data and comparing marginal probabilities, Statist. Med. 12 pp 2207– (1993)
[55] Scott, Reliability of content analysis: The case of nominal scale coding, Public Opinion Quart. 19 pp 321– (1955)
[56] Shoukri, Maximum likelihood estimation of the kappa coefficient from models of matched binary responses, Statist. Med. 14 pp 83– (1995)
[57] Snedecor, Statistical Methods (1967)
[58] Tallis, The maximum likelihood estimation of correlations from contingency tables, Biometrics 18 pp 342– (1962) · Zbl 0107.14005
[59] Tanner, Modeling agreement among raters, J. Amer. Statist. Assoc. 80 pp 175– (1985)
[60] Tanner, Modeling ordinal scale agreement, Psych. Bull. 98 pp 408– (1985)
[61] Uebersax, Latent class analysis of diagnostic agreement, Statist. Med. 9 pp 559– (1990)
[62] Williamson, Assessing interrater agreement from dependent data, Biometrics 53 pp 707– (1997) · Zbl 0881.62123
[63] Zwick, Another look at interrater agreement, Psych. Bull. 103 pp 374– (1988)