Guidelines for analysis on measuring interrater reliability of nursing outcome classification

Intansari Nurjannah, Sri Marga Siwi


Indicators in nursing outcome classification (NOC) need to be tested for validity and reliability. One way to measure the reliability of NOC is interrater reliability. Kappa and percent agreement are statistical methods that are commonly used together to measure the interrater reliability of an instrument. They are used together because both yield reliability values that are easy to interpret. Two possible conflicts may emerge when the kappa value and the percent agreement are out of step with each other. This article aims to provide guidance for researchers who face these two possible conflicts. The guidance applies to interrater reliability measurement with two raters.
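As an illustration of how the two statistics can diverge, the following is a minimal Python sketch that computes percent agreement and Cohen's kappa for two raters. The function names and the yes/no ratings are hypothetical and are not taken from the article; the kappa formula used is the standard two-rater form, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters give the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    categories = set(count_a) | set(count_b)
    # Chance agreement: sum over categories of the product of marginal proportions.
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1:
        # Both raters used a single identical category; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of 20 items, invented for illustration only.
# The raters agree on 17 "yes" items and 1 "no" item, and disagree on 2 items.
rater_a = ["yes"] * 17 + ["yes", "no"] + ["no"]
rater_b = ["yes"] * 17 + ["no", "yes"] + ["no"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(f"percent agreement = {agreement:.2f}")
print(f"kappa = {kappa:.2f}")
```

In this invented data set the raters agree on 18 of 20 items (90%), yet kappa is only about 0.44 because almost all ratings fall in one category, which makes the chance agreement high (0.82). This is one way the two statistics can fall out of step.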


Interrater reliability, Kappa, Percent agreement



