Re: kappa & ICC questions



On Mon, 17 Nov 2008 21:25:35 -0800 (PST), Gay A <g.armsden@xxxxxxxxx>
wrote:

Hello,

I've learned a lot of useful things from reading the knowledgeable
answers provided on this list -- so, thank you. But I don't see
answers to my questions. I have read widely from articles available on
the internet to no avail. I have consulted 2 statisticians who gave me
different answers. I would very much appreciate any enlightening
comments you can offer.

I'll list the questions and then provide background info.

(1) Can the data from more than one pair of raters in an inter-rater
reliability study (or all raters in a test-retest reliability study)
be entered into one kappa analysis? I would think not, as it would
change the parameters of calculating chance agreement. However, I have
seen several published papers where that was done. Plus, one
statistician I consulted thought I could (but another disagreed).

There exists a creature called a multi-rater kappa. I have never
used it, and I would avoid using it. It conceals more than it
reveals.

Kappa is most often presented for dichotomies, such as diagnosis.
I do not like it for anything more than dichotomies because of its
dependence on the marginal frequencies, except for the purpose of
describing a set of narrowly differing tables.
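
To illustrate the dependence on marginal frequencies, here is a minimal
sketch in Python. The two 2x2 tables are made-up counts, not data from
this thread, and cohen_kappa is a helper written for this example. Both
tables show 90% raw agreement; only the marginals differ.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows: rater A, columns: rater B)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of cases on the diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for each rater.
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected by chance, given those marginals.
    p_exp = sum(row_p[i] * col_p[i] for i in range(k))
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical counts: both tables have 90 agreements out of 100.
balanced = [[45, 5], [5, 45]]  # each rater splits the cases 50/50
skewed = [[85, 5], [5, 5]]     # each rater splits the cases 90/10

kappa_balanced = cohen_kappa(balanced)
kappa_skewed = cohen_kappa(skewed)
print(round(kappa_balanced, 3), round(kappa_skewed, 3))
```

With the balanced marginals kappa is 0.8; with the skewed marginals it
drops to about 0.44, although the raters agree on the same share of cases.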


(2) Can I reasonably use ICC on ordinal data from behavior ratings of
Poor, Fair, Good, Very Good (corresponding 1,2,3,4 response choices
are used on one measure, but no numbers are used on the 2nd measure).
I have read that it would be appropriate at least for the first
measure.

Assign similar numbers and you can use ICC for both. Keep in
mind that ICC consists of a family of measures: estimating the
reliability of a single rater or of the average of raters; and
estimating for *these* particular raters or for a random set of raters.
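
As a rough sketch of that family, the code below computes four versions
from two-way ANOVA mean squares, using the Shrout and Fleiss labels
ICC(2,1), ICC(2,k), ICC(3,1), ICC(3,k). It assumes a fully crossed design
(every rater scores every client), which may not match the design
described in this thread. The ratings are hypothetical 1-4 scores and
icc_two_way is a helper name chosen for this example.

```python
def icc_two_way(ratings):
    """ICCs from a complete subjects-by-raters table (list of rows)."""
    n = len(ratings)      # subjects (clients)
    k = len(ratings[0])   # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between subjects
    msc = ss_cols / (k - 1)             # between raters
    mse = ss_err / ((n - 1) * (k - 1))  # residual

    return {
        # Raters treated as a random sample; rater mean differences count as error.
        "ICC(2,1)": (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "ICC(2,k)": (msr - mse) / (msr + (msc - mse) / n),
        # Raters treated as fixed (*these* raters only).
        "ICC(3,1)": (msr - mse) / (msr + (k - 1) * mse),
        "ICC(3,k)": (msr - mse) / msr,
    }

# Hypothetical data: 5 clients (rows) scored 1-4 by 2 raters (columns).
ratings = [[1, 2], [2, 2], [3, 4], [4, 4], [2, 3]]
icc = icc_two_way(ratings)
for name, value in icc.items():
    print(name, round(value, 3))
```

In this toy table the second rater scores higher on average, so the
random-rater versions come out lower than the fixed-rater versions, and
the average-of-raters versions come out higher than the single-rater ones.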


Background: I'm conducting reliability studies on 2 measures developed
at a social service agency. It has taken a year to collect a total N
of 100 test-retest sets of scores from 10 raters who each rated
different clients. (See response choices in (2) above). We have not
finished collecting the inter-rater reliability data yet, but in that
study each pair of raters rated different clients. The N of clients
rated by each rater in the test-retest reliability study ranges from 9
to 27 (Md=16).

The problem with using kappa on each rater's (or rater-pair's) ratings
is that there are 16 cells in the analysis and only 9-27 ratings, so
many cells can have 0's. As you know, calculating kappas on this kind
of data is problematic. The agency would rather not collect data for
another year to fill those empty cells. So, that's why I'm asking the
questions above.

Ns of 9-27 are pretty darn small for confirming reliability.
Kappa is pretty poor for assessing reliability for other than
dichotomies. Use Pearson r's and t-tests to check on the
similarity and differences of pairs of raters, where you have
pairs -- this can show you where hazardous differences *may*
exist.
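
A minimal sketch of that check for one pair of raters, using only the
Python standard library. The scores are hypothetical, and pearson_r and
paired_t are helper names chosen for this example. The r describes how
similarly the two raters order the clients; the paired t describes
whether one rater scores systematically higher than the other.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def paired_t(x, y):
    """Paired t statistic for the mean of (y - x), with its degrees of freedom."""
    diffs = [b - a for a, b in zip(x, y)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1

# Hypothetical 1-4 ratings of the same five clients by two raters.
rater_a = [1, 2, 3, 4, 2]
rater_b = [2, 2, 4, 4, 3]

r = pearson_r(rater_a, rater_b)
t, df = paired_t(rater_a, rater_b)
print(f"r = {r:.3f}, t = {t:.3f} on {df} df")
```

The sketch stops at the t statistic; compare it against a t table with
the printed degrees of freedom, or use a statistics package for the
p-value. With Ns this small, treat either number as a flag for a closer
look and not as a verdict.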

Keep in mind that Reliability is an assessment of *raters*
given a particular *population* (or sample). That is, if you
have small variability in a sample, you can't expect to get high
reliability, in terms of correlation or kappa.
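
A small simulation can make the point about sample variability concrete.
Everything here is an assumption made for illustration: two raters who
each see a client's true score plus independent normal error with the
same spread, and two samples that differ only in how much the true scores
vary. simulate_r is a hypothetical helper, not a standard function.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_r(true_sd, error_sd=0.5, n=2000, seed=0):
    """Correlation between two simulated raters with identical error spread."""
    rng = random.Random(seed)
    a, b = [], []
    for _ in range(n):
        true_score = rng.gauss(0.0, true_sd)
        a.append(true_score + rng.gauss(0.0, error_sd))
        b.append(true_score + rng.gauss(0.0, error_sd))
    return pearson_r(a, b)

# Same raters, same rating error; only the spread of the clients differs.
r_wide = simulate_r(true_sd=1.0)    # heterogeneous sample
r_narrow = simulate_r(true_sd=0.3)  # homogeneous sample
print(round(r_wide, 2), round(r_narrow, 2))
```

Under this model the expected correlation is
true variance / (true variance + error variance), so the homogeneous
sample yields a much lower figure even though the raters are no less
careful.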

Also, test-retest data over a year is different from two-raters
in a narrow time frame. Both can be useful. But the two should
not be mixed together and confused.



Would anyone have any comments or suggestions? I don't have a copy of
Fleiss's 1981 book on Rates and Proportions. If the answers are in
there I will shell out the $90.

Rather than particular tests in Fleiss, I think you need an
overview of reliability and testing. Look in your library for
whatever is listed under psychometrics, and browse.


Thanks again for this very valuable list.

--
Rich Ulrich