Re: Subjective Rating Sample Size
From: Bob Wheeler (bwheeler_at_echip.com)
Date: 12/23/04
- Next message: Ray Koopman: "Re: Looking for source for normsinv function"
- Previous message: George Kahrimanis: "Re: Induction of statistical models"
- In reply to: steveluz_at_frii.com: "Re: Subjective Rating Sample Size"
- Next in thread: Richard Ulrich: "Re: Subjective Rating Sample Size"
- Messages sorted by: [ date ] [ thread ]
Date: Thu, 23 Dec 2004 18:35:54 -0500
Read what Rich said more carefully. The usual problem is not about the
agreement of judges, but about the difference between wines. The data form
a two-dimensional table with judges as rows and wines as columns. The usual
problem compares columns; you are comparing rows. You can do that if you
like, since it is only a change of names, but the multidimensionality of
the movies will defeat you.
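
To make the rows-versus-columns point concrete, here is a small sketch in
Python. The ratings are invented numbers, and the names `ratings`,
`row_means` and `transpose` are hypothetical helpers, not anything from a
real data set or library. It only shows that swapping the roles of judges
and wines amounts to transposing the table.

```python
# Hypothetical ratings, invented for illustration:
# rows are judges, columns are wines.
ratings = [
    [3, 5, 2],  # judge A
    [4, 5, 1],  # judge B
    [3, 4, 2],  # judge C
]

def row_means(table):
    """Mean of each row of a rectangular table."""
    return [sum(row) / len(row) for row in table]

def transpose(table):
    """Swap rows and columns, i.e. swap the names 'judge' and 'wine'."""
    return [list(col) for col in zip(*table)]

judge_means = row_means(ratings)             # comparison between rows (judges)
wine_means = row_means(transpose(ratings))   # comparison between columns (wines)
```

The arithmetic is the same in both directions. What changes is whether the
thing being compared can reasonably be treated as one-dimensional.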

Consider the usual problem. If the judges are experts, they can be asked
to isolate a single dimension of the wine, say tartness, and to provide a
rating that may be treated as a univariate statistical variable. As Rich
points out, judging whether or not wines differ in tartness can require a
great many judges -- hundreds if the difference is small. The problem
becomes nearly impossible if you simply ask for a preference between wines
without training the panel on a methodology for combining the dimensions,
since without training there will be clusters of preferences according to
how the judges choose to weight the dimensions.

In your case you have a few judges and many movies. The movies can be
assessed on many dimensions -- acting, direction, writing, etc. Unless
the judges agree on a methodology for combining these dimensions and
train themselves to mutual consistency, you will have clusters that
defy any statistical test assuming a single dimension, and hence any
sample size calculation.
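
A rough simulation may help show what such clusters look like. Everything
in this sketch is an assumption made for illustration: the three
dimensions are drawn as independent standard normal scores, the weights
and noise level are arbitrary, and `pearson` and `rate` are hypothetical
helper names. It is not a model of real critics.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Each simulated movie gets scores on (acting, direction, writing).
movies = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]

def rate(weights, noise_sd=0.3):
    """One critic's overall ratings: a weighted sum of dimensions plus noise."""
    return [
        sum(w * s for w, s in zip(weights, movie)) + rng.gauss(0, noise_sd)
        for movie in movies
    ]

# Two critics who mostly weight acting, one who mostly weights writing.
acting_critic_1 = rate([0.8, 0.1, 0.1])
acting_critic_2 = rate([0.8, 0.1, 0.1])
writing_critic = rate([0.1, 0.1, 0.8])

within_cluster = pearson(acting_critic_1, acting_critic_2)
across_clusters = pearson(acting_critic_1, writing_critic)
print(f"same weighting:      r = {within_cluster:.2f}")
print(f"different weighting: r = {across_clusters:.2f}")
```

Under these assumptions the two critics who share a weighting agree
closely, while critics with different weightings agree only weakly, even
though every critic is internally consistent. Adding more movies sharpens
the estimate of that weak agreement but does not remove it.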

If you were to take the reviews and break them down by dimension, you
might be able to do something. For example, if every reviewer expressed
an opinion about the writing that was sufficiently precise for you to
assign a numerical value, then you might be able to calculate a sample
size, provided you could estimate the standard deviation of this variable
and decide on a minimum detectable difference for it.
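
As a sketch of what such a calculation could look like, the following
uses the textbook normal-approximation formula for comparing two
independent group means, n = 2 * ((z_alpha + z_power) * sd / diff)^2 per
group. The two-group design, the two-sided 5% level, and the 80% power
are all assumptions chosen for illustration, and `ratings_per_group` is a
hypothetical name. A paired design, or a t-based calculation, would give
somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist

def ratings_per_group(sd, min_diff, alpha=0.05, power=0.80):
    """Approximate ratings needed per group to detect a mean difference.

    Assumes two independent groups, a common standard deviation `sd`,
    a two-sided test at level `alpha`, and a normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / min_diff) ** 2)

# Sample size grows with the inverse square of the detectable difference.
for diff in (1.0, 0.5, 0.25):
    print(diff, ratings_per_group(sd=1.0, min_diff=diff))
```

With a standard deviation of 1, this gives 16, 63 and 252 ratings per
group for differences of 1, 0.5 and 0.25 respectively, which is one way
to see how a small difference can call for hundreds of judges.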

Lots of luck!

steveluz@frii.com wrote:
> Thanks for your reply. I'm not a statistician as you probably note,
> but let me say what I originally was thinking.
>
> This seemed to me to be a lot like wine rating by a panel of judges,
> and there seems to be a fair amount of information on saying whether or
> not the judges agree. I'm not sure what the dimension has to do with
> this - judges will generally give a single ranking (value) to each wine
> (and might have some ties) that is based on their expert opinion.
>
> I thought that either a Kappa 'coefficient?' for option 1 below, or a
> one-way anova for option 2, would let me say whether the critics agreed
> or not. Then I felt that I could see how the significance level
> changed with increasing the number of movies, and perhaps be able to
> say for n critics and m movies that if the critics are in agreement
> that I'm x% confident that the critics rate the movies the same. It
> seemed to me that increasing the number of movies would increase the
> confidence level. I originally got stuck trying to find a significance
> level for the Kappa. Sounds like I'm all wet.
>
> Richard Ulrich wrote:
>
>>On Thu, 23 Dec 2004 14:21:06 +0000 (UTC), steveluz@frii.com (steveluz)
>>wrote:
>>
>>
>>>I agreed to take on this problem from a colleague because it seemed
>>>straight forward - I was wrong as usual.
>>>
>>>Given that I have a set of movie critics (perhaps a half-a-dozen to a
>>>dozen) each rating movies. I'd like to find the minimum number of
>>>movies that each critic has to rate to be able to say statistically
>>>whether they "generally" agree or not (i.e. do the critics all rate
>>
>>Bob has pointed out that it isn't generally one dimension
>>for a rating. SO the question is not very sound.
>>
>>I'll point out, further, that the easier sample size determinations
>>are for the *opposite* goal than yours -- showing that
>>if two raters disagree by "this much" might take 100
>>paired ratings, or 500, depending on what the amount is.
>>
>>Or, if they really *disagree* by a whole lot, it might take
>>relatively few.
>>
>>
>>What do you want say about agreements? Do you want
>>confidence about yes-no agreeing 90% of the time? (Is that
>>a base with 50% of each? -- I mostly try to see movies that
>>I will *like*, so I find 90% or more of my viewings to be at
>>least "okay".)
>>
>>
>>
>>
>>>the movies the same to some significance.) I have the luxury of
>>>knowing that all critics rate all of the movies and that each is an
>>>"expert". Two options are:
>>>
>>>1) each critic gives a thumbs up or thumbs down (1 or 0)
>>>2) each critic rates the movies on some scale (e.g., 0 to 4 stars)
>>>
>>>I'd like to say something like to %confidence level, all critics agree
>>>on the ratings (my H0) or don't agree. Again, I'm looking for the
>>>minimum number of movies to rate to be able to do some type of
>>>significance test.
>>
>>A significance *test* would show a difference; the absence
>>of a difference is "no-verdict" rather than saying
>>"On the average, they are assured of being the same."
>>Of course, the *average* could be the same, while they
>>disagree on every single movie, so that's something to
>>think of, too.
>>
>>You might want to imagine various numerical results,
>>where you think raters are "the same" or "different"; and
>>then see what a "test" says about them.
>>
>>--
>>Rich Ulrich, wpilib@pitt.edu
>>http://www.pitt.edu/~wpilib/index.html
>
>
--
Bob Wheeler --- http://www.bobwheeler.com/
ECHIP, Inc. ---
Randomness comes in bunches.