Re: Sorting and testing multiple distributions
- From: Richard Ulrich <Rich.Ulrich@xxxxxxxxxxx>
- Date: Wed, 27 Sep 2006 16:59:13 -0400
On Wed, 27 Sep 2006 11:41:18 +0200, Philip R Kensche
<pkensche@xxxxxxxxxx> wrote:
Hi,
I have 10 sample distributions of (benchmarking) values and want to sort
them and determine the significance of the differences. They are not
normally distributed, and each contains the same number of samples.
I was now thinking about different approaches to determine the order of
the distributions and how to determine the significance of the
differences:
I'm starting out pretty skeptical here, because most questions
that have ever been posted here about benchmarking have been
awfully naive. From the name of it, I tend to think of
benchmarking as an experiment that is largely exploratory,
so I am not especially concerned with much complexity in *tests*.
You have to start by being clear about the "dependencies."
What is your "sample distribution"?
Which numbers are fairly compared to each other? What does
it take to describe *all* the relevant results?
Is this computer software? Appropriate questions for software,
concerning what a 'sample distribution' is ...
- Is it re-run exactly, with different background activities?
(Average the results, mention odd ones.)
- Is it tested with different data sets?
(Matched comparisons *may* be appropriate.)
- Is it tested on different machines?
(What is the purpose? - testing software, or testing machines?)
- Or is one 'sample distribution' a comparison of several
implementations for the same task (for instance, compiler speeds)?
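For the "different data sets" case above, here is a minimal Python sketch of what a matched comparison could look like. The timings are made up for illustration, and the exact sign test is only one possible choice of paired test, not something prescribed in this thread. Each method is assumed to have been timed once on each of the same 10 data sets.

```python
from math import comb

def sign_test_p(diffs):
    """Exact two-sided sign test on paired differences; zero differences are dropped."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical timings (seconds) of two methods on the same 10 data sets.
times_a = [12, 15, 9, 20, 11, 14, 30, 8, 16, 13]
times_b = [14, 18, 10, 19, 13, 17, 35, 9, 15, 16]

# Pair the results by data set before comparing.
diffs = [a - b for a, b in zip(times_a, times_b)]
p_value = sign_test_p(diffs)
print(diffs, p_value)
```

The pairing is the point: the comparison is made within each data set, so differences between data sets do not inflate the noise. The sign test ignores the size of each difference, which may or may not be what you want.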
For benchmarking network performance -- graceful "failure under
excessive load" is a separate consideration from efficiency
with low loads. Do your tests distinguish the two?
Benchmarking is most convincing, in what I have read, when
the *functional* units are explained. That is to say: One disk
drive is "faster" because it has faster seek times, faster spin,
faster time-to-settle after a long seek. If these factors all line
up, it is hardly necessary to try to find a *statistical* test of
differences. If the factors are not lined up, it may be impossible
to draw any conclusion that does not depend on the exact
conditions of the test.
(1) sort the distributions by their means (they are not normally
distributed, however) and test distributions that are subsequent in the
ordered list by e.g. Kendall's tau to determine significance.
"Not normal" is a weak excuse for transforming the data.
Whether these are timings, or something else, any transformation
discards the original scaling. Was the scaling worthless? (If
duration is measured, you get 'speed' by taking its reciprocal.
That is an alternative to consider. And so on.)
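To illustrate the point about scaling, here is a minimal Python sketch. The method names and timings are hypothetical. It shows that ranking two methods by mean duration and ranking them by mean speed (the reciprocal of duration) can disagree, so the choice of scale is part of the question being asked.

```python
from statistics import mean

# Hypothetical durations in seconds for two methods on three tasks.
durations = {
    "A": [1.0, 1.0, 10.0],
    "B": [3.0, 3.0, 3.0],
}

# Speed is the reciprocal of duration (tasks per second).
speeds = {name: [1.0 / d for d in ds] for name, ds in durations.items()}

mean_duration = {name: mean(ds) for name, ds in durations.items()}
mean_speed = {name: mean(ss) for name, ss in speeds.items()}

# Lower mean duration is better; higher mean speed is better.
best_by_duration = min(mean_duration, key=mean_duration.get)
best_by_speed = max(mean_speed, key=mean_speed.get)

print("best by mean duration:", best_by_duration)
print("best by mean speed:", best_by_speed)
```

With these made-up numbers, B has the lower mean duration while A has the higher mean speed, because the single slow run of A dominates on one scale and barely matters on the other.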
If it is performance, the outliers can be important. I will probably
want to know that "Method A took twice as long" on the full
battery of tests, even if it happened to be fastest on 7 of 10 tests,
and only *terrible* on one.
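A small Python sketch of that situation, with invented timings chosen to match the description (fastest on 7 of 10 tests, twice as long overall). It reports the per-test win count next to the totals, so that neither summary hides the other.

```python
# Hypothetical timings (seconds) for two methods on a battery of 10 tests.
times_a = [9, 9, 9, 9, 9, 9, 9, 11, 11, 115]
times_b = [10] * 10

# How often is A the faster method on an individual test?
wins_a = sum(a < b for a, b in zip(times_a, times_b))

# How do the methods compare on the battery as a whole?
total_a, total_b = sum(times_a), sum(times_b)
ratio = total_a / total_b

# Index of the test where A loses the most time relative to B.
worst_test = max(range(len(times_a)), key=lambda i: times_a[i] - times_b[i])

print(f"A is faster on {wins_a} of {len(times_a)} tests")
print(f"total time: A={total_a}, B={total_b}, ratio={ratio}")
print(f"A's worst test is index {worst_test}")
```

A rank-based test sees only the win/loss pattern here; the total is driven by the one test a rank transformation would flatten.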
(2) first test for homogeneity with H-test (Kruskal, Wallis) and then use
one of the approaches for multiple pairwise comparisons of mean ranks
(chi^2; Harter, 1960; Tukey-Kramer) that are proposed by my statistics
book of choice (Sachs, Angewandte Statistik, [395]).
(3) test all pairs of distributions and use some multiple testing
correction (Bonferroni, Benjamini, or similar).
For (1) I am not even sure if it is a sound approach at all. (2) and (3)
appear to be correct solutions, but I am not absolutely sure. If they
are both correct which would be the best choice?
Does anybody have advice? I would appreciate any help. Thanks!
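For reference, approach (3) can be sketched in a few lines of Python. This assumes that "Benjamini" means the Benjamini-Hochberg step-up procedure, and the p-values below are invented for illustration; they would come from whatever pairwise test is chosen. With 10 distributions there are 45 pairs, so the correction is not a small matter.

```python
from itertools import combinations

def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k * alpha / m (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            largest_k = rank
    reject = [False] * m
    for i in order[:largest_k]:
        reject[i] = True
    return reject

# Number of pairwise comparisons among 10 distributions.
n_pairs = len(list(combinations(range(10), 2)))

# Hypothetical p-values from five pairwise tests.
example_p = [0.001, 0.012, 0.02, 0.035, 0.30]
print(n_pairs)
print(bonferroni(example_p))
print(benjamini_hochberg(example_p))
```

On the invented p-values, Bonferroni rejects only the smallest one while Benjamini-Hochberg rejects four of the five. The two control different error rates, so which is "best" depends on what kind of mistake matters more.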
"Statistical significance" is not the same as "meaningful
difference."
If you have functional explanations for superiority, you won't
have much need for precise statistical testing, considering the
benchmark studies that I have read. And if you don't have
the functional explanations, then you have to base your
conclusions strictly on 'this set of tests.'
Maybe "benchmarks" is the wrong word. If you are doing a
"controlled study" of relevant factors, then you might want
to consider precise statements.
--
Rich Ulrich, wpilib@xxxxxxxx
http://www.pitt.edu/~wpilib/index.html