Re: Sorting and testing multiple distributions



On Wed, 27 Sep 2006 11:41:18 +0200, Philip R Kensche
<pkensche@xxxxxxxxxx> wrote:

Hi,

I have 10 sample distributions of (benchmarking) values and want to sort
them and determine significance of differences. They are not normally
distributed and consist of the identical numbers of samples.

I was now thinking about different approaches to determine the order of
the distributions and how to determine the significance of the
differences:

I'm starting out pretty skeptical here, because most questions
that have been posted here about benchmarking have been
awfully naive. From the name of it, I tend to think of
benchmarking as an experiment that is largely exploratory,
so I am not especially concerned with much complexity in *tests*.

You have to start by being clear about the "dependencies."
What is your "sample distribution"?
Which numbers are fairly compared to each other? What does
it take to describe *all* the relevant results?


Is this computer software? Appropriate questions for software,
concerning what a 'sample distribution' is ...
- Is it re-run exactly, with different background activities?
(Average the results, mention odd ones.)
- Is it tested with different data sets?
(Matched comparisons *may* be appropriate.)
- Is it tested on different machines?
(What is the purpose? - testing software, or testing machines?)
- Or is one 'sample distribution' a comparison of several
implementations for the same task (for instance, compiler speeds)?


Benchmarking network performance -- Graceful "failure under
excessive load" is a separate consideration from efficiency
with low loads. Do your tests distinguish the elements?

Benchmarking is most convincing, in what I have read, when
the *functional* units are explained. That is to say: One disk
drive is "faster" because it has faster seek times, faster spin,
faster time-to-settle after a long seek. If these factors all line
up, it is hardly necessary to try to find a *statistical* test of
differences. If the factors are not lined up, it may be impossible
to draw any conclusion that does not depend on the exact
conditions of the test.



(1) sort the distributions by their means (they are not normally
distributed, however) and test distributions that are subsequent in the
ordered list by e.g. Kendall's tau to determine significance.

"Not normal" is a weak excuse for transforming the data.

Whether these are timings, or something else, any transformation
discards the original scaling. Was the scaling worthless? (If
duration is measured, you get 'speed' by taking its reciprocal.
That is an alternative to consider. And so on.)
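
To make the point about scaling concrete, here is a minimal sketch in Python. The timings are invented for illustration and are not from the original question. It shows that the ordering of two methods can flip depending on whether you average durations or their reciprocals (speeds).

```python
# Hypothetical timings in seconds for two methods on the same three tasks.
durations_a = [1.0, 1.0, 10.0]   # usually quick, one slow run
durations_b = [3.0, 3.0, 3.0]    # steady

def mean(xs):
    return sum(xs) / len(xs)

def speeds(durations):
    # Speed (tasks per second) is the reciprocal of the duration.
    return [1.0 / d for d in durations]

mean_duration_a = mean(durations_a)        # 4.0 seconds
mean_duration_b = mean(durations_b)        # 3.0 seconds
mean_speed_a = mean(speeds(durations_a))   # about 0.70 tasks per second
mean_speed_b = mean(speeds(durations_b))   # about 0.33 tasks per second

# B looks better on mean duration, A looks better on mean speed.
print(mean_duration_a, mean_duration_b, mean_speed_a, mean_speed_b)
```

Which scale is the right one depends on what the numbers are meant to describe, so the choice should be made before any test is picked.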

If it is performance, the outliers can be important. I will probably
want to know that "Method A took twice as long" on the full
battery of tests, even if it happened to be fastest on 7 of 10 tests,
and only *terrible* on one.
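
A small sketch of that situation, in Python with made-up numbers (both methods and all timings are hypothetical): counting wins per test and summing the total time can tell opposite stories, which is why a rank-based summary alone may hide what matters.

```python
# Hypothetical timings (seconds) for two methods on a battery of 10 tests.
times_a = [1, 1, 1, 1, 1, 1, 1, 3, 3, 27]   # fast on most tests, terrible on one
times_b = [2] * 10                           # steady on every test

# Number of tests on which A is strictly faster than B.
wins_a = sum(1 for a, b in zip(times_a, times_b) if a < b)

total_a = sum(times_a)
total_b = sum(times_b)
ratio = total_a / total_b   # how much longer A takes on the full battery

print(wins_a, total_a, total_b, ratio)
```

Here A is fastest on 7 of the 10 tests and still takes twice as long overall.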



(2) first test for homogeneity with H-test (Kruskal, Wallis) and then use
one of the approaches for multiple pairwise comparisons of mean ranks
(chi^2; Harter, 1960; Tukey-Kramer) that are proposed by my statistics
book of choice (Sachs, Angewandte Statistik, [395]).

(3) test all pairs of distributions and use some multiple testing
correction (Bonferroni, Benjamini, or similar).
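
For approach (3), the corrections themselves are simple to compute once the pairwise p-values exist. The following is a minimal Python sketch, assuming the p-values come from whatever pairwise test is chosen; the example p-values are invented. It gives Bonferroni-adjusted p-values and Benjamini-Hochberg adjusted p-values (the latter controls the false discovery rate rather than the family-wise error rate).

```python
def bonferroni(pvalues):
    # Multiply each p-value by the number of tests, capped at 1.
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    # Step-up adjustment: p * m / rank, made monotone from the largest p down.
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# With 10 distributions there are 10 * 9 / 2 pairwise comparisons.
n_pairs = 10 * 9 // 2

# Invented p-values, only to show the mechanics.
example = [0.01, 0.04, 0.03, 0.5]
bonf = bonferroni(example)
bh = benjamini_hochberg(example)
print(n_pairs, bonf, bh)
```

With 45 comparisons, Bonferroni requires an unadjusted p below roughly 0.0011 to reach an adjusted 0.05, which is worth keeping in mind when the samples are small.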

For (1) I am not even sure if it is a sound approach at all. (2) and (3)
appear to be correct solutions, but I am not absolutely sure. If they
are both correct which would be the best choice?

Does anybody have advice? I would appreciate any help. Thanks!

"Statistical significance" is not the same as "meaningful
difference."

If you have functional explanations for superiority, you won't
have much need for precise statistical testing, considering the
benchmark studies that I have read. And if you don't have
the functional explanations, then you have to base your
conclusions strictly on 'this set of tests.'

Maybe "benchmarks" is the wrong word. If you are doing a
"controlled study" of relevant factors, then you might want
to consider precise statements.



--
Rich Ulrich, wpilib@xxxxxxxx
http://www.pitt.edu/~wpilib/index.html


