Re: Statistical Ranking for Non-Normal Populations
From: George Kahrimanis (anakreon_at_hol.gr)
Date: 10/14/04
- Next message: Richard Ulrich: "Re: Find a distribution!"
- Previous message: Richard Ulrich: "Re: Stdev by dividation"
- In reply to: Peter Hach: "Statistical Ranking for Non-Normal Populations"
- Next in thread: George Kahrimanis: "Re: Statistical Ranking for Non-Normal Populations"
- Reply: George Kahrimanis: "Re: Statistical Ranking for Non-Normal Populations"
- Messages sorted by: [ date ] [ thread ]
Date: Thu, 14 Oct 2004 18:27:05 +0300
"Long message" alert :-[
"Non-parametric Predictive Inference" alert :-)
Peter Hach, on 13 Oct 2004 18:30:39 +0000 (UTC) wrote:
>I need to perform (statistical) ranking of a number of large, but
>finite populations X[i] = (x[i][1], ... ,x[i][n]) in a scenario
>where acquiring each x[i][j] is very expensive. I am looking for the
>population X[i] with the smallest Sum or Average over the x[i][j]
>(i.e. I am only interested in the top-ranked one).
This is not a trivial question, if the data are in short supply.
Here I propose a solution of the "nonparametric predictive" kind.
In this approach, it is not considered meaningful to ask a question
about the underlying pdfs themselves (inasmuch as there is zilch
prior knowledge) but we may pose a question related to the next
sampling of each of the separate populations (indexed by `i').
"Imagine a future sample, with one outcome for each i; what is
the probability that the outcome #1 will be the maximum in that lot?
What is the probability that the outcome #2 will be the maximum? And
so on."
Here is the foundation, in short. Without further assumptions about
the underlying processes, any assumed prior is arbitrary. On the other
hand, confidence intervals have been a disappointment (at least) in
other cases, so let us stay clear of them, or leave them to those who
have some use (like, what -- publish?) for them.
So what is left to do? Consider any sub-sample separately (i.e., for
any fixed i). Think of the next (i.e., future) outcome. Can you form
any prediction on the relative *rank* of the next outcome? The
obvious answer is that, if we have n past events of type i, the rank
of the next event among all n+1 will be 1, 2, ..., n+1 with equal
probability: 1/(n+1).
For references, see Section 3a of my
news:<3ce8f26b.0409070825.7c799b23@posting.google.com>
"prediction versus parameter estimation (was: literature: ...)"
(7 Sep 2004 09:25:57 -0700).
Sometimes this assignment of probability is regarded as a
separate assumption, but I think that the issue of foundation is
still open. (Check the references.)
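The equal-rank claim can be checked numerically with a small
simulation. The sketch below assumes independent draws from one
continuous distribution (the exponential law is an arbitrary choice,
not something from the original question); `rank_frequencies` is a
hypothetical helper name.

```python
import random
from collections import Counter

def rank_frequencies(n, trials=20000, seed=0):
    """Estimate how often the next draw takes each rank 1..n+1 among n past draws."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        # Assumption: i.i.d. draws from an arbitrary continuous law.
        past = [rng.expovariate(1.0) for _ in range(n)]
        nxt = rng.expovariate(1.0)
        # Rank 1 means the new outcome is below all past outcomes.
        rank = 1 + sum(1 for x in past if x < nxt)
        counts[rank] += 1
    return {r: counts[r] / trials for r in range(1, n + 2)}

freqs = rank_frequencies(n=4)
for r, f in freqs.items():
    print(r, round(f, 3))
```

With n = 4 each of the five ranks should come out close to 1/5.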
To be on the safe side, let us regard this assignment as an
assumption, "A_n", for now. The n outcomes of type i form n+1
plain intervals, when we also take into account +/- infinity or
the bounding values. According to A_n, the next type-i outcome falls
inside any given one of these n+1 intervals with probability 1/(n+1).
However, we have no way to define probability for any interval
other than those, and their unions. This knowledge is practically
equivalent (in terms of decision strategy) to an "inexactly defined"
(i.e., interval valued) cumulative distribution function ("DF").
EXAMPLE. Say we have 4 outcomes of type i; the DF below the lower
bound is 0 (or the interval [0, 0]); the DF between the lower bound
and the lowest outcome is the interval [0, 1/5]; the DF between the
lowest two outcomes is the interval [1/5, 2/5]; in the next interval,
the DF is the interval [2/5, 3/5]; then [3/5, 4/5]; between the
highest outcome and the upper bound it is [4/5, 1]; above the upper
bound, the DF has a single value: 1 (that is, [1, 1]).
(The value of the DF at each node is a detail.)
(We have ignored ties, for now; that issue is trivial.)
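The interval-valued DF of the example can be written down in a few
lines. This is a minimal sketch, assuming no ties and treating the
value exactly at a node as a detail; `df_interval` and its `lower` /
`upper` bound parameters are hypothetical names introduced here.

```python
from bisect import bisect_left
from fractions import Fraction

def df_interval(sample, y, lower=float("-inf"), upper=float("inf")):
    """Bounds (low, high) on P(next outcome <= y) under assumption A_n."""
    if y < lower:
        return Fraction(0), Fraction(0)
    if y >= upper:
        return Fraction(1), Fraction(1)
    xs = sorted(sample)
    n = len(xs)
    k = bisect_left(xs, y)  # number of observed outcomes strictly below y
    # The k intervals entirely below y carry mass k/(n+1) for certain;
    # the interval containing y may or may not contribute its 1/(n+1).
    return Fraction(k, n + 1), Fraction(k + 1, n + 1)

sample = [5.0, 2.0, 7.0, 3.5]          # 4 outcomes, made-up values
for y in (1.0, 3.0, 4.0, 6.0, 8.0):
    print(y, df_interval(sample, y, lower=0.0, upper=10.0))
```

The five printed intervals should reproduce the sequence [0, 1/5],
[1/5, 2/5], ... of the example.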
We have defined an interval-valued DF for each i, regarding the
next outcome of a type-i measurement.
To define the probability of the next type-1 event, Y_1, being
larger than the next type-2 event, Y_2, given the interval-valued
DF for each type, we can consider the (incompletely defined) random
variable Z_{1,2} == Y_1 - Y_2, and ask for the probability of
"Z_{1,2} > 0". Offhand, we expect that the result will be an
interval, like [p_1, p_2].
We could calculate the DF for Z_{1,2} if we also assumed a PDF
(probability density function) for each of Y_1 and Y_2. Although
no such PDFs are defined, we can define, for each i, the family
of PDFs that are compatible with the corresponding interval-valued
DF. Now take any two such PDFs, one for i=1 and the other for
i=2, and calculate the corresponding DF for Z_{1,2}. Let the
two PDFs vary, each in its family (say, we implement a Monte Carlo)
and (after many blind trials) we identify the minimal and maximal
values for the DF of Z_{1,2}. Fortunately, we need the extremal
values of the DF only at zero, so the number of calculations is
not prohibitively large for small samples.
It is a dumb, computation-intensive solution. I am almost sure that
an elegant one exists, but I still need to work on some fine points
in the proof.
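The brute-force search described above can be sketched as follows.
This is only an illustration under extra assumptions that are not in
the original: finite bounds `lower` and `upper` are known, and each
candidate distribution is a discrete one with a single atom of mass
1/(n+1) placed uniformly at random inside each interval (a restricted
sub-family of the compatible PDFs). All function names are
hypothetical.

```python
import random

def random_atoms(sample, lower, upper, rng):
    """One atom of mass 1/(n+1) placed at random inside each of the n+1 intervals."""
    edges = [lower] + sorted(sample) + [upper]
    return [rng.uniform(a, b) for a, b in zip(edges, edges[1:])]

def prob_greater(atoms1, atoms2):
    """P(Y_1 > Y_2) for two independent, equally weighted discrete variables."""
    wins = sum(1 for a in atoms1 for b in atoms2 if a > b)
    return wins / (len(atoms1) * len(atoms2))

def mc_bounds(sample1, sample2, lower, upper, trials=5000, seed=0):
    """Smallest and largest P(Z_{1,2} > 0) seen over many blind trials."""
    rng = random.Random(seed)
    lo, hi = 1.0, 0.0
    for _ in range(trials):
        p = prob_greater(random_atoms(sample1, lower, upper, rng),
                         random_atoms(sample2, lower, upper, rng))
        lo, hi = min(lo, p), max(hi, p)
    return lo, hi

# Made-up data: three outcomes of each type, bounded in [0, 10].
lo, hi = mc_bounds([3.0, 6.0, 8.0], [2.0, 4.0, 5.0], 0.0, 10.0)
print(lo, hi)
```

For these samples the result cannot leave [7/16, 14/16], the values
obtained by pushing every atom to an endpoint of its interval; the
random search only approaches those extremes from inside.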
We continue with Z_{1,3}, ..., Z_{i',i''}, ... and find the
probability interval for each Y_i' to be larger than Y_i''.
By treating outcomes as independent, we form the probability
that Y_1 is maximum, or Y_2 is maximum, and so on.
(Of course, they will come out as interval-valued probabilities.)
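One possible way to get such an interval for "Y_i is maximum" is shown
below. Note that this is a swapped-in shortcut, not necessarily the
construction intended above: instead of combining the pairwise
intervals, it bounds the probability directly by pushing the mass of
each interval to an endpoint (left for Y_i and right for the others
gives the lower value, and the reverse gives the upper value). It
assumes independent outcomes and no ties; `max_prob_bounds` is a
hypothetical name.

```python
from fractions import Fraction

def _prob_max(atoms_i, other_atoms):
    """P(Y_i exceeds every other Y_k) for independent, equally weighted atoms."""
    total = Fraction(0)
    for a in atoms_i:
        term = Fraction(1, len(atoms_i))
        for atoms_k in other_atoms:
            below = sum(1 for b in atoms_k if b < a)
            term *= Fraction(below, len(atoms_k))
        total += term
    return total

def max_prob_bounds(samples, i, lower=float("-inf"), upper=float("inf")):
    """Bounds on P(next type-i outcome is the largest), assuming no ties."""
    left = [[lower] + sorted(s) for s in samples]   # mass at left endpoints
    right = [sorted(s) + [upper] for s in samples]  # mass at right endpoints
    others = [k for k in range(len(samples)) if k != i]
    low = _prob_max(left[i], [right[k] for k in others])
    high = _prob_max(right[i], [left[k] for k in others])
    return low, high

# Made-up data: three types with three outcomes each.
samples = [[3, 6, 8], [2, 4, 5], [1, 7, 9]]
for i in range(len(samples)):
    print(i, max_prob_bounds(samples, i))
```

The intervals for the different i need not add up to 1 at either end;
that is expected for interval-valued probabilities.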
I am sorry for the length, but imho this problem is worth it!
Thanks for the problem! ~ George Kahrimanis