Re: How to select data points satisfying two constraints
- From: "Ray Koopman" <koopman@xxxxxx>
- Date: 8 Sep 2006 00:55:40 -0700
youngjin.michael@xxxxxxxxx wrote:
Hi,
I have two sets of data, both of which are specified by two properties
x and y.
Data1: (x_1, y_1), .... (x_n1, y_n1)
Data2: (x_1, y_1), .... (x_n2, y_n2)
where n1 is the number of data points in Data1 and n2 is the
number of data points in Data2.
For a given subset of Data1, I need to select an equivalent subset of
Data2. In particular, I want to select a subset of Data2 in such a way
that the probability density function of x and y of the selected subset
of Data2 is same as (or close to) the probability density function of x
and y of the given subset of Data1.
Is there a way to accomplish this goal?
Thanks in advance.
Young-Jin Lee
This is actually two separate questions. The first is how to decide
which of any two given subsets of Data2 has a distribution that is
closer to that of the given subset of Data1. The second is how to
pick the subset of Data2 that optimizes the distributional-closeness
criterion.
In order to answer question 1, you need to say if x and y are in
comparable units. Is the ratio SD(x)/SD(y) meaningful? Or is it
arbitrary, in the sense that standardizing each variable would
change nothing that matters?
If x and y are in arbitrary units, then you should scale x and/or y
to equate their SDs (or some such measure of variability) in Data1,
and then make the same transformation to Data2 (which will generally
not equate the SDs in Data2). It is not clear whether the SDs should
be equated in the full Data1 or in only the given subset.
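As a minimal sketch of that rescaling, assuming the points are stored as
(x, y) tuples and that the SDs are equated in whatever Data1 points are
passed in (the full set or the given subset; that choice is left open
above). The function name equate_sds and the sample numbers are made up
for illustration:

```python
from statistics import stdev

def equate_sds(data1, data2):
    """Rescale y so that SD(y) == SD(x) in data1; apply the same factor to data2."""
    xs = [x for x, _ in data1]
    ys = [y for _, y in data1]
    factor = stdev(xs) / stdev(ys)  # computed from data1 only

    def rescale(points):
        return [(x, y * factor) for x, y in points]

    return rescale(data1), rescale(data2), factor

# Hypothetical sample data, for illustration only.
data1 = [(1.0, 10.0), (2.0, 30.0), (3.0, 50.0)]
data2 = [(1.5, 20.0), (2.5, 20.0), (0.5, 60.0)]
scaled1, scaled2, factor = equate_sds(data1, data2)
```

After the call, x and y have equal SDs in scaled1, while scaled2 has only
been multiplied by the same factor, so its SDs will generally still differ.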
Once x and y are in comparable units you can attempt to answer the
first question. There is no one right answer. I suggest looking at
f = sum[(x1_i - x2_i)^2 + (y1_i - y2_i)^2], the sum of squares of the
Euclidean distances between corresponding points in the two subsets.
(I think that Mahalanobis distances would not be appropriate here.)
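To make the criterion concrete, here is a small sketch that computes f
for two subsets whose points are already paired up by position. The
function name sum_sq_dist and the sample points are assumptions for
illustration; the pairing itself is taken as given here:

```python
def sum_sq_dist(sub1, sub2):
    """f = sum of squared Euclidean distances between points paired by position."""
    if len(sub1) != len(sub2):
        raise ValueError("the two subsets must have the same number of points")
    return sum((x1 - x2) ** 2 + (y1 - y2) ** 2
               for (x1, y1), (x2, y2) in zip(sub1, sub2))

# Hypothetical example: one Data1 subset and two candidate Data2 subsets.
sub1 = [(0.0, 0.0), (1.0, 1.0)]
cand_a = [(0.0, 1.0), (1.0, 1.0)]
cand_b = [(3.0, 0.0), (1.0, 2.0)]
f_a = sum_sq_dist(sub1, cand_a)
f_b = sum_sq_dist(sub1, cand_b)
```

The candidate with the smaller f (cand_a in this made-up example) is the
one judged closer to the Data1 subset under this criterion.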
Which brings us to the second question, which itself has two parts:
how to assign each chosen Data2 point to a corresponding Data1 point
so as to minimize f, and then how to choose the best subset of Data2.
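The first part is known as the "assignment problem" (the linear one, not
the notoriously difficult "quadratic assignment problem", because f is
a sum of per-pair costs) and can be solved exactly by, e.g., the
Hungarian algorithm; the second part complicates matters. If n1 and n2
are small then a brute-force solution that looks at all n2!/(n2-n1)!
possibilities may be feasible; otherwise you will need an assignment
method that allows the two sets to differ in size, or you may have to
live with a solution that is not guaranteed to be optimal.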
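The brute-force search can be sketched as follows, assuming x and y have
already been put in comparable units and that the given Data1 subset has
no more points than Data2. The function name best_match_brute_force and
the sample points are made up for illustration. It tries every ordered
selection of Data2 points, which is exactly the n2!/(n2-n1)! count above,
so it is only practical for very small sets:

```python
from itertools import permutations

def best_match_brute_force(sub1, data2):
    """Try every ordered choice of len(sub1) points from data2; keep the one minimizing f."""
    if len(sub1) > len(data2):
        raise ValueError("data2 must have at least as many points as sub1")
    best_f, best_idx = float("inf"), None
    for idx in permutations(range(len(data2)), len(sub1)):
        # idx[i] is the data2 index matched to the i-th point of sub1
        f = sum((x1 - data2[j][0]) ** 2 + (y1 - data2[j][1]) ** 2
                for (x1, y1), j in zip(sub1, idx))
        if f < best_f:
            best_f, best_idx = f, idx
    return best_f, best_idx

# Hypothetical example data.
sub1 = [(0.0, 0.0), (2.0, 2.0)]
data2 = [(2.1, 1.9), (5.0, 5.0), (0.1, -0.1), (9.0, 0.0)]
best_f, best_idx = best_match_brute_force(sub1, data2)
```

Here best_idx gives, for each point of the Data1 subset in order, the
index of the Data2 point matched to it, so the chosen subset and the
point-to-point assignment come out of the same search.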