Re: Estimate if two bivariate sets are statistically different



On Jul 16, 11:26 am, "m.s." <deviceran...@xxxxxxxxx> wrote:
On Jul 16, 5:57 pm, Ray Koopman <koop...@xxxxxx> wrote:
On Jul 16, 2:53 am, "m.s." <deviceran...@xxxxxxxxx> wrote:
On Jul 16, 1:03 am, Ray Koopman <koop...@xxxxxx> wrote:
Do a scatterplot of the merged datasets, with nothing to show which
point came from which set. Partition the plane -- perhaps, but not
necessarily, by grid lines; if there appear to be natural breaks they
may be used -- into regions that are as compact as possible, subject
to the constraint that each region have at least 5(n1+n2)/min(n1,n2)
points, where n1 and n2 are the two sample sizes. This should be done
by someone who knows nothing about either dataset or how they might
differ. Then do a 2 x R chi-square test comparing the distributions
of the two datasets over the R regions.
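
A minimal sketch of the 2 x R test described above, in plain Python. The quadrant partition, the simulated data and all function names are illustrative assumptions, not part of the procedure as posted; in practice the regions would be drawn by hand by someone blind to the set labels.

```python
import math
import random

def min_region_size(n1, n2):
    """Smallest allowed number of points per region: 5(n1+n2)/min(n1,n2)."""
    return 5 * (n1 + n2) / min(n1, n2)

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    terms = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms

def chi_square_2xR(counts_a, counts_b):
    """Pearson chi-square for a 2 x R table: (statistic, df, p-value)."""
    n1, n2 = sum(counts_a), sum(counts_b)
    total = n1 + n2
    needed = min_region_size(n1, n2)
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        if a + b < needed:
            raise ValueError("region too small; merge it with a neighbour")
        for obs, row_total in ((a, n1), (b, n2)):
            expected = row_total * (a + b) / total
            stat += (obs - expected) ** 2 / expected
    df = len(counts_a) - 1
    return stat, df, chi2_sf(stat, df)

def quadrant(p):
    """Hypothetical partition: the four quadrants around the origin."""
    return (p[0] >= 0) + 2 * (p[1] >= 0)

# Assumed example data: two samples from the same bivariate normal.
rng = random.Random(0)
set_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
set_b = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
counts_a = [sum(quadrant(p) == r for p in set_a) for r in range(4)]
counts_b = [sum(quadrant(p) == r for p in set_b) for r in range(4)]

stat, df, p_value = chi_square_2xR(counts_a, counts_b)
print("chi-square =", round(stat, 3), "df =", df, "p =", round(p_value, 3))
```

The size check reflects the stated constraint: a region holding 5(n1+n2)/min(n1,n2) points has an expected count of 5 for the smaller sample.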

I thought something similar:
- do a bivariate kernel density estimation on dataset A
- choose an arbitrary isodensity contour line that encloses an area
with most of the pattern of A
- see how many points of B are inside or outside A

but this:
- does not see if points of B are evenly distributed inside A or not
- is not sensitive to the distance of points of B from the "good" area

so it doesn't seem the best solution to me.

The procedure I described explicitly avoids using any "interesting"
or "special" features of either sample distribution alone, on the
grounds that that would be taking advantage of chance. I see no way
to adjust for such opportunism, which would tend to wrongly reject
the null too often, because the set of potential features was not
specified a priori.

Ok, now I understand. Still, it seems pretty sensitive to the
arbitrary grid construction.
Should I try with more than one kind of grid and see if tests are
consistent?

Yes, the result can depend to some extent on how the grid is placed.
The question is similar to (but a little more complicated than)
asking how much the look of a histogram depends on the width of
the interval and where the boundaries fall, and the answer is also
similar: try changing things a little, and see how much it matters.
If the chi-square and p-value don't change much then you can be
confident in whatever conclusion you reach. But if they do change
substantially then the best you can say is that you have too little
information to reach a firm conclusion, that you need either more
data or an independent -- i.e., not based on your current data --
restriction on the possible distributions that must be considered.
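
One rough way to "change things a little" is to move the region boundaries and recompute the statistic each time. The sketch below is only an assumption-laden stand-in: it uses a crude four-region split at a movable point (x0, y0), simulated data, and hypothetical function names, and it does not enforce the minimum region size.

```python
import random

def pearson_2xR(counts_a, counts_b):
    """Pearson chi-square statistic for a 2 x R table of region counts."""
    n1, n2 = sum(counts_a), sum(counts_b)
    total = n1 + n2
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        if a + b == 0:
            continue  # an empty region contributes nothing
        for obs, row_total in ((a, n1), (b, n2)):
            expected = row_total * (a + b) / total
            stat += (obs - expected) ** 2 / expected
    return stat

def quadrant_counts(points, x0, y0):
    """Count points in the four regions cut by the lines x = x0 and y = y0."""
    counts = [0, 0, 0, 0]
    for x, y in points:
        counts[(x >= x0) + 2 * (y >= y0)] += 1
    return counts

# Assumed example data: two samples from the same bivariate normal.
rng = random.Random(0)
set_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
set_b = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]

# Shift the boundaries a little in each direction and recompute.
shifts = [-0.25, 0.0, 0.25]
stats = {}
for x0 in shifts:
    for y0 in shifts:
        stats[(x0, y0)] = pearson_2xR(quadrant_counts(set_a, x0, y0),
                                      quadrant_counts(set_b, x0, y0))

for split, value in sorted(stats.items()):
    print(split, round(value, 3))
print("range of chi-square:", round(max(stats.values()) - min(stats.values()), 3))
```

If the statistic stays on the same side of the critical value for every placement, the conclusion does not hinge on the grid; if it does not, that is the "too little information" case.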

Or that you may use kernel density estimation, I'd say... (You still
have the kernel bandwidth problem, but you avoid the boundaries
problem and for the bandwidth/bin size, there are calculations that
give safe indications...)

Here is a Monte Carlo test of a simplified version of the procedure
you propose.

Generate 2n i.i.d. bivariate normal observations. Randomly label n
of them as comprising set A, and the other n as set B. (Now comes
the simplification.) Instead of fitting a kernel density to set A,
get its mean vector and covariance matrix, and use those to find
the median Mahalanobis distance of the points in set A from their
centroid. Then get the Mahalanobis distances for set B, using the
mean vector and covariance matrix from set A, and count how many of
those distances are smaller than the median of the set A distances.
By your argument, there should be about n/2 of them. Extending your
argument, the count should be a Binomial[n,1/2] random variable,
with mean = n/2 and standard deviation = sqrt[n/4].

I did the above 10^4 times, with n = 36.
Here is the observed distribution of counts:

4 6
5 5
6 18
7 47
8 72
9 148
10 228
11 364
12 479
13 603
14 699
15 867
16 956
17 931
18 911
19 871
20 822
21 589
22 479
23 347
24 230
25 147
26 88
27 49
28 28
29 11
30 1
31 3
34 1

The observed mean & s.d. are 17.052 & 4.06113.
The theoretical mean & s.d. are 18 & 3.
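
As a check on the arithmetic, the summary figures can be recomputed from the frequency table above. This sketch assumes the s.d. was computed with the usual n-1 divisor; the variable names are illustrative.

```python
import math

# count -> number of trials, copied from the table above
observed = {
    4: 6, 5: 5, 6: 18, 7: 47, 8: 72, 9: 148, 10: 228, 11: 364,
    12: 479, 13: 603, 14: 699, 15: 867, 16: 956, 17: 931, 18: 911,
    19: 871, 20: 822, 21: 589, 22: 479, 23: 347, 24: 230, 25: 147,
    26: 88, 27: 49, 28: 28, 29: 11, 30: 1, 31: 3, 34: 1,
}

trials = sum(observed.values())
obs_mean = sum(k * f for k, f in observed.items()) / trials
# Sample variance with the n-1 divisor (an assumption about the original).
obs_var = sum(f * (k - obs_mean) ** 2 for k, f in observed.items()) / (trials - 1)
obs_sd = math.sqrt(obs_var)

n = 36
theory_mean = n / 2            # Binomial[n, 1/2] mean
theory_sd = math.sqrt(n / 4)   # Binomial[n, 1/2] standard deviation

print("trials:", trials)
print("observed mean, s.d.:", round(obs_mean, 3), round(obs_sd, 5))
print("theoretical mean, s.d.:", theory_mean, theory_sd)
```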

If we were using the normal approximation to the binomial to test
the hypothesis that p = 1/2, we would reject at the .05 level
whenever |count - 18| > 6. If the test procedure is valid then,
when the A and B samples come from the same population (as they
did here), we should reject 5% of the time; i.e., there should be
500 rejections in 10^4 trials. Here is the observed distribution
of |count - 18|:

0 911
1 1802
2 1778
3 1456
4 1178
5 950
6 709
7 511
8 316
9 197
10 100
11 58
12 19
13 8
14 6
16 1

There are 1216 instances of |count - 18| > 6, for a rejection rate
of approximately 12%, which is 2.4 times the nominal rate of 5%.
I suspect that kernel density fitting would give even worse results.
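
For anyone who wants to repeat the experiment, here is a minimal sketch of the simulation in plain Python. It is not the code used for the figures above: the function names, the seed and the smaller default number of trials are assumptions. The first n generated points are taken as set A, which for i.i.d. draws is equivalent to random labelling, and the rejection threshold is written as two null standard deviations, which equals 6 when n = 36.

```python
import math
import random
import statistics

def mean_and_cov(points):
    """Sample mean vector and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the explicit 2 x 2 inverse."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    det = sxx * syy - sxy * sxy
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def one_trial(n, rng):
    """Count the set-B points closer to A's centroid than A's median distance."""
    pts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2 * n)]
    set_a, set_b = pts[:n], pts[n:]
    mean, cov = mean_and_cov(set_a)
    # Squared distances preserve ordering, so the square root is not needed.
    cutoff = statistics.median(mahalanobis_sq(p, mean, cov) for p in set_a)
    return sum(mahalanobis_sq(p, mean, cov) < cutoff for p in set_b)

def simulate(trials=2000, n=36, seed=1):
    """Return (mean count, s.d. of counts, rejection rate)."""
    rng = random.Random(seed)
    counts = [one_trial(n, rng) for _ in range(trials)]
    threshold = 2 * math.sqrt(n / 4)  # equals 6 when n = 36
    rejections = sum(abs(c - n / 2) > threshold for c in counts)
    return statistics.mean(counts), statistics.stdev(counts), rejections / trials

def exact_binomial_tail(n, distance):
    """P(|X - n/2| > distance) for X ~ Binomial(n, 1/2), for comparison."""
    hits = sum(math.comb(n, k) for k in range(n + 1) if abs(k - n / 2) > distance)
    return hits / 2 ** n

mean_count, sd_count, reject_rate = simulate()
print("mean count:", round(mean_count, 3), "s.d.:", round(sd_count, 3))
print("rejection rate:", round(reject_rate, 4))
print("exact binomial tail beyond 6:", round(exact_binomial_tail(36, 6), 4))
```

Because the Mahalanobis distance is invariant to affine transformations, generating uncorrelated standard normal coordinates loses no generality for the bivariate normal case.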