Re: Estimate if two bivariate sets are statistically different



On Jul 15, 2:14 pm, Ray Koopman <koop...@xxxxxx> wrote:
On Jul 15, 10:14 am, "m.s." <deviceran...@xxxxxxxxx> wrote:





Hi,

I have the following problem. I have two datasets that are evaluated
along two different variables. By a simple scatterplot, it seems that
dataset A follows a peculiar distribution ("horseshoe"-like, non-
normal). Dataset B seems to be more widespread and shifted towards
higher Y values, but it's only a few points.

What I would like to have is, for example, the probability that dataset
B comes from a distribution like A, or any other meaningful measure of
the difference between datasets A and B.

I've found some concepts, like the Mahalanobis distance or the
Hotelling T-square distribution, that could be useful, but these seem
to require that the data be normally distributed in one sense or
another. However, I feel that there should be some kind of general,
obvious method.

Is there such a method, and, if there is:
- where can I find info on it?
- possibly, where can I look for a tutorial on how to implement it
in practice?

My statistics knowledge is poor, so bear with me.

Thanks a lot,
m.

Hey Ray,

If it's okay, I have a couple of questions/comments interspersed below
your comments.


Do a scatterplot of the merged datasets, with nothing to show which
point came from which set.

1. Merge both datasets...

First dataset:
subj  var1  var2
1     45    35
2     23    45
3     17    57

Second dataset:
subj  var1  var2
4     27    31
5     29    36
6     41    48

Merged dataset:

subj  var1  var2
1     45    35
2     23    45
3     17    57
4     27    31
5     29    36
6     41    48
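
A minimal sketch of the merge step in Python, assuming the rows are held as (subj, var1, var2) tuples. The function name blind_merge and the seed are made up for illustration. Since subj itself reveals which set a row came from, the sketch drops it, shuffles the rows, and keeps the dataset labels in a separate key that the person drawing the regions never sees:

```python
import random

# Example rows from above: (subj, var1, var2)
dataset1 = [(1, 45, 35), (2, 23, 45), (3, 17, 57)]
dataset2 = [(4, 27, 31), (5, 29, 36), (6, 41, 48)]

def blind_merge(first, second, seed=0):
    """Return (points, key): shuffled (var1, var2) pairs and their hidden labels."""
    labeled = [(v1, v2, "Dataset1") for _, v1, v2 in first]
    labeled += [(v1, v2, "Dataset2") for _, v1, v2 in second]
    # Shuffle so that row order does not give away the origin of a point.
    random.Random(seed).shuffle(labeled)
    points = [(v1, v2) for v1, v2, _ in labeled]  # shown to the partitioner
    key = [label for _, _, label in labeled]      # kept aside until the test
    return points, key

points, key = blind_merge(dataset1, dataset2)
```

Only points would go into the scatterplot; key is used afterwards to count how many points of each dataset fall in each region.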

2. Do a scatterplot

Partition the plane -- perhaps, but not
necessarily, by grid lines; if there appear to be natural breaks they
may be used -- into regions that are as compact as possible,

In other words, look at the scatterplot and see if points tend to
cluster together in regions?

subject
to the constraint that each region have at least 5(n1+n2)/min(n1,n2)
points, where n1 and n2 are the two sample sizes.

I'm not sure I'm calculating this formula correctly. Are you saying that
there should always be at least 10 points per region?
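
For what it's worth, here is the formula worked through as a small Python sketch (the function name is made up). My reading, which is an assumption and not something stated in the quoted text, is that the bound is there so that the smaller sample has an expected count of at least 5 in every region:

```python
def min_points_per_region(n1, n2):
    """Smallest whole number of points satisfying 5*(n1+n2)/min(n1,n2)."""
    numerator = 5 * (n1 + n2)
    smaller = min(n1, n2)
    # Integer ceiling division, so a fractional bound is rounded up.
    return -(-numerator // smaller)

equal_sizes = min_points_per_region(3, 3)       # 5*6/3    = 10
unequal_sizes = min_points_per_region(100, 20)  # 5*120/20 = 30
```

So the answer is 10 only when n1 = n2; with unequal sample sizes the minimum is larger. The six-point example in this post cannot meet the bound at all, so it only illustrates the mechanics.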

This should be done
by someone who knows nothing about either dataset or how they might
differ. Then do a 2 x R chi-square test comparing the distributions
of the two datasets over the R regions.

I'm having a hard time visualizing this. Let's say from the example
above that there are two distinct regions in the merged dataset.

Region 1 Points:

45, 35 = Dataset1
23, 45 = Dataset1
27, 31 = Dataset2

Region 2 Points:

17, 57 = Dataset1
29, 36 = Dataset2
41, 48 = Dataset2

So the 2x2 table would look like this?

          Dataset1  Dataset2
Region 1  2         1
Region 2  1         2


If this results in a significant chi-square, then that is evidence
that the two datasets come from different distributions.
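
A sketch of the computation in plain Python, assuming the table is given as one (Dataset1 count, Dataset2 count) pair per region and that no region or dataset is empty. The function names are made up. It computes the ordinary Pearson chi-square with no continuity correction, and the p-value helper only covers the 1 degree of freedom case (R = 2):

```python
import math

def chi_square_2xR(counts):
    """Pearson chi-square for a list of (dataset1_count, dataset2_count) per region."""
    col_totals = [sum(row[j] for row in counts) for j in range(2)]
    total = sum(col_totals)
    stat = 0.0
    for row in counts:
        row_total = sum(row)
        for j in range(2):
            # Expected count if both datasets shared one distribution over regions.
            expected = row_total * col_totals[j] / total
            stat += (row[j] - expected) ** 2 / expected
    df = len(counts) - 1  # (2 - 1) * (R - 1)
    return stat, df

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

stat, df = chi_square_2xR([(2, 1), (1, 2)])
p = p_value_df1(stat)
```

For the table above, every expected count is 1.5, so the statistic is 4 * 0.25 / 1.5 = 0.667 on 1 df, nowhere near significant. With cells this small the chi-square approximation is not trustworthy anyway. If SciPy is available, scipy.stats.chi2_contingency handles any R, but note that by default it applies a continuity correction to 2x2 tables.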

Am I way off here???

Thanks!

Ryan
