Re: Estimate if two bivariate sets are statistically different



On Jul 15, 6:02 pm, Ray Koopman <koop...@xxxxxx> wrote:
On Jul 15, 12:14 pm, Ryan <Ryan.Andrew.Bl...@xxxxxxxxx> wrote:





On Jul 15, 2:14 pm, Ray Koopman <koop...@xxxxxx> wrote:
On Jul 15, 10:14 am, "m.s." <deviceran...@xxxxxxxxx> wrote:

Hi,

I have the following problem. I have two datasets that are
evaluated along two different variables. By a simple scatterplot,
it seems that dataset A follows a peculiar distribution
("horseshoe"-like, non-normal). Dataset B seems to be more
widespread and shifted towards higher Y values, but it's only a
few points.

What I would like to have is, for example, the probability that
dataset B comes from a distribution like A, or any other
meaningful measure of the difference between datasets A and B.

I've found some concepts like the Mahalanobis distance or the
Hotelling T-square distribution, that could be useful, but these
seem to require the data are normally distributed in one sense
or the other. However I feel that there should be some kind of
general, obvious method.

Is there such a method, and, if there is:
- where can I find info on that
- possibly, where can I look for a tutorial on how to
practically implement it?

My statistics knowledge is poor, so bear with me.

Thanks a lot,
m.

Hey Ray,

If it's okay, I have a couple of questions/comments
interspersed below your comments.

Do a scatterplot of the merged datasets, with
nothing to show which point came from which set.

1. Merge both datasets...

First dataset:
subj  var1  var2
1     45    35
2     23    45
3     17    57

Second dataset:
subj  var1  var2
4     27    31
5     29    36
6     41    48

Merged dataset:

subj  var1  var2
1     45    35
2     23    45
3     17    57
4     27    31
5     29    36
6     41    48

2. Do a scatterplot

Partition the plane -- perhaps, but not necessarily, by
grid lines; if there appear to be natural breaks they may
be used -- into regions that are as compact as possible,

In other words, look at the scatterplot and see
if points tend to cluster together in regions?

When you create the regions, you want to "carve nature at its
joints" if there are any. You wouldn't want to put a boundary
through the middle of an obvious cluster if a minor shift could
avoid it. If I were doing it I'd probably start with some sort
of grid (not necessarily rectangular) and then adjust as needed.
Also, you want to attend to both variables approximately equally.
(E.g., you wouldn't want only thin vertical strips, because that
would ignore the y-variable.)
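
As a rough illustration of the grid starting point described above, here is a minimal Python sketch that assigns each point to a rectangular grid cell and counts, per cell, how many points came from each dataset. The function names, the helper layout, and the cut points (30 on var1, 40 on var2) are hypothetical choices made only for this example; in practice the boundaries would be drawn from the unlabeled scatterplot and adjusted around any natural clusters.

```python
from collections import Counter

def region_of(point, x_edges, y_edges):
    """Return the (column, row) index of the grid cell containing the point."""
    x, y = point
    col = sum(x >= edge for edge in x_edges)  # number of x cut points at or below x
    row = sum(y >= edge for edge in y_edges)  # number of y cut points at or below y
    return (col, row)

def tabulate(dataset1, dataset2, x_edges, y_edges):
    """Count points per region: {region: [count_in_dataset1, count_in_dataset2]}."""
    counts1 = Counter(region_of(p, x_edges, y_edges) for p in dataset1)
    counts2 = Counter(region_of(p, x_edges, y_edges) for p in dataset2)
    regions = sorted(set(counts1) | set(counts2))
    return {r: [counts1[r], counts2[r]] for r in regions}

# Toy data from this thread; the cut points are arbitrary illustrations.
dataset1 = [(45, 35), (23, 45), (17, 57)]
dataset2 = [(27, 31), (29, 36), (41, 48)]
table = tabulate(dataset1, dataset2, x_edges=[30], y_edges=[40])

for region, (c1, c2) in table.items():
    print(region, c1, c2)
```

Only non-empty cells appear in the result. Cells with too few points would then be merged with neighbours before any test is run.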



subject to the constraint that each region
have at least  5(n1+n2)/min(n1,n2)  points,
where n1 and n2 are the two sample sizes.

I'm not calculating this formula correctly. Are you saying
that there should always be at least 10 points per region?

Each region will give rise to two cells in the 2 x R contingency
table. The expected frequency in each cell should be at least 5.
So yes, that means that each region should have at least 10 points
if n1 = n2, and more otherwise. (Remember, "expecteds should be at
least 5" is only a rule of thumb and should not be interpreted too
rigidly.)
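
The arithmetic behind that minimum can be sketched in a few lines of Python. Under the hypothesis that both datasets share one distribution, a region holding m points has expected count m * min(n1,n2) / (n1+n2) in the cell for the smaller sample, so requiring that to be at least 5 gives m >= 5(n1+n2)/min(n1,n2). The function name and the `min_expected` parameter are made up for this illustration.

```python
import math

def min_points_per_region(n1, n2, min_expected=5):
    """Smallest region size whose expected count in the smaller sample
    reaches min_expected, rounded up to a whole number of points."""
    return math.ceil(min_expected * (n1 + n2) / min(n1, n2))

print(min_points_per_region(50, 50))   # equal sample sizes
print(min_points_per_region(100, 20))  # unequal sample sizes
```

With equal sample sizes this gives 10 points per region; with n1 = 100 and n2 = 20 it gives 30.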







This should be done by someone who knows nothing about either
dataset or how they might differ. Then do a 2 x R chi-square test
comparing the distributions of the two datasets over the R regions.

I'm having a hard time visualizing this. Let's say from the example
above that there are two distinct regions in the merged dataset.

Region 1 Points:

45, 35  = Dataset1
23, 45  = Dataset1
27, 31  = Dataset2

Region 2 Points:

17, 57  = Dataset1
29, 36  = Dataset2
41, 48  = Dataset2

So the 2 x 2 table would look like this?

           Dataset1    Dataset2

Region 1      2           1

Region 2      1           2

If this results in a significant chi-square, then this provides
evidence that the two datasets have different distributions.

Am I way off here???

No, you've got the idea.
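
For anyone who wants to try the calculation, here is a minimal sketch of the Pearson chi-square test on a table of region-by-dataset counts, using only the Python standard library. The function names are hypothetical, and the tail probability uses the standard closed-form expressions for integer degrees of freedom rather than a statistics package. It assumes every row and column total is nonzero.

```python
import math

def chi_square_2xR(table):
    """Pearson chi-square statistic and degrees of freedom for a list of
    [count_dataset1, count_dataset2] rows, one row per region."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for row, rt in zip(table, row_totals):
        for observed, ct in zip(row, col_totals):
            expected = rt * ct / n  # expected count if the distributions match
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df

def chi_square_sf(x, df):
    """Upper-tail probability P(X >= x) for chi-square with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        term = math.exp(-half)  # k = 0 term of the finite sum
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    total = math.erfc(math.sqrt(half))  # exact tail for df = 1
    term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (k + 0.5)
    return total

# The 2 x 2 table from the example above: rows are regions, columns datasets.
stat, df = chi_square_2xR([[2, 1], [1, 2]])
p = chi_square_sf(stat, df)
print(stat, df, p)
```

For the toy table the statistic is 2/3 on 1 degree of freedom, with p roughly 0.41. Every expected count there is 1.5, far below the rule-of-thumb minimum of 5, so the number only illustrates the mechanics.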


Thank you.