Re: How to Compare the Two Non-Normal Datasets?



If I understand you correctly, the (-200, +200) interval contains
about 95% of the 5*10^5 observed differences. If so, then the 95%
interval for the mean should be about (-200, +200)/sqrt(5*10^5) =
(-.28, +.28).
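
For anyone who wants to check the arithmetic, here is a minimal sketch.
It assumes the differences are independent, so that the spread of the
mean shrinks by a factor of sqrt(n); the variable names are mine and
purely illustrative.

```python
import math

# Figures quoted in the thread: about 95% of the observed differences
# fall within +/-200, and there are 5*10^5 pairs.
n = 5 * 10**5
half_width_obs = 200.0

# Assumption: independent differences, so the interval for the mean is
# the interval for single observations divided by sqrt(n).
half_width_mean = half_width_obs / math.sqrt(n)

print(f"approx. 95% interval for the mean: +/-{half_width_mean:.2f}")
```

If the differences are correlated within locations or times, the real
interval will be wider than this.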

Grouping the data and doing some sort of analysis of variance might
help you identify factors that affect agreement.
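
As a rough illustration of what "grouping and doing an analysis of
variance" could look like, the sketch below computes a one-way ANOVA
F statistic on differences grouped by a single factor. The function
name `one_way_f` and the toy numbers are made up for the example; they
are not the poster's data, and a real analysis would probably involve
more than one factor.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, (df_between, df_within)) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, (df_between, df_within)

# Hypothetical differences (Xi - Yi) at three locations.
by_location = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df = one_way_f(by_location)
print(f_stat, df)
```

A large F relative to the F distribution with those degrees of freedom
would suggest that the factor affects the mean difference.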

On Mar 17, 2:16 am, qqu...@xxxxxxxxxxx wrote:
Thank you both, Ray and Rich, for your replies and for correcting my
understanding/expression.

Yes, I am more interested in the differences, as the data are naturally
paired. When I sort the difference data [(Xi - Yi), i = 1, 2, ..., n]
into equal-size bins of 10, the distribution does resemble a mound
shape, though it is not symmetric and does not vanish on both sides
as quickly as you would wish.

The minimum of the difference is around -900 and maximum around +900.
The actual 95% interval is roughly (-200, +200). I certainly wish
it could be 10 times or even 20 times narrower than that.

And yes, I can divide the data into groups according to location,
time, and a number of other factors, and it may become easier to do
the analysis group by group. I would hope the variability within a
group is caused only by stochastic factors.

Nonetheless, can I treat all the data as just one single huge group
and do more than just computing the mean of the difference and a
confidence interval (such as a 95% one)?

Thank you again!

--Roland

On Mar 17, 2:52 am, Ray Koopman <koop...@xxxxxx> wrote:

On Mar 16, 6:55 pm, qqu...@xxxxxxxxxxx wrote:

When I say "AGREE", I mean accepting the null hypothesis that
the two datasets have the same mean based on a test at a given
significance level (say alpha).

Unfortunately, significance tests don't work that way. When you
test at a given alpha level you know that the probability of
rejecting a true hypothesis is at most alpha, but when you fail
to reject you do not know what the probability of being wrong is.

The best you're going to be able to get is a confidence interval
for the mean difference. And as Rich has pointed out, for paired
data it's the differences that are assumed to be normal.
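
A minimal sketch of such an interval, computed from the paired values
themselves, might look like this. The function name `paired_mean_ci`
is illustrative, and the normal critical value 1.96 is an assumption
that is only reasonable for large n (as with 500K pairs) and
independent differences; for small samples a t critical value would be
used instead.

```python
import math
from statistics import mean, stdev

def paired_mean_ci(x, y, z=1.96):
    """Approximate CI for the mean of the paired differences x[i] - y[i]."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    d_bar = mean(diffs)
    # Standard error of the mean difference (sample SD over sqrt(n)).
    se = stdev(diffs) / math.sqrt(len(diffs))
    return d_bar, (d_bar - z * se, d_bar + z * se)

# Toy paired data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [0.0, 0.0, 0.0, 0.0]
d_bar, (lo, hi) = paired_mean_ci(x, y)
print(d_bar, lo, hi)
```

Only the differences enter the calculation, which is why it is their
distribution, not that of X or Y separately, that matters.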

To follow up on Rich's comment about df: can the 500K pairs be
grouped? You mentioned "dozens of locations" and "different times".
Should the data be organized in a Location by Time table, or
something even more complicated, rather than just an unstructured
one-dimensional list?