Re: Everything You Ever Want to Know about Distributions of Correlation Coefficients
- From: "m00es" <m00es@xxxxxxxxx>
- Date: 25 Sep 2006 02:36:57 -0700
Reef Fish wrote:
> In the latest rounds I told m00es that when one of the X, Y variables
> is dichotomous (value of 0 and 1 only), with his
> test statistic S = r * sqrt(n-2) / sqrt(1-r^2 )
> to test the null hypothesis Ho: rho = 0 , the sampling distribution
> of S is NO LONGER distributed as a T with (n-2) d.f.
It DOES have a t-distribution with n - 2 degrees of freedom. The
distribution of r under the assumption that rho = 0 is the SAME whether
the two variables are bivariate normal or when only one of the
variables is normal and the other may have any distribution or any
fixed set of values. See, for example:
Hotelling, H. (1953). New light on the correlation coefficient and its
transforms. Journal of the Royal Statistical Society, Series B, 15(2),
193-232.
Let me quote from page 196, where Hotelling gives the distribution of r
under rho = 0. "For the specific result with rho = 0 it is not
necessary to assume a bivariate normal distribution." One of the
samples needs to come from "a univariate normal distribution", while
the other "may have any distribution or any fixed set of values".
Since r has the SAME distribution under either condition, then r *
sqrt(n-2) / sqrt(1-r^2 ) will have the SAME distribution under either
condition. Therefore, r * sqrt(n-2) / sqrt(1-r^2) ~ t(n-2) when rho =
0.
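
A quick numerical check of this claim is possible with a small simulation. The sketch below is only an illustration under stated assumptions, not a proof: one variable is a fixed set of 0/1 values (8 ones and 12 zeros, chosen arbitrarily), the other is drawn independently from a standard normal, so rho = 0. The function name rejection_rate, the sample size n = 20, and the hardcoded critical value 2.101 (the two-sided 5% table value for t with 18 d.f.) are choices made for this example. If S ~ t(n-2), the fraction of samples with |S| > 2.101 should be close to 0.05.

```python
import math
import random

# Two-sided 5% critical value of t with 18 d.f. (standard table value).
# It matches n = 20 only; change it if the sample size changes.
T_CRIT_18 = 2.101


def pearson_r(x, y):
    """Ordinary Pearson product-moment correlation."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rejection_rate(x_fixed, reps=20000, seed=1):
    """Fraction of simulated samples with |S| above the t(18) critical value.

    x_fixed is held constant across replications; y is redrawn as
    independent N(0, 1) noise each time, so the null rho = 0 is true.
    """
    rng = random.Random(seed)
    n = len(x_fixed)
    rejections = 0
    for _ in range(reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r = pearson_r(x_fixed, y)
        s = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
        if abs(s) > T_CRIT_18:
            rejections += 1
    return rejections / reps


# Hypothetical fixed dichotomous variable: 8 ones, 12 zeros (n = 20).
x_fixed = [1] * 8 + [0] * 12
rate = rejection_rate(x_fixed)
print("empirical rejection rate at nominal 5%:", rate)
```

A rate near 0.05 is consistent with S following t(18) here; a rate far from 0.05 would contradict it. The same check can be repeated with other fixed 0/1 patterns, provided both values occur at least once.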
I will give you another reference for this result. See:
Hogg, R. V., & Craig, A. T. (1995). Introduction to mathematical
statistics (5th ed.).
On pages 478-480, the authors derive the distribution of r under the
bivariate normal assumption and show that under rho = 0, r * sqrt(n-2)
/ sqrt(1-r^2 ) is distributed t(n-2). Now, on page 480, the authors
mention EXPLICITLY that a careful review of their proof reveals that
nowhere was it necessary to assume that the two variables are bivariate
normal. Only one of the variables must be normal.
The reference from Tate is:
Tate, R. F. (1954). Correlation between a discrete and a continuous
variable: Point-biserial correlation. Annals of Mathematical
Statistics, 25, 603-607.
What Tate does is derive the asymptotic distribution of the BISERIAL
correlation, which is something DIFFERENT from the POINT-BISERIAL
correlation. Therefore, Tate is completely IRRELEVANT to this issue. It
is unfortunate that Tate put "point-biserial correlation" in the title,
because he is actually discussing the BISERIAL correlation.
Let me give you ANOTHER reference supporting my argument:
Kendall, M., & Stuart, A. (1979). The advanced theory of statistics:
Vol 2, Inference and relationship (4th ed.).
On pages 331-332, the authors discuss the point-biserial correlation
(and on the pages before that, they discuss the biserial correlation
and also give the result from Tate, 1954, which is irrelevant). Let me
quote from page 332:
"Apart from the measurement of correlation, it is clear from (26.77)
that, in effect, for a point-biserial situation, we are simply
comparing the means of two samples of a variate x, the y-classification
being no more than a labelling of the samples. In fact
r^2 / (1 - r^2) = t^2 / (n - 2),
where t is the usual "Student's" t-test for comparing the means of two
normal populations with equal variance. Thus if the distribution of x
is normal for y = 0, 1, the point-biserial coefficient is a simple
transformation of the t^2 statistic, which may be used to test it."
This is EXACTLY what I have been saying all along. Note that we can
take the equation above and rewrite it as t = r * sqrt(n - 2) / sqrt( 1
- r^2 ). Moreover, note how Kendall and Stuart also call x the
continuous and y the dichotomous variable. That's perfectly fine, since
they defined them as such. Moreover, they EXPLICITLY point out how
testing rho = 0 in this setup is the same as testing the equality of
the means.
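
The identity between the two statistics can be checked directly on any data set. The sketch below is a minimal illustration with made-up numbers: x is the continuous variable, y is the 0/1 label, and the function names (pearson_r, t_from_r, pooled_t) are introduced here only for the example. It computes t once from r and once as the ordinary pooled-variance two-sample t statistic and compares them.

```python
import math


def pearson_r(x, y):
    """Ordinary Pearson product-moment correlation."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def pooled_t(x, y):
    """Equal-variance two-sample t statistic for mean(x | y=1) - mean(x | y=0)."""
    g1 = [a for a, b in zip(x, y) if b == 1]
    g0 = [a for a, b in zip(x, y) if b == 0]
    n1, n0 = len(g1), len(g0)
    m1, m0 = sum(g1) / n1, sum(g0) / n0
    ss_within = sum((a - m1) ** 2 for a in g1) + sum((a - m0) ** 2 for a in g0)
    sp2 = ss_within / (n1 + n0 - 2)  # pooled variance estimate
    return (m1 - m0) / math.sqrt(sp2 * (1 / n1 + 1 / n0))


# Made-up illustrative data: x continuous, y the dichotomous label.
x = [4.1, 5.3, 3.8, 6.0, 5.1, 7.2, 6.8, 5.9, 7.5, 6.1]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

t_r = t_from_r(pearson_r(x, y), len(x))
t_means = pooled_t(x, y)
print(t_r, t_means)  # the two values agree up to floating-point rounding
```

The agreement is algebraic, so it holds for any data in which both groups are non-empty and x is not constant within both groups; it does not depend on x being normal. Normality matters only for the claim that the statistic follows t(n-2).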
Reef Fish wrote:
> Of particular interest is that Kowalski (1972) pointed out in an
> extensive historical survey, "A review of the literature revealed
> an approximate equal dichotomy of opinion. For every study
> indicating the robustness of the distribution of R, one could
> cite another claiming to show just the opposite."
I have that paper right in front of me. Kowalski reviews the literature
that examines what happens when neither of the two variables is normal
and/or what happens when rho is not equal to zero. When rho = 0, then
the distribution of r is the same whether the two variables are
bivariate normal and also when only one of the two variables is normal.
Reef Fish wrote:
> The paper of Hotelling (1953) was cited, but there's no mention
> of the point-biserial correlation having a T distr. with (n-2) d.f.
No, Hotelling (1953) does not mention explicitly that r * sqrt(n - 2) /
sqrt( 1 - r^2 ) has a t distribution under rho = 0 when calculating a
point-biserial correlation. Hotelling (1953) discusses the fact that r
will have the same distribution under rho = 0 when the variables are
bivariate normal and also when only one of the two variables is normal and
the other "may have any distribution or any fixed set of values". Since
the distribution of r is the same under both conditions, r * sqrt(n -
2) / sqrt( 1 - r^2 ) will have the same distribution under both
conditions when rho = 0. And that is t with n - 2 degrees of freedom.
m00es