Re: Approximation of correlations

From: Richard Ulrich (Rich.Ulrich_at_comcast.net)
Date: 08/29/04


Date: Sun, 29 Aug 2004 15:07:04 -0400

On Sun, 29 Aug 2004 14:34:55 GMT, Lou Pirog <lpirog@comcast.net>
wrote:

> Rich and Ray, thanks to both of you for your responses.
>
> Ray, I will follow up on the eigenvalues and establishing a criterion to
> maximize (like the determinant; are there other examples of criteria?). It
> sounds like this approach could establish some boundaries on the missing
> values.
>
> Rich, the question is most definitely still open. You're exactly right
> in your determination of what information is currently available
> (i.e., B sets of VxV matrices and V sets of BxB matrices). What's
> ultimately needed is a (BxV) x (BxV) matrix (as Ray had placed at the end
> of his note) where each row/column intersection is for a particular pair
> of B/V combinations, say B(1)/V(5) and B(2)/V(3).
>
> I've added
> comments/questions to portions of your reply below:

Okay, I'm better oriented, but I'm not sure yet what
Ray was writing about, or if you and I are on the same page.

Here, I will expand on what I was saying about correlations
in discriminant function - in case that fits.

Last night, I wondered what meaning there could be in
certain cross-'correlations' but I see that I had already
given one specific context for understanding that, from
the D.F. example. I will show that, below.

I'm snipping the rest of the post, which raised some questions
one at a time. Also, I am borrowing Ray's layout for
correlations, since you are using it, too.

Here is Ray's layout, though I have changed some entries,
and I offer a different exposition. (Use Fixed font to view.)

Businesses X, Y.
Variables a, b, c.
Dots "." represent symmetrical entries.

            X          Y
         a  b  c    a  b  c
     a   1  r  r    1  r  r
 X   b   .  1  r    .  1  r
     c   .  .  1    .  .  1

     a   .  .  .    1  r  r
 Y   b   .  .  .    .  1  r
     c   .  .  .    .  .  1

I'm considering the model of two-group discriminant
function (DF), and then a bit more.

Consider in the table the intersection of X-X and Y-Y.
It is simple to say that these can represent the
correlations within the subsamples X and Y.

The "within-groups" correlation matrix that SPSS gives
is then *one* thing that could be denoted as the
'intersection' of X and Y, representing the pooled, separate
correlations. DF does its pooling of the sums of squares
and cross-products, if I recall correctly. The SS assume
a common variance, and use the separate means. If the
correlations are zero in each group, then the pooled
correlation will also be zero, despite an overall
non-zero correlation that could be induced by differences
in means.
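
A minimal sketch of that pooling, assuming the sums of squares and
cross-products are simply added across groups, each taken about its
own group's means. The data and the names cross_products and
pooled_within_r are made up for illustration; this is not a
reproduction of what SPSS does internally.

```python
from math import sqrt

def cross_products(xs, ys):
    """Return (SSx, SSy, SPxy) about the sample's own means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    spxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return ssx, ssy, spxy

def pooled_within_r(groups):
    """Sum each group's SS and cross-products, then form one correlation."""
    ssx = ssy = spxy = 0.0
    for xs, ys in groups:
        a, b, c = cross_products(xs, ys)
        ssx, ssy, spxy = ssx + a, ssy + b, spxy + c
    return spxy / sqrt(ssx * ssy)

# Made-up data: variables a and b are uncorrelated inside each business,
# but business Y sits 10 units higher on both variables.
x_a, x_b = [1.0, 1.0, -1.0, -1.0], [1.0, -1.0, 1.0, -1.0]
y_a, y_b = [v + 10 for v in x_a], [v + 10 for v in x_b]

r_within = pooled_within_r([(x_a, x_b), (y_a, y_b)])
print(r_within)  # zero: the shift in means never enters the pooled SS
```

Because each group is centered on its own means before pooling, the
10-unit gap between the businesses contributes nothing here.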

(There are different notions of pooling that could be
used for unequal Ns and unequal variances. Just average
the correlations? I will skip that complexity here.)

The "total-groups" correlation matrix is another thing
that could be denoted by the 'union' of X and Y, achieved
by concatenating the two groups and computing r's. Group
differences can induce correlations: For instance,
height and vocabulary may have small correlations among
students 8 years old or those 16 years old. However, the
total pooled set shows that the taller students know more
words -- thus, a large r.
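
The height and vocabulary illustration can be sketched with invented
numbers. Everything below (the scores, the group sizes, the helper
pearson_r) is hypothetical; it only shows how concatenating two
groups with different means can produce a large r from small
within-group r's.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Plain Pearson correlation about the means of the data passed in."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    spxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return spxy / sqrt(ssx * ssy)

# Invented data: height in cm and a vocabulary score.
age8_height = [125.0, 128.0, 131.0, 134.0]
age8_vocab = [21.0, 18.0, 22.0, 19.0]
age16_height = [165.0, 168.0, 171.0, 174.0]
age16_vocab = [61.0, 58.0, 62.0, 59.0]

r_age8 = pearson_r(age8_height, age8_vocab)
r_age16 = pearson_r(age16_height, age16_vocab)

# "Total-groups": concatenate the two groups, then compute r.
r_total = pearson_r(age8_height + age16_height, age8_vocab + age16_vocab)

print(round(r_age8, 2), round(r_age16, 2), round(r_total, 2))
```

With these numbers the two within-age r's are about -0.14, while the
concatenated r is about 0.98.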

Now, the differences in means can also be expressed as r's,
of a sort, derived from the t-tests between groups. Those t's or r's
are what account for the differences between the two versions
of correlations that I just described. The DF analysis
is satisfied to present the univariate tests between the
groups, without ever bothering to generate what might be
called - by analogy with ANOVA - the "Between-groups"
correlation matrix. That would be defined, I imagine,
in some fashion by subtraction of the other two matrices,
probably using sums of squares rather than r's.
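
One way the subtraction might look, as a sketch only: take the sums
of squares and cross-products (SS and SP) for the concatenated data,
subtract the pooled within-groups SS and SP, and call the remainder
"between". The data and the helper sscp are made up; whether this
is the most useful definition of a between-groups matrix is an
assumption, not something DF output confirms.

```python
from math import sqrt

def sscp(xs, ys):
    """Return (SSx, SSy, SPxy) about the means of the data passed in."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    spxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return ssx, ssy, spxy

# Made-up data: uncorrelated inside each business, Y shifted up by 10.
x_a, x_b = [1.0, 1.0, -1.0, -1.0], [1.0, -1.0, 1.0, -1.0]
y_a, y_b = [v + 10 for v in x_a], [v + 10 for v in x_b]

total = sscp(x_a + y_a, x_b + y_b)                      # grand means
within = tuple(p + q for p, q in zip(sscp(x_a, x_b),    # group means
                                     sscp(y_a, y_b)))
between = tuple(t - w for t, w in zip(total, within))   # remainder

r_between = between[2] / sqrt(between[0] * between[1])
print(total, within, between, r_between)
```

With only two groups the between part is built from a single
difference in means per variable, so the "between" r can only be
+1 or -1 (or undefined when a variable's means coincide). That may
be one reason the univariate tests are a more informative summary
in the two-group case.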

It probably could be computed alternatively by using the
two r's or R-squareds from the simple tests on means, along
with one matrix of the two, using formulas for partial
or multiple correlation. But it seems more compact and
more intelligible - for most purposes - to show the within
matrix and the total matrix, and then to use the t-tests
instead of generating the between matrix, which encodes
the amount of confounding that exists in the pooled r's.
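
For the "r's of a sort" that go with the t-tests, a small sketch
using the standard identity r = t / sqrt(t^2 + df) for a two-group,
pooled-variance t-test. The data and the function names are made up;
the check at the end compares the result with the ordinary
correlation between the variable and a 0/1 group indicator.

```python
from math import sqrt

def pooled_t(g1, g2):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((v - m1) ** 2 for v in g1)
    ss2 = sum((v - m2) ** 2 for v in g2)
    df = n1 + n2 - 2
    se = sqrt((ss1 + ss2) / df * (1 / n1 + 1 / n2))
    return (m2 - m1) / se, df

def r_from_t(t, df):
    """Correlation between the variable and group membership."""
    return t / sqrt(t * t + df)

def pearson_r(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    spxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return spxy / sqrt(ssx * ssy)

# Made-up scores on variable a for businesses X and Y.
x_a = [3.0, 5.0, 4.0, 6.0]
y_a = [7.0, 9.0, 6.0, 10.0]

t, df = pooled_t(x_a, y_a)
r_t = r_from_t(t, df)

# Same quantity computed directly against a 0/1 group indicator.
r_dummy = pearson_r(x_a + y_a, [0.0] * len(x_a) + [1.0] * len(y_a))
print(round(t, 3), df, round(r_t, 4), round(r_dummy, 4))
```

The two routes agree, which is the sense in which a t-test on the
means carries the same information as an r.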

How does this extend to multiple variables? DF uses a
simple within-pooling of multiple groups, and shows the
univariate tests. I think SPSS does not show the original
r's, but that could be an option of any package's presentation.

The original post could be asking for (as I see it) a
set of computations similar to those I have just described,
done for each and every pair of groups. I wonder what
it could be needed for.
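
If that reading is right, the bookkeeping might look like the sketch
below: for every pair of businesses, one pooled-within r and one
total r for a given pair of variables. The three businesses, their
numbers, and the helper names are all hypothetical.

```python
from itertools import combinations
from math import sqrt

def sums(xs, ys):
    """Return (SSx, SSy, SPxy) about the means of the data passed in."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    spxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return ssx, ssy, spxy

def r_from(ss):
    ssx, ssy, spxy = ss
    return spxy / sqrt(ssx * ssy)

# Hypothetical businesses, each with paired values on variables a and b.
businesses = {
    "X": ([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]),
    "Y": ([6.0, 7.0, 8.0, 9.0], [5.0, 8.0, 6.0, 9.0]),
    "Z": ([2.0, 4.0, 6.0, 8.0], [9.0, 7.0, 8.0, 6.0]),
}

results = {}
for (n1, (a1, b1)), (n2, (a2, b2)) in combinations(businesses.items(), 2):
    within = [p + q for p, q in zip(sums(a1, b1), sums(a2, b2))]
    total = sums(a1 + a2, b1 + b2)
    results[(n1, n2)] = (r_from(within), r_from(total))

for pair, (r_w, r_t) in results.items():
    print(pair, round(r_w, 3), round(r_t, 3))
```

This gives a within and a total r per pair of groups, but it does not
by itself produce the cross-'correlation' between, say, variable a in
X and variable b in Y.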

I'm curious as to how much of this is helpful -

-- 
Rich Ulrich, wpilib@pitt.ed
http://www.pitt.edu/~wpilib/index.html