Re: Logistic Regression
- From: clemenr@xxxxxxxxxx
- Date: 29 May 2005 04:02:06 -0700
G Robin Edwards wrote:
> In article <1116979574.815048.167770@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
> Ray Koopman <koopman@xxxxxx> wrote:
> > Even an ordinary least-squares linear regression will completely
> > separate the 0s and 1s, so the usual likelihood-maximizing logistic
> > iterations should not converge.
>
> He was commenting on the data supplied by Ross-c concerning an
> examination of writing style.
>
> Out of curiosity (because I don't understand all of what's already been
> posted on this thread!) I've simply looked at the x data columns
> individually, in pairs and as a set.
>
> For single columns, x6 has one item (#5) that is far removed from all
> the others. x4, row 5 is also noticeably adrift from the remainder.
> Presumably there are good technical reasons or expectations for this.
The numbers are the frequencies of words and punctuation symbols in
text. One reason for outliers is a conscious decision by the author to
use a particular writing style for a particular piece of text. For
example, there is a long-running argument over whether a particular
poem, "Shall I Die", was written by Shakespeare. This poem is about
500 words long, and never uses the word "the". "The" is such a common
word that if it's absent from a 500-word poem, this is almost certainly
a deliberate choice. The topic of a book might also affect usage. E.g.
a book written from the viewpoint of an animal (not Disney-style) might
show a distinct lack of quotation marks.
> Pairwise, x11 is not correlated with any other x, nor is x6
I haven't looked carefully into correlations in word usage, and I'm
not sure exactly what other people have done. However, it would
probably be better to use larger amounts of data. If you actually
wanted to look into which words are, or are not, correlated with each
other I can provide you with much larger data sets.
> Highly correlated column pairs are 13,17 3,5 2,9 and 2,10 whilst y
> correlates well with x1, x2, x3, x5 and x9.
Again, I can send you data for 870 texts and about 120,000 attributes
if you want. Most of these attributes will be 0 for most texts.
> Computing VIFs for this group gives values between 5.8 and 15.2, so
> there's clear evidence of multiple correlations that might be a
> hindrance to standard multiple regression analysis. Choosing a large
> set of the x columns produces very large VIFs.
I have played around with using a Mahalanobis distance based K-Nearest
Neighbours classifier. It didn't work very well in comparison to other
approaches. Using Euclidean distance instead of Mahalanobis distance
gave higher performance. This was the case even when there were
considerable variations in the amount of correlation between variables.
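
A minimal sketch of how such a comparison could be set up in R, for anyone who wants to try it. This is not the code used for the experiments described above: the function name knn_predict, the choice of the whole training set's covariance matrix for the Mahalanobis version, and the toy data are all assumptions made for illustration.

```r
# k-nearest-neighbour prediction with a choice of distance.
# knn_predict and its arguments are hypothetical names for this sketch.
knn_predict <- function(train, labels, test, k = 3,
                        metric = c("euclidean", "mahalanobis")) {
  metric <- match.arg(metric)
  train <- as.matrix(train)
  test <- as.matrix(test)
  # Assumption: Mahalanobis distance uses the covariance of the whole
  # training set; Euclidean is the special case of an identity matrix.
  S <- if (metric == "mahalanobis") cov(train) else diag(ncol(train))
  apply(test, 1, function(row) {
    d2 <- mahalanobis(train, center = row, cov = S)  # squared distances
    nearest <- order(d2)[seq_len(k)]                 # k closest training rows
    votes <- table(labels[nearest])
    names(votes)[which.max(votes)]                   # majority vote
  })
}

# Toy data, made up for illustration only.
train <- rbind(c(0.0, 0.1), c(0.2, 0.0), c(0.1, 0.3),
               c(1.0, 1.1), c(1.2, 0.9), c(0.9, 1.2))
labels <- c("A", "A", "A", "B", "B", "B")
test <- rbind(c(0.1, 0.1), c(1.1, 1.0))

pred_euclid <- knn_predict(train, labels, test, metric = "euclidean")
pred_mahal  <- knn_predict(train, labels, test, metric = "mahalanobis")
```

With many sparse attributes and few texts the covariance matrix may be singular or badly estimated, in which case the Mahalanobis branch of this sketch would fail or behave erratically.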
> Thus I wonder what could be expected from inferences from the advanced
> techniques that Ross-c has used. Might they also be compromised in the
> way that naive regression would probably be?
I'm going to try logistic regression on the principal components of the
data. So far I haven't "gotten around" to doing much with PCA and
authorship. In the past decade or so, work on authorship attribution
using PCA has been popular, for example the work of John Burrows of
Newcastle University in Australia. Professor Burrows himself has moved
away from the use of PCA, and invented a technique called Delta, which
uses a simple sum of absolute differences of z-scores for word
frequencies to measure the distance between texts.
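
The Delta measure as described above can be sketched in a few lines of R. This follows only the one-sentence description in this post (z-score each word's frequency across the corpus, then sum the absolute differences); the function name delta_distance and the frequency table are made up for illustration, and details of the published method such as word selection or averaging are not covered.

```r
# Delta-style distance between two texts, following the description above:
# z-score each word's frequency across the corpus, then sum the absolute
# differences of the z-scores. delta_distance is a hypothetical name.
delta_distance <- function(freqs, i, j) {
  z <- scale(freqs)            # column-wise z-scores (mean 0, sd 1)
  sum(abs(z[i, ] - z[j, ]))
}

# Made-up relative frequencies: rows are texts, columns are words.
freqs <- rbind(
  text1 = c(the = 0.060, and = 0.030, of = 0.025),
  text2 = c(the = 0.058, and = 0.031, of = 0.027),
  text3 = c(the = 0.040, and = 0.045, of = 0.015)
)

d12 <- delta_distance(freqs, "text1", "text2")
d13 <- delta_distance(freqs, "text1", "text3")
```

In this toy table text1 and text2 have similar profiles, so d12 comes out smaller than d13. Dividing the sum by the number of words gives a mean instead, which does not change the ranking of texts for a fixed word list.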
I haven't been able to think up experiments to investigate this, but I
suspect that PCA is not the right approach for dimensionality reduction
in authorship analysis. When I did some rough experiments on the
principal components of word length frequencies (note change of data)
in text, there seemed to be little relationship between the proportion
of variance explained by components and the results I got from crude
information theoretic measures of how good components were at
distinguishing data. I suspect (wildly conjecture) that, unlike the
textbook examples of PCA where the components we're looking for are the
major factors affecting the results (e.g. student quality and course
difficulty in explaining exam results), variations due to author style
are only a minor factor in explaining the frequency of words in text.
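
The conjecture can at least be illustrated on synthetic data. The sketch below builds three variables in which a large "topic" effect, unrelated to author, dominates the variance, while the author effect is small and sits on a separate variable. Everything here is an assumption chosen to make the point (the variable names, effect sizes and the crude separation score); it says nothing about real texts.

```r
set.seed(1)
n <- 200
author <- rep(c(0, 1), each = n / 2)

# Assumption: a large "topic" effect unrelated to author, plus a small
# author effect on a different variable.
topic <- rnorm(n, sd = 5)
x <- cbind(v1 = topic + rnorm(n, sd = 0.2),
           v2 = topic + rnorm(n, sd = 0.2),
           v3 = author + rnorm(n, sd = 0.3))

pca <- prcomp(x)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Crude separation score per component: gap between the class means
# of the scores, in units of that component's standard deviation.
separation <- apply(pca$x, 2, function(s) {
  abs(mean(s[author == 1]) - mean(s[author == 0])) / sd(s)
})

round(rbind(var_explained, separation), 3)
```

In this construction the first component carries almost all of the variance but does little to separate the two authors, so keeping components by variance explained alone would favour the wrong direction.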
> I'm merely asking for enlightenment, of course - not criticising or
> trying to help!
Actually, I don't mind criticism of any kind.
> I also noted in passing that the x data set (x0 excluded of course)
> seems to have only 3 principal components greater than 1. Might this
> be expected of this type of data?
I did a quick hack with a larger set of data. 5 attributes, but from
870 texts by 87 authors, and got the following:
> x <- read.csv( "bdata.csv" )
> x.pca <- princomp( x )
> summary( x.pca )
Importance of components:
                           Comp.1     Comp.2      Comp.3      Comp.4
Standard deviation     0.01915717 0.01172167 0.009182022 0.006375846
Proportion of Variance 0.56204629 0.21042055 0.129117774 0.062256548
Cumulative Proportion  0.56204629 0.77246684 0.901584612 0.963841160
                            Comp.5
Standard deviation     0.004859063
Proportion of Variance 0.036158840
Cumulative Proportion  1.000000000
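
For reference, the "Proportion of Variance" row is just each squared standard deviation divided by the sum of the squared standard deviations. A short check, with the five standard deviations copied from the summary output above (the variable names are arbitrary):

```r
# Standard deviations copied from the summary output above.
sdev <- c(0.01915717, 0.01172167, 0.009182022, 0.006375846, 0.004859063)

prop_var <- sdev^2 / sum(sdev^2)   # share of total variance per component
cum_prop <- cumsum(prop_var)       # running total

round(rbind(prop_var, cum_prop), 4)
```

Note that princomp() works on the covariance matrix unless cor = TRUE is given, so these standard deviations are on the scale of the raw frequencies. A count of components "greater than 1" is normally made on the eigenvalues of the correlation matrix, so princomp( x, cor = TRUE ) would be the like-for-like comparison.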
Ooops! Got to go out now. Must finish.
Cheers,
Ross-c