Re: binomial 'association' measure?

From: Dan Bolser (dmb_at_mrc-dunn.cam.ac.uk)
Date: 10/14/04


Date: Thu, 14 Oct 2004 21:13:05 +0100
To: Graham Jones <grahamj@visiv.co.uk>

On Thu, 14 Oct 2004, Graham Jones wrote:

>In article <Pine.LNX.4.21.0410111922380.384-100000@mail.mrc-
>dunn.cam.ac.uk>, Dan Bolser <dmb@mrc-dunn.cam.ac.uk> writes
>
>>Here is my problem (after much rethinking the data / what I want to
>>ask)...
>>
>[...]
>
>>Over the superfamilies we can impose the 'taxonomic tree of life', i.e. a
>>distinct and given (immutable) hierarchical set of groupings for the
>>genomes. The 'root' of the 'tree of life' encompasses all genomes, and low
>>level groupings go all the way down to the individual species.
>>
>
>I'm not clear about this. You say the hierarchical set of groupings
>(which I take to mean a hierarchical clustering) is 'over' the
>superfamilies but 'for' the genomes.
>
>I assume you mean the groupings (clusters) are of genomes, because you
>seem to think it makes sense to look at one superfamily at a time. In

Yes, sorry, I made this mistake a couple of times and managed to correct
it in a few places. The above slipped through.

The hierarchical clustering of *genomes* is what I mean. I didn't use the
term hierarchical clustering because I wasn't sure if this was precisely
correct given the data, so I called the clusters 'groupings'.

The data is a tree. Each genome is a leaf, and there is only one
root. There is only one path from a leaf to the root (i.e. this is a
strict tree, not a more general DAG).
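
As a minimal sketch of that structure (the node and genome names below are
made up for illustration, not taken from the real data), the tree can be
held as a child-to-parent map, which makes the 'only one path to the root'
property explicit:

```python
# Hypothetical toy taxonomy: every node maps to its single parent.
# Having exactly one parent per node is what makes this a tree.
PARENT = {
    "root": None,
    "bacteria": "root",
    "eukaryota": "root",
    "mammal": "eukaryota",
    "genome_A": "bacteria",
    "genome_B": "bacteria",
    "genome_C": "mammal",
    "genome_D": "mammal",
}


def path_to_root(node, parent=PARENT):
    """Return the unique list of nodes from `node` up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def leaves(parent=PARENT):
    """Genomes are the leaves: nodes that are nobody's parent."""
    internal = set(parent.values())
    return sorted(n for n in parent if n not in internal)
```

Every grouping a genome belongs to is then just a node on its path, e.g.
path_to_root("genome_C") walks up through 'mammal' and 'eukaryota' to the
root.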

>other words, your data for one superfamily (let's say #42) looks like
>genome A has a assignments
>genome B has b assignments
>....
>
>plus
>
>some clusters like {A,D,H}, {B,C}, etc, at various levels in the
>hierarchy. Am I with you so far?

Yes, exactly correct.

>I am also unclear as to whether you want to look at a complete cut
>through the hierarchy, or just one cluster at a time.

The tree isn't very uniform. I was trying to think about cutting the tree
based on 'information content' of each node, as defined by the 'assignment
space covered' by that node. By this I mean the negative log of the
proportion of the assignments which fall under that node.

NB by 'assignment' I mean specifically the superfamily assignments to the
genomes.
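
A rough sketch of that 'information content', under some assumptions: the
tree, the genome names and the counts are invented, the log base (2) is an
arbitrary choice, and whether the counts are for one superfamily or pooled
over all of them is left open.

```python
import math

# Hypothetical toy tree (parent -> children) and assignment counts per genome.
CHILDREN = {
    "root": ["bacteria", "mammal"],
    "bacteria": ["genome_A", "genome_B"],
    "mammal": ["genome_C", "genome_D"],
}
ASSIGNMENTS = {"genome_A": 1, "genome_B": 1, "genome_C": 4, "genome_D": 2}


def assignments_under(node):
    """Total number of assignments on the genomes (leaves) below `node`."""
    if node in ASSIGNMENTS:  # a leaf, i.e. a genome
        return ASSIGNMENTS[node]
    return sum(assignments_under(child) for child in CHILDREN[node])


def information_content(node):
    """Negative log2 of the proportion of all assignments under `node`."""
    total = assignments_under("root")
    under = assignments_under(node)
    if under == 0:
        return math.inf  # a node covering nothing is infinitely 'surprising'
    return -math.log2(under / total)
```

With these toy counts 'bacteria' covers 2 of the 8 assignments and scores
2 bits, while the root covers everything and always scores 0, whatever the
data.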

Cutting has the problem that I don't 'see' universally distributed
superfamilies, that is superfamilies which are best 'described' by the
root node.

>Finally, I am unclear as to whether you want to 'prove' something (which
>looks tricky to me) or just sift through your data in the hope of
>finding something interesting.

I hope to assign each superfamily to a particular part of the taxonomic
tree, like 'universal', 'mammal', 'bacteria', etc... This data will be
very useful to me in order to frame many questions about both superfamily
and taxonomic evolution.

>[...]
>>Which higher level grouping (or groupings) *best* explains the observed
>>distribution of a superfamily over the genomes?
>>
>
>A catch here is that the lower the level of grouping, the better the
>explanation will be. The grouping {A}, {B}, {C},... will explain the
>data perfectly.

Yes, so somehow I need to minimize the model, and offset that minimization
against error.

That is what I mean by *best*.
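
One possible way to make that trade-off concrete, purely as an illustration
and not a method settled on in this thread: describe a superfamily by a set
of tree nodes, predict 'present' for every genome under a chosen node, and
charge a fixed penalty per node plus one unit per misclassified genome. The
tree, the presence/absence data (counts are ignored here) and the penalty
values are all assumptions.

```python
# Hypothetical toy tree and presence/absence data for one superfamily.
CHILDREN = {
    "root": ["bacteria", "mammal"],
    "bacteria": ["genome_A", "genome_B"],
    "mammal": ["genome_C", "genome_D", "genome_E", "genome_F"],
}
PRESENT = {
    "genome_A": False, "genome_B": False,
    "genome_C": True, "genome_D": True, "genome_E": True, "genome_F": False,
}


def count_absent(node):
    """Number of genomes under `node` that lack the superfamily."""
    if node in PRESENT:
        return 0 if PRESENT[node] else 1
    return sum(count_absent(child) for child in CHILDREN[node])


def best_description(node, penalty):
    """Cheapest (cost, chosen_nodes) for the subtree rooted at `node`.

    cost = penalty * (number of chosen nodes) + (misclassified genomes)
    """
    # Option 1: choose this node, predicting 'present' for all its genomes.
    cover = (penalty + count_absent(node), [node])
    # Option 2: do not choose it; describe what lies below it instead.
    if node in PRESENT:
        skip = (1 if PRESENT[node] else 0, [])
    else:
        parts = [best_description(child, penalty) for child in CHILDREN[node]]
        skip = (sum(cost for cost, _ in parts),
                [n for _, nodes in parts for n in nodes])
    return min(cover, skip, key=lambda option: option[0])
```

On this toy data a penalty of 0.75 per node picks 'mammal' and accepts one
error (genome_F), whereas a penalty of 0.25 falls back to listing the three
genomes individually, which is the 'perfect but uninformative' explanation
described above.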

In some cases a list of genomes is acceptable, for example if a single
superfamily occurs in 5 very diverse genomes (diverse in terms of the
classification assigned to the genomes in the tree of life). One could
call this a 'universally distributed superfamily'. However, intuitively I
think that category belongs to superfamilies which have a high number of
assignments to (almost) every genome.

It suddenly strikes me that I have been trying to solve this question for
the past 3 years.

Does the above help clarify what I want to ask?

Cheers,
Dan.


