Re: Need to aggregate t-test stats



Here's the setup: my main goal is to "bless" a new database (db),
where "blessing" means that the new database isn't "statistically
different" from the old db. One way I thought of doing this is to
compare the string length (i.e. number of letters) of a given field
in each db to see whether the lengths agree. However, I know the
database was created by combining data from multiple sources, and
each record in the db carries its source-origin info. So, at least
to me, it makes sense to build a stronger test by checking the
consistency of the two dbs' string lengths separately for each
source. That is to say, I am partitioning the two samples into
multiple sub-samples, where each sub-sample holds all the records
from one source.

Rather than continue describing this in painfully abstract terms, I
will make this concrete:
Imagine that both databases have a "name" field. The names are
collected from different countries (i.e. the "source"). It may be
that one country messed up its data entry, so I want to test the
"name" lengths for each country. So, for instance, both databases
could have some names from England, China, France, etc. (hundreds of
countries), and let's say that just the French source screwed up
its data entry.
I can do a two-sample t-test to compare the length of English names
in db1 against db2, then another t-test for the Chinese names, and
likewise for the French, etc. Now I have hundreds of t-test p-values,
but I need (would like?) to combine the results to give a final
answer. The "French t-test" has a low p-value, but hey, there are
hundreds of tests, so that could be a (statistically insignificant)
coincidence.
Note that the number of records from a given country can differ
between the two dbs and from country to country (so each t-test has
a different t-distribution, given that different sample sizes ==>
different d.f.).
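
The per-country step described above can be sketched roughly as follows. This is only an illustration using the Python standard library: the dict-of-lists layout, the function names, and the toy numbers are all assumptions, and it uses the equal-variance (pooled) form of the two-sample t-test. It returns the t statistic and d.f. for each source; turning those into p-values needs a t-distribution CDF from a statistics package.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Return (t, df) for the equal-variance two-sample t-test.

    Assumes each sample has at least 2 values and the pooled
    variance is non-zero (otherwise this divides by zero).
    """
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled variance: d.f.-weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / se, df

def t_by_source(db1, db2):
    """db1, db2: hypothetical dicts mapping source -> list of name lengths."""
    shared = db1.keys() & db2.keys()
    return {src: pooled_t(db1[src], db2[src]) for src in shared}

# Made-up toy data: name lengths per country in each db.
db1 = {"England": [5, 6, 7, 5, 6], "France": [6, 7, 8, 7, 6]}
db2 = {"England": [5, 6, 7, 6, 5], "France": [10, 11, 12, 11, 10]}
results = t_by_source(db1, db2)
```

Note how each source gets its own d.f. (n1 + n2 - 2), which is exactly why the resulting t statistics do not share one reference distribution.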

If these tests were all homogeneous (measuring the same quantity) I
might use Bonferroni or the like, but that's not the case.
Furthermore, I simply don't like the idea of lowering the alpha on
each t-test when, if the dbs were the same (i.e. my overall null
hypothesis H0), I would expect a distribution of values where most
of them have high (safely H0) p-values (seems like a great way to
get false negatives). In fact, following that logic, my intuition
suggests that the "right" thing to do is to take the set of
t-statistics obtained from all the different (source) tests and
compare it to a t-distribution; if it fits, then even if some
p-values violate a typical per-t-test significance threshold I
wouldn't care (e.g. if you run a million tests, a single p-value of
0.01 doesn't say anything by itself, but if most of them were
0.01...).
Anyway, the above is a non-starter because, as mentioned above,
t-distributions are a function of sample size/degrees of freedom,
and those are different for each test (country). So I can't treat
the t-statistics as draws from a single t-distribution.

Another possibility that intuitively feels promising is to somehow
work with the final p-values. After all, regardless of the
underlying statistic (T vs. <put your stat here>), given a certain
number of tests there should only be so many low p-values.
Unfortunately, I don't yet see how to put that into practice
(assuming it's even sensible!).
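
One rough way to make the "only so many low p-values" idea concrete: if the tests are independent and every null hypothesis holds, each p-value falls at or below alpha with probability alpha, so the count of small p-values is Binomial(m, alpha). The sketch below (standard-library Python; the function names and the example p-values are made up for illustration) computes how surprising the observed count is. The independence assumption is doing real work here.

```python
import math

def binom_tail(k, m, alpha):
    """P(X >= k) for X ~ Binomial(m, alpha).

    Direct summation; fine for hundreds of tests, but math.comb
    values can overflow float conversion for much larger m.
    """
    return sum(
        math.comb(m, i) * alpha**i * (1 - alpha) ** (m - i)
        for i in range(k, m + 1)
    )

def too_many_small(pvalues, alpha=0.05):
    """Return (count of p-values <= alpha, probability of a count that large under H0)."""
    m = len(pvalues)
    k = sum(p <= alpha for p in pvalues)
    return k, binom_tail(k, m, alpha)

# Made-up example: 3 of 10 tests come in below 0.05.
count, tail = too_many_small([0.01] * 3 + [0.5] * 7, alpha=0.05)
```

A small tail probability says there are more low p-values than chance alone would explain; it does not say which sources are responsible.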

Maybe I can put this into a two-way ANOVA framework. That is, each
mean should be the same as its counterpart in the other db once the
influence of the source has been factored out.


The multivariate analog is Hotelling's T^2 test, whose statistic is proportional to an F under H0.

Another approach, if you can assume independence among the country results, is to use Fisher's rule for combining experiments.

-2*Sum(ln(Pi)) ~ X^2

where Pi is the p-value from the i-th country's test and the degrees of freedom of the chi-square is twice the number of countries.
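
A minimal sketch of the combination rule above, in standard-library Python. The function names are made up for illustration; the chi-square tail probability uses the closed form that exists when the degrees of freedom are even, which is always the case here (2k for k countries). It assumes independent tests and p-values strictly greater than zero.

```python
import math

def chi2_sf_even(x, k):
    """P(X > x) for a chi-square with 2k d.f.

    Closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!, valid only for
    even d.f.; numerically fine for a few hundred tests.
    """
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fisher_combine(pvalues):
    """Return (statistic, combined p-value) for -2 * Sum(ln(Pi))."""
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return stat, chi2_sf_even(stat, k)

# Made-up example: two independent tests, each with p = 0.05.
stat, combined_p = fisher_combine([0.05, 0.05])
```

With a single test the combined p-value reduces to the original p-value, which is a quick sanity check on the formula.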

Jack