Re: XML format for statistical data and analysis?
- From: Ken Starks <ken@xxxxxxxxxxxxxxxxxxxxx>
- Date: Thu, 15 Sep 2005 18:00:54 +0100
John Mauer wrote:
> Is there an accepted format for statistical data and analysis, or even
> just for a table of data? Most of the references I've found seem to end
> around 2002. I feel like I'm in a time warp.
>
> I need to have a format that can be transferred and read by other
> software. Thanks for the help.
I do use XML for much of my data.
Currently my main data analysis is done with 'R', but the same data set can
end up in a spreadsheet, or almost any other software.
The main advantage is that you can include in the same file all sorts of
commentary about how the data was gathered, as well as a human-readable
and machine-readable justification of any data-items you wish to exclude
because they are outliers or whatever.
The second advantage is that the same data can end up in entirely different
form, using LaTeX tables, or HTML tables, for example. All this is done
on-the-fly; you just have to change the ONE primary data source.
Some of my data sources go back another step, and the primary data storage
is in a database. The software does not need to know that, because the
database generates XML by means of JDBC and ESQL.
Data validation can be by a 'belt and braces' approach; you get validation
against database constraints, against an XML Schema, and (by hand) against
yourself in an 'Exploratory data analysis' mood.
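To give a feel for the automatic part of that, here is a minimal sketch in Python. It is not XML Schema validation and it is not what my Cocoon setup runs; it is a hand-rolled stand-in that only checks two things a schema would also catch. The function name `check_samples` and the deliberately broken records are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up records with two deliberate faults: a non-numeric Score
# and a gap in the record numbering.
XML_TEXT = """<Datasource recordheading="sample">
  <data>
    <sample n="1" Score="1.2" SetSize="4" />
    <sample n="2" Score="oops" SetSize="4" />
    <sample n="4" Score="3.13" SetSize="32" />
  </data>
</Datasource>
"""

def check_samples(xml_text):
    """Return a list of human-readable problems found in the sample records."""
    problems = []
    root = ET.fromstring(xml_text)
    for position, rec in enumerate(root.findall("./data/sample"), start=1):
        # Each record is expected to be numbered consecutively from 1.
        if rec.get("n") != str(position):
            problems.append(
                f"record {position}: n={rec.get('n')!r}, expected {position}")
        # Score must parse as a number.
        try:
            float(rec.get("Score", ""))
        except ValueError:
            problems.append(
                f"record {position}: Score={rec.get('Score')!r} is not numeric")
    return problems

problems = check_samples(XML_TEXT)
for line in problems:
    print(line)
```

A real schema validator would replace all of this; the point is only that such checks can run unattended before the data reaches the statistics software.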
Most of this is entirely automatic, using 'Cocoon' as the main software
behind the scenes. The XML is converted to a virtual data-frame for
consumption by R. To 'R' it just looks like a tsv or csv file that lives at
a given url, and R can read its data frames from a url.
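The flattening step can be sketched outside Cocoon as well. The following Python is an illustration under assumptions, not my actual pipeline: it takes a shortened copy of the kind of document I describe, uses the `recordheading` attribute to find the record elements, and writes one tab-separated row per record. The function name `datasource_to_tsv` is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A shortened document in the same shape as the full example in this post.
XML_TEXT = """<?xml version="1.0" ?>
<Datasource recordheading="sample">
  <Heading>CompositeFaces</Heading>
  <data>
    <sample n="1" Score="1.2" SetSize="4" />
    <sample n="2" Score="1.82" SetSize="4" />
    <sample n="21" Score="3.13" SetSize="32" />
  </data>
</Datasource>
"""

def datasource_to_tsv(xml_text):
    """Flatten the record elements of a Datasource document into TSV text."""
    root = ET.fromstring(xml_text)
    # The root element names the record element it contains.
    record_tag = root.get("recordheading", "sample")
    records = root.findall("./data/" + record_tag)
    # Column order follows the attribute order of the first record.
    columns = list(records[0].attrib) if records else []
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for rec in records:
        writer.writerow([rec.get(col, "") for col in columns])
    return out.getvalue()

tsv_text = datasource_to_tsv(XML_TEXT)
print(tsv_text)
```

Served at a url, output of this shape is something R can read with `read.delim`, header row included.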
I do not (yet) use an 'accepted format' that you ask for. I just make it up
as I go along. Currently I try to distinguish 'real' data from
'computer-generated' data by using the element-names 'sample' or
'measurement' only for the genuine item; that is just me and my hobby
horse.
Oh, and I don't type in the XML; it's far too verbose. I use a csv or tsv
format, which I discard as soon as the XML has been derived from it.
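Going in that direction is short enough to sketch. This Python is a rough illustration, not the tool I actually use; it assumes the csv has a header row and turns each data row into one record element with the columns as attributes. The function name `csv_to_datasource` and its parameters are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A few rows in the csv form that would be typed by hand.
CSV_TEXT = """n,Score,SetSize
1,1.2,4
2,1.82,4
21,3.13,32
"""

def csv_to_datasource(csv_text, heading, record_tag="sample"):
    """Build a Datasource document with one record element per csv row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element("Datasource", {"recordheading": record_tag})
    ET.SubElement(root, "Heading").text = heading
    data = ET.SubElement(root, "data")
    for row in reader:
        # Every column becomes an attribute; values stay as text.
        ET.SubElement(data, record_tag, dict(row))
    return ET.tostring(root, encoding="unicode")

xml_out = csv_to_datasource(CSV_TEXT, "CompositeFaces")
print(xml_out)
```

The commentary and any notes on excluded items would still be added to the XML afterwards; the csv only carries the bare numbers.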
The nearest I might possibly get to an accepted 'standard' format is the
XML you get from Microsoft Access. Unlike other Microsoft XML it seems to be
reasonably free from code-bloat. Some other databases also have their own
XML facilities, and you may wish to use those; it may be important to have
there-and-back or triangular identity of both data and data-types between
the database, the statistics software, and the xml.
But what I use currently looks something like this
( example from 'Statistical Methods for Psychology' by David C Howell ):
<?xml version="1.0" ?>
<!-- data from 'CompositeFaces' -->
<Datasource recordheading="sample" >
<Heading >
CompositeFaces
</Heading>
<!-- insert open-ended material here, using DocBook style elements
or LaTeX in CDATA
-->
<data>
<!-- Number of samples = 40 -->
<sample n="1" Score="1.2" SetSize="4" />
<sample n="2" Score="1.82" SetSize="4" />
<sample n="3" Score="1.93" SetSize="4" />
<sample n="4" Score="2.04" SetSize="4" />
<sample n="5" Score="2.3" SetSize="4" />
<sample n="6" Score="2.33" SetSize="4" />
<sample n="7" Score="2.34" SetSize="4" />
<sample n="8" Score="2.47" SetSize="4" />
<sample n="9" Score="2.51" SetSize="4" />
<sample n="10" Score="2.55" SetSize="4" />
<sample n="11" Score="2.64" SetSize="4" />
<sample n="12" Score="2.76" SetSize="4" />
<sample n="13" Score="2.77" SetSize="4" />
<sample n="14" Score="2.9" SetSize="4" />
<sample n="15" Score="2.91" SetSize="4" />
<sample n="16" Score="3.2" SetSize="4" />
<sample n="17" Score="3.22" SetSize="4" />
<sample n="18" Score="3.39" SetSize="4" />
<sample n="19" Score="3.59" SetSize="4" />
<sample n="20" Score="4.02" SetSize="4" />
<sample n="21" Score="3.13" SetSize="32" />
<sample n="22" Score="3.17" SetSize="32" />
<sample n="23" Score="3.19" SetSize="32" />
<sample n="24" Score="3.19" SetSize="32" />
<sample n="25" Score="3.2" SetSize="32" />
<sample n="26" Score="3.2" SetSize="32" />
<sample n="27" Score="3.22" SetSize="32" />
<sample n="28" Score="3.23" SetSize="32" />
<sample n="29" Score="3.25" SetSize="32" />
<sample n="30" Score="3.26" SetSize="32" />
<sample n="31" Score="3.27" SetSize="32" />
<sample n="32" Score="3.29" SetSize="32" />
<sample n="33" Score="3.29" SetSize="32" />
<sample n="34" Score="3.3" SetSize="32" />
<sample n="35" Score="3.31" SetSize="32" />
<sample n="36" Score="3.31" SetSize="32" />
<sample n="37" Score="3.34" SetSize="32" />
<sample n="38" Score="3.34" SetSize="32" />
<sample n="39" Score="3.36" SetSize="32" />
<sample n="40" Score="3.38" SetSize="32" />
</data>
</Datasource>