Re: XML format for statistical data and analysis?



John Mauer wrote:

> Is there an accepted format for statistical data and analysis, or even
> just for a table of data? Most of the references I've found seem to end
> around 2002. I feel like I'm in a time warp.
>
> I need to have a format that can be transferred and read by other
> software. Thanks for the help.

I do use XML for much of my data.
Currently my main data analysis is done with 'R', but the same data set can
end up in a spreadsheet, or almost any other software.

The main advantage is that you can include in the same file all sorts of
commentary about how the data was gathered, as well as a human-readable
and machine-readable justification of any data-items you wish to exclude
because they are outliers or whatever.

The second advantage is that the same data can end up in entirely different
form, using LaTeX tables, or HTML tables, for example. All this is done
on-the-fly; you just have to change the ONE primary data source.
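As a rough illustration of the one-source, many-outputs idea, here is a minimal Python sketch. It is a stand-in, not the Cocoon pipeline described in this post; the function names (`rows`, `to_html`, `to_latex`) are made up for the example, and the three rows are an excerpt in the same shape as the data file shown further down.

```python
import html
import xml.etree.ElementTree as ET

# Three-row excerpt in the same shape as the data file.
XML = """<Datasource recordheading="sample">
<data>
<sample n="1" Score="1.2" SetSize="4" />
<sample n="2" Score="1.82" SetSize="4" />
<sample n="21" Score="3.13" SetSize="32" />
</data>
</Datasource>"""


def rows(xml_text, record="sample"):
    """Return one dict of attributes per record element."""
    root = ET.fromstring(xml_text)
    return [dict(el.attrib) for el in root.iter(record)]


def to_html(records, columns):
    """Render the records as a bare HTML table."""
    head = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(r[c])}</td>" for c in columns) + "</tr>"
        for r in records
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_latex(records, columns):
    """Render the same records as a LaTeX tabular environment."""
    lines = ["\\begin{tabular}{" + "r" * len(columns) + "}"]
    lines.append(" & ".join(columns) + " \\\\")
    lines.append("\\hline")
    for r in records:
        lines.append(" & ".join(r[c] for c in columns) + " \\\\")
    lines.append("\\end{tabular}")
    return "\n".join(lines)


if __name__ == "__main__":
    recs = rows(XML)
    cols = ["n", "Score", "SetSize"]
    print(to_html(recs, cols))
    print(to_latex(recs, cols))
```

Both renderers read from the same list of records, so a correction to the XML shows up in every output form.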

Some of my data sources go back another step, and the primary data storage
is in a database. The software does not need to know that, because the
database generates XML by means of JDBC and ESQL.

Data validation can be by a 'belt and braces' approach; you get validation
against database constraints, against an XML Schema, and (by hand) against
yourself in an 'Exploratory data analysis' mood.

Most of this is entirely automatic, using 'Cocoon' as the main software
behind the scenes. The XML is converted to a virtual data-frame for
consumption by R. To 'R' it just looks like a tsv or csv file that lives at
a given url, and R can read its data frames from a url.
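The flattening step might look something like the Python sketch below. It is not the Cocoon transformation itself, only an illustration of the idea. It assumes that the `recordheading` attribute on the root names the record element, which is an inference from the example file further down, and `xml_to_delimited` is a made-up name.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_delimited(xml_text, delimiter=","):
    """Flatten the record elements of a Datasource document into CSV/TSV text."""
    root = ET.fromstring(xml_text)
    # Assumption: 'recordheading' names the element that holds one record.
    record = root.get("recordheading", "sample")
    records = [dict(el.attrib) for el in root.iter(record)]

    # Column order follows first appearance of each attribute name.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        '<Datasource recordheading="sample"><data>'
        '<sample n="1" Score="1.2" SetSize="4" />'
        '<sample n="21" Score="3.13" SetSize="32" />'
        "</data></Datasource>"
    )
    print(xml_to_delimited(sample, delimiter="\t"))
```

Served from a URL, text of this kind is what R would pick up with read.csv or read.delim.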

I do not (yet) use the 'accepted format' that you ask for. I just make it up
as I go along. Currently I try to distinguish 'real' data from
'computer-generated' data by using the element-names 'sample' or
'measurement' only for the genuine item; that is just me and my hobby
horse.

Oh, and I don't type in the XML; it's far too verbose. I use a csv or tsv
format, which I discard as soon as the XML has been derived from it.
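Deriving the XML from a typed-in csv could be done along these lines. This is a minimal Python sketch, not the tool actually used here; `csv_to_xml` is a made-up name, and the element and attribute names simply mirror the example file further down.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text, heading, record="sample", delimiter=","):
    """Build a Datasource document with one record element per CSV row."""
    root = ET.Element("Datasource", {"recordheading": record})
    ET.SubElement(root, "Heading").text = heading
    data = ET.SubElement(root, "data")
    # Each column header becomes an attribute name on the record element.
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=delimiter):
        ET.SubElement(data, record, dict(row))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    typed_in = "n,Score,SetSize\n1,1.2,4\n2,1.82,4\n"
    print(csv_to_xml(typed_in, "CompositeFaces"))
```

Column headers must be valid XML attribute names for this to produce well-formed output; the sketch does not check that.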

The nearest I might possibly get to an accepted 'standard' format is the
XML you get from Microsoft Access. Unlike other Microsoft XML it seems to be
reasonably free from code-bloat. Some other databases also have their own
XML facilities, and you may wish to use those; it may be important that both
the data and the data-types survive a round trip (there-and-back, or around
the triangle) between the database, the statistics software, and the XML.

But what I use currently looks something like this
(example from 'Statistical Methods for Psychology' by David C. Howell):

<?xml version="1.0" ?>
<!-- data from 'CompositeFaces' -->
<Datasource recordheading="sample" >
<Heading >
CompositeFaces
</Heading>

<!-- insert open-ended material here, using DocBook style elements
or LaTeX in CDATA
-->

<data>
<!-- Number of samples = 40 -->
<sample n="1" Score="1.2" SetSize="4" />
<sample n="2" Score="1.82" SetSize="4" />
<sample n="3" Score="1.93" SetSize="4" />
<sample n="4" Score="2.04" SetSize="4" />
<sample n="5" Score="2.3" SetSize="4" />
<sample n="6" Score="2.33" SetSize="4" />
<sample n="7" Score="2.34" SetSize="4" />
<sample n="8" Score="2.47" SetSize="4" />
<sample n="9" Score="2.51" SetSize="4" />
<sample n="10" Score="2.55" SetSize="4" />
<sample n="11" Score="2.64" SetSize="4" />
<sample n="12" Score="2.76" SetSize="4" />
<sample n="13" Score="2.77" SetSize="4" />
<sample n="14" Score="2.9" SetSize="4" />
<sample n="15" Score="2.91" SetSize="4" />
<sample n="16" Score="3.2" SetSize="4" />
<sample n="17" Score="3.22" SetSize="4" />
<sample n="18" Score="3.39" SetSize="4" />
<sample n="19" Score="3.59" SetSize="4" />
<sample n="20" Score="4.02" SetSize="4" />
<sample n="21" Score="3.13" SetSize="32" />
<sample n="22" Score="3.17" SetSize="32" />
<sample n="23" Score="3.19" SetSize="32" />
<sample n="24" Score="3.19" SetSize="32" />
<sample n="25" Score="3.2" SetSize="32" />
<sample n="26" Score="3.2" SetSize="32" />
<sample n="27" Score="3.22" SetSize="32" />
<sample n="28" Score="3.23" SetSize="32" />
<sample n="29" Score="3.25" SetSize="32" />
<sample n="30" Score="3.26" SetSize="32" />
<sample n="31" Score="3.27" SetSize="32" />
<sample n="32" Score="3.29" SetSize="32" />
<sample n="33" Score="3.29" SetSize="32" />
<sample n="34" Score="3.3" SetSize="32" />
<sample n="35" Score="3.31" SetSize="32" />
<sample n="36" Score="3.31" SetSize="32" />
<sample n="37" Score="3.34" SetSize="32" />
<sample n="38" Score="3.34" SetSize="32" />
<sample n="39" Score="3.36" SetSize="32" />
<sample n="40" Score="3.38" SetSize="32" />
</data>

</Datasource>




