Re: ping Jörg: on what 'raw data' is
- From: Jon Kirwan <jonk@xxxxxxxxxxxxxxxxxxx>
- Date: Fri, 11 Dec 2009 18:28:24 -0800
On Fri, 11 Dec 2009 17:57:52 -0800, Charlie E. <edmondson@xxxxxxxx>
wrote:
On Fri, 11 Dec 2009 17:18:49 -0800, Jon Kirwan
<jonk@xxxxxxxxxxxxxxxxxxx> wrote:
On Fri, 11 Dec 2009 16:05:52 -0800, Joerg <invalid@xxxxxxxxxxxxxxx>
wrote:
Jon Kirwan wrote:
Hi. I just came across (for entirely different reasons) two articles
by someone I very slightly know, Bob Grumbine. His blogs neatly
address the discussion you and I had regarding raw data.
This one nicely discusses _some_ of the problems:
http://moregrumbinescience.blogspot.com/2009/11/data-set-reproducibility.html
As does this one:
http://moregrumbinescience.blogspot.com/2009/11/where-is-surface.html
On both links, a poster, EliRabett, makes some interesting comments.
On the first, EliRabett writes, "Any auditor who demanded all of the
data would be fired summarily. A good audit provides reasonable
certainty that the records are in good shape without tying up the firm
forever. McIntyre's 'audit' demands all of the records at the start.
The purpose is to burden the scientist. He then yells and screams
about every little dot and jot almost 99% of which is due to his not
understanding what was being done. At the end maybe one or two points
remain. October's Briffa Fest is an excellent example."
On the second, EliRabett writes, "What the surface of a liquid is, is
by itself an interesting question. With the exception of liquids that
have vanishing vapor pressure such as gallium and mercury (at room
temperature), it is hard to say, as the molecules in the vapor can
interact with the molecules in the bulk over several atomic distances.
Ron Shen did a lot of early work on this with a very imaginative
technique. Since the molecules on the surface are in an anisotropic
environment, only they can participate in non-linear sum frequency
generation." And he then goes on to add an interesting comment from a
paper that I won't quote here. But if you don't see the relevance of
EliRabett's comment to the blog written by Bob, then you missed
understanding some of the difficulties that face instrumentation
designers and those scientists who must then also interpret their raw
data.
Regardless of your view, at the very least I hope you find these two
blogs interesting to read.
They are interesting. However, I don't think Eli has ever witnessed the
audit of a business. I have, many times. The auditors used me at times
to translate stuff for them because I spoke Dutch and they didn't on
days when the South African auditor was not there. I was amazed how fast
and how completely they crunched through tons of data.
I'm not going to argue that. I have no real knowledge here. I wasn't
thinking so much about the details of his being wrong about some other
specialty, as about the general thrust. Earlier, I talked about the
fact that scientists replicate, not by duplicating the steps of
another, but instead by being creative about another cross-cutting
approach that attacks a similar problem. His comments, though they
don't specifically say so, are really addressed in that vein.
Furthermore, do we know McIntyre demanded _all_ the data at once? Many
other sources say very different things. Just one example:
http://jennifermarohasy.com/blog/2009/08/raw-temperature-data-no-longer-available/
Again, I was more focused on having you read Bob's article. I
selected Eli's, only as a segue to the article.
Quote: "... Steve McIntyre was denied access to specific data files at
the Climate Research Unit ...". Specific does not mean all.
I wouldn't know.
Secondly, the assertion that they do that to "burden the scientist" is
not a proper fact statement, it is a clear case of judgement and the
writer had better back that up. I don't see where he did. That lessens
my interest in a certain piece of writing rather dramatically because in
my eyes it makes it lose credibility, whether it's an answer to a blog
or whatever.
Well, we are talking cross-purposes.
Now tell me what you think about what Bob had to say in those articles
rather than focusing upon Eli. You make me sorry I even brought him
up.
We had earlier discussed the concept of 'raw data.' Bob talks a
little about that subject. What did you think?
Just drop Eli into a bucket for all I care.
Jon
Hi Jon,
Hope you don't mind my stepping in.
There was a lot of good information in that blog, much of it more
telling than you might think because of the assumptions behind it.
I actually enjoyed reading them. I thought he was a little "too anal"
about the computer systems' details. The issues, though they exist,
aren't usually important (unless you have an older Intel processor
with the FDIV problem ;) and the right processing involved.
First, there were all the data sets, many stored in a binary format
that was probably unique to this researcher. He worried about being
able to use that data later, as he could easily forget how to
translate it to a usable format if he lost or used the incorrect data
translator. The data was represented by the blogger as already
'evaluated' to throw out outliers and probably bad data.
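One common guard against a forgotten translator is to make the binary file describe itself. What follows is only a minimal sketch under assumptions: the "OBS1" tag, the header layout and the function names are invented for illustration and are not taken from Bob's blog.

```python
import struct

MAGIC = b"OBS1"    # hypothetical format tag, invented for this sketch
HEADER = "<4sHI"   # little-endian: 4-byte tag, 16-bit version, 32-bit count


def pack_readings(readings, version=1):
    """Write a small header followed by the readings as 64-bit floats."""
    header = struct.pack(HEADER, MAGIC, version, len(readings))
    body = struct.pack(f"<{len(readings)}d", *readings)
    return header + body


def unpack_readings(blob):
    """Read the header first, then decode exactly the count it declares."""
    size = struct.calcsize(HEADER)
    magic, version, count = struct.unpack(HEADER, blob[:size])
    if magic != MAGIC:
        raise ValueError("unrecognized file tag")
    values = struct.unpack(f"<{count}d", blob[size:size + 8 * count])
    return version, list(values)
```

With the tag and version stored in the file, picking the wrong translator fails loudly instead of silently producing wrong numbers.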
Yes. I took his comments to mean that he used a program with
hopefully repeatable criteria, though. I had worked with a heart
specialist, many years back. He told me a story of another heart
researcher he knew who would ask his assistants to take readings from
patients and log them. However, when deciding to include or exclude
the data from being subjected to analysis, this researcher he was
telling me about would say, "Hmm. This can't be. That assistant
screwed up, I'm sure." And would toss out the data point. It was an
"on the spot" decision that could NOT be replicated, even by the same
man at two different times. That's a problem. What is necessary,
even if wrongly done (as was the case with NASA's satellite ozone
depletion data from about 1980 to 1985),
are repeatable methods that do NOT depend upon the mood of the
day of some human researcher. That's not a panacea, of course. But
it is a move in the right direction to move away from human vagaries.
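To make the contrast concrete, here is a minimal sketch of a repeatable exclusion rule. The specific rule (reject anything more than k scaled median absolute deviations from the median) and the name exclude_outliers are my assumptions for illustration; they are not the criteria used by anyone mentioned above.

```python
from statistics import median


def exclude_outliers(readings, k=3.0):
    """Split readings into (kept, rejected) using one fixed, written-down rule.

    Assumed rule for illustration: reject a reading if it lies more than
    k scaled median absolute deviations (MAD) from the median.
    """
    med = median(readings)
    mad = median([abs(x - med) for x in readings])
    scale = 1.4826 * mad  # scales MAD to match a standard deviation for normal data
    if scale == 0:
        # No spread to judge against, so reject nothing.
        return list(readings), []
    kept = [x for x in readings if abs(x - med) <= k * scale]
    rejected = [x for x in readings if abs(x - med) > k * scale]
    return kept, rejected
```

The rule may well be the wrong one for a given instrument, but anyone can rerun it on the same readings and get the same kept and rejected lists, which the "that assistant screwed up" approach can never offer.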
Whatever
original observations they were based on didn't seem to be kept.
I actually read his comments to say that he, personally, does try to
keep them. However, he isn't always sure that each and every slight
modification is kept. I've been there many times, myself. I make a
slight change because I find a "bug" in the way I process the data and
I correct that bug. There is no question in my mind that the previous
case was in error, so no question about applying the change. But I
may have previously presented a dataset to someone else based upon
that earlier analysis. Usually, since we are talking about what
amounts to averages (integrals, in effect) and not first or second
derivatives of the data, the effects are almost invisible. And in the
heat of moving from a point of more ignorance towards better knowledge
on the fast pace that I like to achieve for those depending on the
results, I may not even remember _if_ I filed a dataset from the
earlier run. This is where log books often save my ever living ass. I
can keep a paper diary of my work and by reviewing that I can usually
uncover what's what. But if you asked me to go do that for work I did
8 years ago, to haul out old books and old files, I would probably not
take kindly to it unless _you_ were someone I really cared about or
there was a convincing _new_ reason to question the results.
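A paper log book can be supplemented with a machine-written one. As a rough sketch only, assuming hypothetical names (provenance_record, script_version, params), each run could append a line tying the output to the exact input bytes and processing settings:

```python
import hashlib
import json


def provenance_record(raw_bytes, script_version, params):
    """Return one JSON line identifying the input data and how it was processed."""
    record = {
        "input_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "script_version": script_version,  # e.g. a revision label you assign
        "params": params,                  # the settings used for this run
    }
    # sort_keys makes the line identical for identical inputs
    return json.dumps(record, sort_keys=True)
```

Appending that line to a text file on every run would answer "did I file a dataset from the earlier run, and with which fix applied?" without relying on memory.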
This
concerns me, as that means we are never dealing with 'reality' (i.e.
the actual measurements) but instead we are working on a representation
of the representation. Every data set is actually a filtered data set,
with the filters being assumed to be correct. Those filters should be
the first thing documented, and should be very reproducible.
I think he addresses this well in his writing. But let me wander for
a moment.
ALL measurements... and I mean ALL!... are interpreted through the
light of theory. Without theory, all numbers are just random noise.
What does "six volts" mean? Well, it depends on the context and the
device doing the measurement and means of reading it and ... And each
of these depends upon prior, more prosaic theory. When a volt meter
reads out 6V, we can accept the reading if the unit has been recently
calibrated. But what is the traceability of that? Worse, the very
act of taking the measurement depends on a host of prosaic theories,
all of which themselves depend on still more prosaic theories and
still other measurements. Ultimately, this goes all the way down to
the very axiomatic basics of mathematics. Nothing stands by itself
and there is always room for criticism at some facet or point.
To be meaningful, raw data must be subjected to the light of theory.
As much theory as is appropriate. Theory is what gives meaning. It's
that simple. And that complex.
Scientists deal with theories, not reality. No one on this planet
today knows what reality is, or even if anything we do touches even
the slightest upon it. Do atoms exist? Science doesn't say. However,
it does say that we can predict well if we assume that atomic theory
applies and if we appropriately deduce it to the circumstances at hand.
I don't find it at all surprising or worrisome whether or not data is
about 'reality.' What I care about is whether or not the measurements
were made well, according to existing highly predictive theories, both
prosaic and otherwise, and that interpretation is made according to the
best we know how to do.
All data is interpreted through the light of theory. No shock to me.
Next, there were all his concerns about platform. Now in software
design, this is a common problem, and many of the comments were about
how these problems have already been solved in professional software
circles. My concerns here were that the algorithms seemed so platform
sensitive, requiring exact builds to get the same results. If your
algorithms are that platform sensitive, it indicates serious stability
problems in those underlying algorithms.
Actually, I think he pointed out that the algorithms were NOT
sensitive, so far as he was aware. Though he pointed out a case where
the numbers did differ (only very slightly, I gathered, down in the
least significant binary places) and they spent time tracking
that down and solving it. But that it didn't change the "results."
Which is as it should be.
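For anyone who hasn't seen how that kind of last-bit difference arises, here is a small sketch. It is not Bob's case, just the standard illustration that floating-point addition is not associative, so a different compiler or build that reorders a sum can change the final bits without changing the answer in any meaningful sense. The function name is mine.

```python
import math


def sum_two_ways(a, b, c):
    """Add the same three numbers with two different groupings."""
    forward = (a + b) + c
    backward = a + (b + c)
    return forward, backward


forward, backward = sum_two_ways(0.1, 0.2, 0.3)

# The two results are not bit-for-bit identical ...
bits_differ = forward != backward
# ... but they agree to well within any sensible tolerance.
same_result = math.isclose(forward, backward, rel_tol=1e-12)
```

Comparing results with a tolerance rather than demanding exact equality is the usual way to check that two platforms agree.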
Finally, on auditing, an auditor needs access to ANY data he desires,
and it is a huge red flag if some is not available. Now, a business
auditor doesn't review every single data point, but does review a very
large and carefully sampled set of them. When there are parts not
available, it becomes pretty much impossible to verify anything.
Well, again. That's just Eli's comment and, frankly, doesn't change
Bob's. So I'll leave it.
Jon