[AI] Massive science experiments pose data storage problems;

Sanjay ilovecold at gmail.com
Wed Jan 9 10:37:52 EST 2008


 Massive science experiments pose data storage problems; With
          ever more data being produced, it is critical to save
          it and preserve the software and hardware to access it

Paul Marks

"WHAT hath God wrought?" These are the words Samuel Morse sent in
1844 in the first telegram. We know this because the telegram
itself sits in the US Library of Congress. The same cannot be
said for the first email. Sent in 1971 by computer programmer Ray
Tomlinson, it probably contained the top row of letters on a
computer keyboard - "qwertyuiop" - or so Tomlinson thinks. It was
not saved, so we'll never know for sure.

The loss of a nonsensical email may seem trivial, but it
highlights a looming issue: how will we preserve the huge amount
of data produced by science experiments today in a way that
guarantees it will be accessible in the future?

Losing scientific data is nothing new. "Many space projects from
the 1970s, both at NASA and the European Space Agency, are either
lost or cannot be read with current computers and software," says
Peter Tindemans, an adviser on archiving technology to the
Netherlands government. "Science's funding bodies have not paid
for long-term storage repositories."

Now, with ever more data being produced, saving it is critical.
"Scientific data sets are becoming enormous," says Alexis-Michel
Mugabushaka, a policy adviser with the European Science
Foundation in Paris, France. "Saving them has to be a priority
for publicly funded research." The results of collisions inside
particle accelerators, for example, questionnaires filled in by
people taking part in clinical trials, and environmental readings
taken by distributed sensor networks are not merely historical
curiosities like Tomlinson's email. Scientists need to be able to
get at them in order to perform new analyses. They may also want
to scour the data for clues that the original researchers missed.
Stored data could even be used to rerun experiments to check for
signs of error or fraud.

The Large Hadron Collider (LHC) at CERN in Geneva, Switzerland,
illustrates just how daunting the problem can be. In May, it is
due to begin smashing high-energy protons together in a bid,
among other things, to discover the elusive Higgs boson, a
particle thought to be responsible for endowing matter with mass.
Sensors in the 27-kilometre circumference machine are expected to
generate 450 million gigabytes of data over its 15-year lifetime,
enough to fill 640 million CDs. The raw data will be stored on
discs and tapes and converted into a more accessible format which
can be made available to researchers via a grid of 100,000
computers around the world. Despite the magnitude of the project,
CERN has no idea if it will have the cash or technical resources
to preserve these data sets after the particle smasher has fired
its last proton beam in 2023.

Even if the raw data survives, it is useless without the
background information that gives it meaning. "The data needs to
be stored in a digestible, understandable form and be available
forever," says Jos Engelen, CERN's deputy director general. "But
we just don't have a long-term archival strategy for accessing
the LHC data." A $90 million slice of the LHC's $6.5 billion
budget has been allocated to processing and storing it, but that
only covers the years of the LHC's operation.

With luck, help will soon be on the way. Scientists and engineers
from around the world met at a conference in Brussels, Belgium,
on 15 November to thrash out which technologies and policies -
and even which human behaviours - will best preserve critical
data generated by Europe's scientists. In the US, the National
Science Foundation (NSF) is planning to spend $100 million
setting up and running up to five trial repositories for publicly
funded research data, and in Australia a government-backed body
wants to see a similar project established.

As well as providing money for storage, the NSF project, known as
DataNet, is on the lookout for new techniques for storing data.
"We do not believe any organisation is already providing the kind
of data preservation capability that we have in mind," says Lucy
Nowell, director of cyber-infrastructure projects at the NSF in
Arlington, Virginia.

Unlike existing repositories such as web search engines, which
continually update their indexes of web pages, an archive for an
experiment like the LHC must store data over a long time and
therefore hold copies of not just the data but also examples of
the software and hardware used to capture and access it. "Google
has massive data centres, but its emphasis is on current use and
analysis of the data, not on its preservation for decades to
come," Nowell says.

Most data storage media have a limited shelf life and eventually
degrade, so DataNet researchers will also study how to move
massive data sets from one storage medium, such as tape, to
another, such as hard disc. Although technologies exist for
migrating small amounts of data, large repositories require new
methods to ensure errors do not creep in.
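
To make that kind of safeguard concrete, here is a minimal
verify-after-copy sketch in Python. The file paths and function
names are purely illustrative and not part of any project
mentioned above: the file is copied, checksums of the original
and the copy are compared, and the migration is rejected if they
differ.

    import hashlib
    import shutil

    def sha256_of(path, chunk_size=1 << 20):
        """Compute the SHA-256 checksum of a file, reading it in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def migrate(source, destination):
        """Copy a file to new storage and verify the copy is identical."""
        before = sha256_of(source)
        shutil.copyfile(source, destination)
        after = sha256_of(destination)
        if before != after:
            raise IOError("checksum mismatch after migrating %s" % source)
        return after

    # Example with hypothetical paths:
    # migrate("/tape/run0001.dat", "/disk/run0001.dat")

Real repository software adds layers on top of this - multiple
replicas, audit logs, retries - but the basic check is the same:
never trust a copy until its checksum matches the original's.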

Repositories open to future generations of scientists will also
require the scientists who deposit the data to take account of
who might have access to it years later. For example, privacy
will be an issue when depositing data in an archive that could be
viewed by any number of future scientists, says Nowell.
"Scientists will
need to protect patient privacy in clinical trials data, working
out what types of data people should have access to and under
what conditions. They will also have to protect scientific data
from manipulation based on profit or political motives." The
NSF's DataNet project aims to iron out such behavioural issues by
coming up with best-practice guidelines.

In Australia, the incoming Labor government will soon be
considering a plan for what has been dubbed the Australian
National Data Service (ANDS) - an initiative proposed in October
by the eResearch Infrastructure Council. ANDS will also establish a
national network of research data repositories.

Similar efforts are planned for Europe. The European Commission
offers funds for research but not for operational costs. A lobby
group has recently been formed that plans to persuade European
politicians that about 2 per cent of each research grant should
be earmarked for long-term archiving. Called the Alliance for
Permanent Access (APA), it includes representatives from CERN,
the European Space Agency, the Max Planck Society in Germany, the
European Science Foundation, the UK's Rutherford Appleton
Laboratory, libraries and a raft of scientific journal publishers.

As well as securing money, the APA, like DataNet, is also focused
on studying new methods for digital preservation. Disc drives for
archiving need careful engineering, says the APA's technology
spokesman, David Giaretta of the Rutherford Appleton Laboratory.
One flipped bit in a cosmological data set could render it
useless, so drives must use smart self-checking routines. A
system called the integrated Rule-Oriented Data System (iRODS) at
the San Diego Supercomputer Center in California already does
this to monitor bit flips in the simulations it carries out. It
starts by computing and saving a condensed digital fingerprint of
the data, known as a checksum. When the data is later checked,
the checksum is recomputed and compared with the saved copy; if
even a single bit has flipped, the two won't match. Giaretta
hopes to adapt iRODS to check for bit flips in large archives.
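
A rough sketch of that checksum idea - not of iRODS itself, whose
internals the article does not describe - is shown below in
Python. It records a checksum for every file in an archive
directory, then later recomputes each one and reports any file
that no longer matches its saved value.

    import hashlib
    import json
    import os

    def checksum(path):
        """Return the SHA-256 hex digest of a file's contents."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest(archive_dir, manifest_path):
        """Record a checksum for every file currently in the archive."""
        manifest = {}
        for root, _, files in os.walk(archive_dir):
            for name in files:
                path = os.path.join(root, name)
                manifest[path] = checksum(path)
        with open(manifest_path, "w") as f:
            json.dump(manifest, f, indent=2)

    def audit(manifest_path):
        """Recompute each checksum and list files that no longer match."""
        with open(manifest_path) as f:
            manifest = json.load(f)
        return [path for path, saved in manifest.items()
                if checksum(path) != saved]

Run build_manifest once when data is deposited, then audit
periodically; any path it returns has suffered corruption and
should be restored from another copy.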

Another problem for archivists comes from open source software,
which is popular with scientists because of its low cost and the
ability to modify it to suit the needs of a particular
experiment. If part of an experiment uses an open-source program
for capturing data, there is no guarantee that it will still be
available on the web at a later date, or won't have changed
significantly. The APA says that scientists archiving data will
also have to archive any software they use. More generally, they
must "think archiving" while doing research, recording everything
from the full software environment and computer hardware to the
format and units of the data.

The APA concedes that archiving will cost extra money - cash some
will argue should be spent on scientific discovery - but insists
that it is essential if science's heritage is to be protected.
"If we don't get it," says Gerietta, "scientific data like Earth
observations, which can never be repeated, will be irretrievably
lost."




