Thursday, September 3, 2009

A Data Deluge Swamps Science Historians

As Paper Trails Fade, Digital Material Grows in Size and Complexity; How to Decipher Those 80-Column Punch Cards

By The Wall Street Journal

In a vault beneath the British Library here, curators grapple with a formidable challenge in digital life. The library's first curator of eManuscripts is working on ways to archive the deluge of computer data swamping scientists so that future generations can authenticate today's discoveries and better understand the people who made them.

Their task is only getting harder. Scientists who use laptops to collaborate via email, Google, YouTube, Flickr and Facebook are leaving fewer paper trails, while the information technologies that do document their accomplishments can be incomprehensible to other researchers and historians trying to read them. Computer-intensive experiments and the software used to analyze their output generate millions of gigabytes of data that are stored or retrieved by electronic systems that quickly become obsolete.

It would be tragic if there were no record of lives that were so influential.

Usually, historians are hard-pressed to find any original source material about those who have shaped our civilization. In the Internet era, scholars of science might have too much. Never have so many people generated so much digital data or been able to lose so much of it so quickly. Desktops world-wide generate enough digital data every 15 minutes to fill the U.S. Library of Congress.

In fact, more technical data have been collected in the past year alone than in all previous years since science began. The volume of data is doubling every year.
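The two claims above are consistent with each other under a simple model. As a rough sketch, assume that "doubling" refers to the amount of data produced per year (an interpretation, not something the article states) and that the doubling is exact. Then the most recent year always outweighs every earlier year combined. The function name and the ten-year span are illustrative only.

```python
def yearly_output(years, first_year=1.0):
    """Data produced in each year, assuming output doubles annually.

    The unit is arbitrary; only the ratios matter.
    """
    return [first_year * 2 ** n for n in range(years)]


if __name__ == "__main__":
    output = yearly_output(10)          # ten hypothetical years
    latest = output[-1]                 # most recent year
    earlier_total = sum(output[:-1])    # all previous years combined
    print(f"latest year: {latest:.0f} units")
    print(f"all earlier years: {earlier_total:.0f} units")
    print("latest exceeds all earlier:", latest > earlier_total)
```

Under this idealized model the latest year exceeds the sum of all earlier years by exactly one starting unit, however many years are included.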

The problem is forcing historians to become scientists, and scientists to become archivists and curators. Digital records, unlike laboratory notebooks, can't be read without the proper hardware, software and passwords. Electronic copies are difficult to verify and are easy to alter or forge. Digital records can be more direct, more immediate and more candid. But how can we demonstrate to people in the future that these are the real thing?

These files likely contained crucial drafts of research papers, emails and other information that could illuminate an influential life of science, as recorded through 40 years of computing technology.

Extracting the antiquated data required more than a password. A scientist gradually assembled a collection of vintage computers, old tape drives and forensic data-recovery devices in a locked library sub-basement.

For more than a decade, policy makers and data experts have been debating the best way to preserve important digital records. What you keep and how you pay for it are difficult issues.

The growing scale of new science projects, however, has university data custodians worried. We are swimming in data these days, and people are overwhelmed.

Consider a new computerized star atlas called the Sloan Digital Sky Survey. Using a telescope in New Mexico, the project in its first two days collected more data than had been gathered in all the previous history of astronomy. Its final data set catalogs 230 million celestial objects, encompassing 930,000 galaxies, 120,000 quasars and 225,000 stars, all encoded in 140 terabytes of digital data.

The next generation of experiments will be even more data-intensive. A new proton smasher near Geneva called the Large Hadron Collider is supposed to produce 15 million gigabytes of data annually -- enough to fill more than 1.7 million dual-layer DVDs every year. The Large Synoptic Survey Telescope, an astronomy program under construction in northern Chile and slated to launch in 2016, will regularly image the entire sky, recording more than 30,000 gigabytes of data every night.
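The DVD comparison can be checked with simple arithmetic. The sketch below assumes decimal gigabytes and nominal disc capacities of 4.7 GB for a single-layer DVD and 8.5 GB for a dual-layer DVD; those capacities are assumptions brought in for illustration, not figures from the article. On these assumptions the 1.7 million figure corresponds to dual-layer discs, and single-layer discs would take nearly twice as many.

```python
import math

# Assumed nominal disc capacities in decimal gigabytes (not from the article).
DVD_SINGLE_LAYER_GB = 4.7
DVD_DUAL_LAYER_GB = 8.5

# Data volumes quoted in the article.
LHC_GB_PER_YEAR = 15_000_000
LSST_GB_PER_NIGHT = 30_000


def discs_needed(total_gb, disc_gb):
    """Return how many discs of capacity disc_gb are needed to hold total_gb."""
    return math.ceil(total_gb / disc_gb)


if __name__ == "__main__":
    dual = discs_needed(LHC_GB_PER_YEAR, DVD_DUAL_LAYER_GB)
    single = discs_needed(LHC_GB_PER_YEAR, DVD_SINGLE_LAYER_GB)
    nightly = discs_needed(LSST_GB_PER_NIGHT, DVD_DUAL_LAYER_GB)
    print(f"LHC, dual-layer DVDs per year:   {dual:,}")
    print(f"LHC, single-layer DVDs per year: {single:,}")
    print(f"LSST, dual-layer DVDs per night: {nightly:,}")
```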

Earlier this month, the U.S. National Science Foundation awarded $20 million to the Data Conservancy and another $20 million to the DataONE group to develop more effective data-preservation tools over the next five years, especially for researchers working on their own or in small teams.

For future generations to get much use from 21st-century data, though, it won't be enough to simply archive email exchanges and file formats. The problem is to actually capture the way scientists interact with the data. Today's graduate students are starting to use instant messaging in their scientific work. We have to figure out how to capture these.

In the long run, no scientific data can outlast the storage medium that contains it, unless it can be accurately recopied and reliably re-authenticated. Many computer CDs, DVDs and flash drives last only a decade or so. The oldest known star atlas, inscribed on a scroll discovered in Dunhuang, China, has survived for more than 1,000 years. It might have been traced from an even older star map.
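One common way to confirm that a file has been accurately recopied is to compare a cryptographic digest taken before and after the copy. The following is a minimal sketch of that idea using SHA-256 from Python's standard library. It is not the method used by any archive named in the article, and the function names are hypothetical. A matching digest shows only that the bytes are unchanged; showing that a record is genuine also requires trusting where the original digest came from.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recopy_and_verify(source, destination):
    """Copy source to destination and report whether the digests match."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    return sha256_of(destination) == expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        original = Path(folder) / "notebook.txt"    # hypothetical record
        migrated = Path(folder) / "notebook_copy.txt"
        original.write_bytes(b"observations, draft 1\n")

        print("copy verified:", recopy_and_verify(original, migrated))

        # Simulate silent corruption or tampering on the new medium.
        migrated.write_bytes(b"observations, draft 2\n")
        print("still matches:", sha256_of(migrated) == sha256_of(original))
```

Storing the digest alongside each record lets the same check be repeated every time the data moves to a new medium.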

Earlier this year, researchers unveiled a memory chip designed to last for centuries. In April, physicists published the design of a digital device that could store data for a billion years, at least in theory.