New technologies and open-source software to help the Library of Congress save 'brittle' history

The Library of Congress, where thousands of rare public domain documents relating to America's history are stored and slowly decaying, is about to begin an ambitious project to digitize these fragile documents using Linux-based systems and publish the results online in multiple formats.

Thanks to a $2 million grant from the Sloan Foundation, "Digitizing American Imprints at the Library of Congress" will begin the task of digitizing these rare materials -- including Civil War and genealogical documents, technical and artistic works concerning photography, scores of books, and the 850 titles written, printed, edited, or published by Benjamin Franklin.

According to Brewster Kahle of the Internet Archive, which developed the digitizing technology, open source software will play an "absolutely critical" role in getting the job done...

Image processing for an average book takes about 10 hours on the cluster [of Linux computers], and while the project still uses proprietary optical character recognition (OCR) software, Kahle says that many open source applications come into play...and the software performs "a lot of image manipulation, cropping, de-skewing, correcting color to normalize it -- [it] does compression, optical character recognition, and packaging into a searchable, downloadable PDF; searchable, downloadable DjVu files; and an on-screen representation we call the Flip Book."...

A good number of the historic materials in question are old, fragile, and in such rough shape that placing them in Scribe's cradle, or even attempting to read them, could irreparably damage them.

[Dr. Jeremy E. A. Adamson, the library's director for collections and services] says that some of the books, for example, have pages "that have become brittle with age." Adamson says these materials are in a broad range of conditions that limit their physical handling, but he uses the general term "brittle books" to describe them. No list of such brittle materials at the Library of Congress has been made, but Adamson says that "they comprise a percentage of virtually every collection."

Adamson says the project's objectives are to develop a more formal classification and description of these "brittle" materials and to "establish digitization workflows based on that classification of condition."

Read entire article at Linux.com