Six Libraries To Archive A Copy Of British Internet


The new Legal Deposit regulation will require publishers to allow qualified libraries to take copies of all digital content

From Saturday, six major libraries in the UK will start archiving digital content, hoping eventually to hold a copy of every website hosted in the .uk domain space.

The new Legal Deposit regulation will enable the British Library, the National Library of Scotland, the National Library of Wales, the Bodleian Libraries, Cambridge University Library and Trinity College Library in Dublin to collect a copy of every digital publication in Britain, just as they do with print editions.

The archive will serve to preserve the nation’s cultural heritage and make it available to future generations. In time, it could evolve into a database holding every public tweet or Facebook page.

Digital content comes of age

Libraries in the UK have been archiving printed media for centuries – the British Library alone stores 150 million physical entries “representing every age of written civilisation”. This enormous collection was made possible by Legal Deposit, a practice enshrined in English law since 1662 that requires publishers to submit copies of their work to one of the officially certified libraries.

A decade ago, the government decided to update the law with the Legal Deposit Libraries Act of 2003, which extended the rules to include e-books, CDs, DVDs and websites. It will officially come into force on 6 April.

“Preserving and maintaining a record of everything that has been published provides a priceless resource for the researchers of today and the future. So it’s right that these long-standing arrangements have now been brought up to date for the 21st century, covering the UK’s digital publications for the first time,” said Culture Minister Ed Vaizey.

“Digital content can now be effectively archived and our academic and literary heritage preserved, in whatever form it takes.”

Initially, the project will ‘harvest’ 4.8 million websites containing over a billion pages, including copies of password-protected or paid-for content. By the end of this year, the results of the first archiving crawl of the .uk domain will be available to researchers. As for the public, access to non-print materials will be offered through on-site reading room facilities at each of the participating libraries.

The capacity of the archive is expected to be constantly upgraded over the coming years. The system – which includes the open source Hadoop software – has been built by IBM.

According to project leader Lucie Burgess, a lot of material related to events such as the 7/7 bombings or the 2008 financial crisis has already been lost or taken down.

“We will have to distinguish between content published in the UK and elsewhere but in principle we will be able to archive the publicly available tweets of any individual, company or organisation,” Burgess told AFP.

“Ten years ago, there was a very real danger of a black hole opening up and swallowing our digital heritage, with millions of web pages, e-publications and other non-print items falling through the cracks of a system that was devised primarily to capture ink and paper,” said Roly Keating, CEO of the British Library.

“The regulations now coming into force make digital legal deposit a reality, and ensure that the Legal Deposit Libraries themselves are able to evolve – collecting, preserving and providing long-term access to the profusion of cultural and intellectual content appearing online or in other digital formats.”
