Web data is vanishing – but copyright laws make it hard for the British Library to capture it in its archive
The British Library has launched a web archive designed to preserve pages from UK web domains, much as the library preserves a physical archive of British books and other publications. The system – which includes the open source Hadoop software – has been built by IBM.
The archive, to be launched this evening, will include special sites gathering together web material referring to specific subjects, including the Credit Crunch and Antony Gormley’s Trafalgar Square “Fourth Plinth” project from summer 2009. A site covering this year’s general election is already planned.
“Fifteen petabytes of data is created daily,” said David Boloker, IBM’s chief technology officer for emerging Internet technologies – and much of this data is never saved or stored, he said. Ten percent of UK websites disappear within six months, and the average life expectancy of data online is around 44 to 75 days, the BL project claims.
The project follows at least six years of work by the British Library – mostly on the practicality and legality (under copyright law) of copying UK web data.
“Since 2004, the British Library has led the UK Web Archive in its mission to archive a record of the major cultural and social issues being discussed online,” said British Library chief executive, Dame Lynne Brindley. “Throughout the project the Library has worked directly with copyright holders to capture and preserve over 6,000 carefully selected websites, helping to avoid the creation of a ‘digital black hole’ in the nation’s memory.”
The library has been working to extend the Legal Deposit requirement, under which publishers must give the library a copy of each publication, so that it also covers online material. “Limited by the existing legal position, at the current rate it will be feasible to collect just one percent of all free UK websites by 2011. We hope the current DCMS consultation will enact the 2003 Legal Deposit Libraries Act and extend the provision of legal deposit through regulation to cover freely available UK websites, providing regular snapshots of the free UK web domain for the benefit of future research.”
The UK web domain currently houses eight million sites, and is rapidly expanding, providing a continuously updated archive of social and cultural issues in Britain, the Library said. “Despite common misperceptions, material that is freely available on the web is still subject to copyright and cannot be archived without permission,” said Dame Lynne, “a time-consuming, expensive, and often impossible task.”
The archive uses BigSheets, a mashup project IBM has built, based on the open source Hadoop project from Apache. Hadoop is also in use by public cloud services such as Amazon’s, and distributed by Yahoo!
The BigSheets prototype is in use within six organisations worldwide, said Boloker – the others include commercial organisations in fields such as pharmaceuticals. It is designed to handle unstructured data from Web-based repositories, which it “enriches” with what IBM describes as an “unstructured information management architecture”, supporting things like tag clouds.