The British Library has launched a web archive designed to preserve pages from UK web domains, much as the library preserves a physical archive of British books and other publications. The system – which includes the open source Hadoop software – has been built by IBM.
The archive, to be launched this evening, will include special sites gathering together web material referring to specific subjects, including the Credit Crunch and Antony Gormley’s Trafalgar Square “Fourth Plinth” project from summer 2009. A site covering this year’s general election is already planned.
“Fifteen petabytes of data is created daily,” said David Boloker, IBM’s chief technology officer for emerging Internet technologies – and much of this data is never saved or stored, he said. Ten percent of UK websites disappear within six months, and the average life expectancy of data online is around 44 to 75 days, the British Library project claims.
The project follows at least six years of work by the British Library – mostly on the practicality and legality (under copyright law) of copying UK web data.
“Since 2004, the British Library has led the UK Web Archive in its mission to archive a record of the major cultural and social issues being discussed online,” said British Library chief executive, Dame Lynne Brindley. “Throughout the project the Library has worked directly with copyright holders to capture and preserve over 6,000 carefully selected websites, helping to avoid the creation of a ‘digital black hole’ in the nation’s memory.”
The library has been working to extend the Legal Deposit requirement, under which publishers must give a copy of each publication to the library, so that it also covers online material. “Limited by the existing legal position, at the current rate it will be feasible to collect just one percent of all free UK websites by 2011. We hope the current DCMS consultation will enact the 2003 Legal Deposit Libraries Act and extend the provision of legal deposit through regulation to cover freely available UK websites, providing regular snapshots of the free UK web domain for the benefit of future research.”
The UK web domain currently houses eight million sites, and is rapidly expanding, providing a continuously updated archive of social and cultural issues in Britain, the Library said. “Despite common misperceptions, material that is freely available on the web is still subject to copyright and cannot be archived without permission,” said Dame Lynne, “a time consuming, expensive, and often impossible task.”
The archive uses BigSheets, a mashup project IBM has built, based on the open source Hadoop project from Apache. Hadoop is also in use by public cloud services such as Amazon’s, and distributed by Yahoo!
To ensure that their website content is archived for the future, organisations can automatically save daily screenshots of all their web pages, which are then kept for compliance, legal or general interest purposes.
Cloud Testing, a UK company, has just launched its service Website-Archive, which is available at http://www.website-archive.com/