British Library Launches UK Web Archive

The British Library has launched a web archive designed to preserve pages from UK web domains, much as the library preserves a physical archive of British books and other publications. The system, which includes the open source Hadoop software, has been built by IBM.

The archive, to be launched this evening, will include special sites gathering together web material referring to specific subjects, including the Credit Crunch and Antony Gormley’s Trafalgar Square “Fourth Plinth” project from summer 2009. A site covering this year’s general election is already planned.

“Fifteen petabytes of data is created daily,” said David Boloker, IBM’s chief technology officer for emerging Internet technologies, and much of this data is never saved or stored, he said. Ten percent of UK websites disappear within six months, and the average life expectancy of data online is around 44 to 75 days, the BL project claims.

The project follows at least six years of work by the British Library – mostly on the practicality and legality (under copyright law) of copying UK web data.

“Since 2004, the British Library has led the UK Web Archive in its mission to archive a record of the major cultural and social issues being discussed online,” said British Library chief executive Dame Lynne Brindley. “Throughout the project the Library has worked directly with copyright holders to capture and preserve over 6,000 carefully selected websites, helping to avoid the creation of a ‘digital black hole’ in the nation’s memory.”

The library has been working to extend the Legal Deposit requirement, under which publishers must give a copy of each publication to the library, and there are plans to extend the requirement to online material. “Limited by the existing legal position, at the current rate it will be feasible to collect just one percent of all free UK websites by 2011. We hope the current DCMS consultation will enact the 2003 Legal Deposit Libraries Act and extend the provision of legal deposit through regulation to cover freely available UK websites, providing regular snapshots of the free UK web domain for the benefit of future research.”

The UK web domain currently houses eight million sites, and is rapidly expanding, providing a continuously updated archive of social and cultural issues in Britain, the Library said. “Despite common misperceptions, material that is freely available on the web is still subject to copyright and cannot be archived without permission,” said Dame Lynne, “a time consuming, expensive, and often impossible task.”

The archive uses BigSheets, a mashup project IBM has built, based on the open source Hadoop project from Apache. Hadoop is also in use by public cloud services such as Amazon’s, and distributed by Yahoo!

The BigSheets prototype is in use within six organisations worldwide, said Boloker; the others include commercial organisations in fields such as pharmaceuticals. It is designed to handle unstructured data from Web-based repositories, which it “enriches” with what IBM describes as an “unstructured information management architecture”, supporting things like tag clouds.

Peter Judge

Peter Judge has been involved with tech B2B publishing in the UK for many years, working at Ziff-Davis, ZDNet, IDG and Reed. His main interests are networking security, mobility and cloud.

