Google Developing Caffeine Storage System

This new storage system will include more diagnostic and historic data and autonomic software

Google has been ahead of its time in more than just Web search and online consumer tools. Out of sheer necessity, it’s also been way ahead of the curve in designing massive-scale storage systems built mostly on off-the-shelf servers, storage arrays and networking equipment.

As the world’s largest Internet search company continues to grow at a breakneck pace, it is now in the process of creating its second custom-designed data storage file system in 10 years.

This new storage system, the back end of the new Caffeine search engine that Google introduced Aug. 10 and is now testing, will include more diagnostic and historic data and autonomic software, so the system can think more for itself and solve problems long before human intervention is actually needed.

Who knew 10 years ago, when it was the newbie on the block to Yahoo’s market-leading search engine, that Google would grow into a staple of Internet organization that is relied upon by hundreds of millions of users each day?

Just before Rackable sold Google its first 10,000 servers in 1999 and started the company on a server-and-array collection rampage than may total in the hundreds of thousands of boxes, Google engineers were pretty much into making their own servers and storage arrays.

“In 1999, at the peak of the dot-com boom when everybody was buying nice Sun machines, we were buying bare motherboards, putting them on corkboard, and laying hard drives on top of it. This was not a reliable computing platform,” Sean Quinlan, Google’s lead software storage engineer, said with a laugh at a recent storage conference. “But this is what Google was built on top of.”

It would be no surprise to any knowledgable storage engineer that this rudimentary file system had major problems with overheating to go with numerous networking and PDU failures.

“Sometimes, 500 to 1,000 servers would disappear from the system and take hours to come back,” Quinlan said. “And those were just the problems we expected. Then there are always those you didn’t expect.”

Eventually, Google engineers were able to get their own clustered storage file system — called, amazingly enough, Google File System (GFS) — up and running with decent performance to connect all these quickly custom-built servers and arrays. It consisted of what Quinlan called a “familiar interface, though not specifically Posix. We tend to cut corners and do our own thing at Google.”

What Google was doing was simply taking a data center full of machines and layering a file system as an application across all the servers to get open/close/read/write, without really caring where the data is in the machine, Quinlan said.

But there was a big problem. The GFS lacked something very basic: automatic failover if the master went down. Admins had to manually restore the master, and Google went dark for as long as an hour at times. Although failover was later added, when it kicked in it was annoying to users, because the lapse often was several minutes in length. Quinlan says it’s down now to about 10 seconds.

Eventually, the growth of the company and its subsequent IPO in 2004 spurred even more growth, so a modification to the file system was designed and built. This was called BigTable (developed in 2005-06), a distributed database-like file system built atop GFS with its own “familiar” interface; Quinlan said it is not Microsoft SQL.

This is the part of the system that runs user-facing applications. There are hundreds of instances (called cells) of each of these systems, and each of those cells scales up into thousands of servers and petabytes of data, Quinlan said.

At the base of much of this are Rackable’s Eco-Logical storage servers, which are clustered to run on Linux to produce storage capacity as high as 273TB per cabinet. Of course, Google now uses a wide array of storage vendors, because it’s all but impossible for one vendor to supply the huge number of boxes needed by the search monster each year.

The Eco-Logical storage arrays feature high-efficiency, low-power consumption and intelligent design intended to improve price performance per watt, in even very complex computing environments, Geoffrey Noer, Rackable’s senior director of product management, told eWEEK.

The original Google storage file systems have served the company very well; the company’s overall performance proves this. But now, in 2009, the continued stratospheric growth of Web, business and personal content and ever-increasing demands to keep order on the Internet mean that Quinlan and his team have had to come up with yet another super-file system.

Although Google folks will not officially sanction this information for general consumption, this overhaul of the Google File System apparently has been undergoing internal testing as part of the company’s new Caffeine infrastructure announced earlier this month.

Google on 10 Aug introduced a new “developer sandbox” for a faster, more accurate search engine and invited the public to test the product and provide feedback about the results. The sandbox site is here; as might be expected, there’s also a new storage file system behind it.

“By far the biggest challenge is dealing with the reliability of the system. We’re building on top of this really flaky hardware — people have high expectations when they store data at Google and with internal applications,” Quinlan said.

“We are operating in a mode where failure is commonplace. The system has to be automated in terms of how to deal with that. We do checksumming up the wazoo to detect errors, and using replication to allow recovery.”

Chunks of data, distributed throughout the vast Google system and subsystems, are replicated on different “chunkserver” racks, with triplication default and higher-speed replication relegated for hot spots in the system.

“Keeping three copies gives us reliability to allow us to survive our failure rates,” Quinlan said.

Replication enables Google to use the full bandwidth of the cluster, reduces the window of vulnerability, and spreads out the recovery load so as not to overburden portions of the system. Google uses the University of Connecticut’s Reed-Solomon error correction software in its RAID 6 systems.

Google stores virtually of its data in two forms: RecordIO — “a sequential series of records, typically representing some sort of log,” Quinlan said — and SSTables.

“SStables are immutable, key/value pair, sorted tables with indexes on them,” Quinlan said. “Those two data structures are fairly simple; there’s no update in place. All the records are either sequential through the RecordIO or streaming through the SSTable. This helps us a lot when building these [new] reliable systems.”

As for the semi-structured data storage, stored in BigTable’s row/column/timestamp subsystem, the URLs, the per-user data and the geographic locations are the data sets stored that are constantly being updated.

“And the scale of these things is large, with the size of the Internet and the number of people using Google,” Quinlan said, in an understatement. Google is storing billions of URLs, hundreds of millions of page versions (with an average size of 20KB per data file version), and hundreds of terabytes of satellite image data. Hundreds of millions of users use Google daily.

When the data is stored into tables, Google then breaks up tables into chunks called tablets. “These are the basics that are distributed around our system,” Quinland said. “This is a simple model, and it’s worked fairly effectively.”

How the basic Google search system works: “A request comes in. We log it in GFS; it updates the storage. We then buffer it in memory in a sorted table. When that memory buffer fills up, we write that out as an SSTable; it’s immutable data, it’s locked down, we don’t modify it.

“The request then reads through SSTables [to find the query answer].”

This is a fairly straightforward and simple process, Quinlan said. At the rate the Google search engine is used on a day-to-day basis, it has to be simple.

Scale remains the biggest issue. “Everything’s getting bigger, we’re growing exponentially. We’re not quite an exabyte system today, but this is definitely in the not-too-distant future,” Quinlan said. “I get blase about petabytes now.”

More automated operation is in the cards. “Our ability to hire people to run these systems is not growing exponentially, so have to automate more and more. We want to bring what used to be done manually into the systems. We want to bring more and more history of information about what’s going on in the system — to allow the system itself to diagnose slow machines, diagnose various problems and rectify them itself,” Quinlan said.

How to build these systems on a much more global basis is another Quinlan goal.

“We have many data centers across the world,” he said. “On an application point of view, they all need to know exactly where the data is. They’ll often have to do replication across data centres for availability, and they have to partition their users across these data centers. We’re trying to bring that logic into the storage systems themselves.”

So, as Caffeine search is being tested now, so is the new storage file system. Google hopes this will be one that is flexible and self-healing enough to be around for awhile.