Facebook Data Centre Engineer Develops Troubleshooting Tool

A Facebook engineer has built a heatmap monitoring tool to help troubleshoot data centre infrastructure

A Facebook data centre engineer has built a monitoring tool called Claspin that uses heatmaps to troubleshoot potential problems in data centres.

The engineer, Sean Lynch, described in a blog post how he and his fellow engineers must keep the social networking giant’s cache systems healthy by quickly identifying and fixing problems with servers, racks or clusters.

Facebook Data Centre

Facebook has two major cache systems, said Lynch. “Memcache is a simple lookaside cache with most of its smarts in the client, and TAO, a caching graph database that does its own queries to MySQL.”

“Between these two systems, we have literally thousands of charts, some of which are collected into dashboards showing various latency, request rate, and error rate statistics collected by clients and servers,” wrote Lynch.

Lynch described how these dashboards worked well at first, but as Facebook grew and its systems became increasingly complex, it became more and more difficult to figure out which piece was broken when something went wrong. He then started to think about a tool or system that would provide quick visual insights into the status of cache, “analogous to meters and traffic lights.”

Lynch said the name Claspin came from a suggestion by a friend with a background in organic chemistry: claspin is a protein that monitors for DNA damage in a cell.

Heatmap Tool

Lynch’s first attempt produced a command line tool that output a lot of text. But he wanted something that would represent potential problems visually, and settled on the idea of heatmaps.

“I’d been fond of heatmaps for quite a while, but it wasn’t entirely clear to me how to lay out this data in two dimensions in a way that would be meaningful to the user. It seemed somewhat obvious that we wanted each “pixel” of the heatmap to represent a host, with racks grouped together,” wrote Lynch. “However, our racks don’t necessarily have the same number of hosts in them, and it wasn’t clear how to colour individual hosts when we have about a dozen metrics for each. Eventually I realised that all we cared about was whether anything was wrong with a host. So I settled on colouring a host by its “hottest” statistic, with hotness computed from predefined thresholds. It’s dirt simple, but it gives us a way to encode tribal knowledge about what values are “bad” into the view.”

Lynch said that hosts that are missing a stat are coloured black, indicating that the host is probably down.
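The colouring rule Lynch describes can be illustrated with a minimal Python sketch. It is not Facebook's code: the stat names, the threshold values, the three-colour scale and the `hotness` and `host_colour` functions are all assumptions made for illustration. Only the two ideas come from Lynch's account: a host takes the colour of its hottest statistic, and a host missing a stat is drawn black.

```python
from typing import Dict, Tuple

# Hypothetical thresholds: stat name -> (warning level, critical level).
# These names and numbers are illustrative, not Facebook's real values.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "latency_ms": (5.0, 20.0),
    "error_rate": (0.01, 0.05),
    "timeouts_per_s": (1.0, 10.0),
}


def hotness(value: float, warn: float, crit: float) -> float:
    """Scale a value to [0, 1]: 0 at or below warn, 1 at or above crit."""
    if value <= warn:
        return 0.0
    if value >= crit:
        return 1.0
    return (value - warn) / (crit - warn)


def host_colour(stats: Dict[str, float]) -> str:
    """Colour a host by its hottest stat; black if an expected stat is missing."""
    if any(name not in stats for name in THRESHOLDS):
        return "black"  # a missing stat suggests the host is probably down
    hottest = max(hotness(stats[name], *THRESHOLDS[name]) for name in THRESHOLDS)
    if hottest >= 1.0:
        return "red"
    if hottest > 0.0:
        return "yellow"
    return "green"
```

Because only the maximum matters, adding another metric never makes the view harder to read: it can only turn a host hotter, which matches Lynch's point that all the team cared about was whether anything was wrong with a host.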

He eventually settled on a separate heatmap per cluster, ordered by rack number and with each rack drawn vertically in an alternating “snake” pattern so racks would stay contiguous even if they wrapped around the top or bottom. “The rack names naturally sort by data centre, then cluster, then row, so problems common at any of these levels are readily apparent,” he wrote.
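One way to read the “snake” layout is as a column-by-column fill of a fixed-height grid in which every other column runs in the opposite direction. The Python sketch below follows that reading; the `snake_layout` function, its parameters and the example rack names are hypothetical, and Claspin's actual layout code may differ.

```python
from typing import Dict, List, Tuple


def snake_layout(racks: Dict[str, List[str]], height: int) -> Dict[str, Tuple[int, int]]:
    """Assign each host a (column, row) cell, filling columns in a snake pattern.

    Racks are visited in sorted name order. Even columns fill top to bottom,
    odd columns bottom to top, so a rack that wraps at the edge of one column
    continues in the adjacent cell of the next column.
    """
    if height <= 0:
        raise ValueError("height must be positive")
    positions: Dict[str, Tuple[int, int]] = {}
    index = 0
    for rack in sorted(racks):
        for host in racks[rack]:
            column, offset = divmod(index, height)
            row = offset if column % 2 == 0 else height - 1 - offset
            positions[host] = (column, row)
            index += 1
    return positions


# Two hypothetical racks of unequal size on a grid four cells high.
layout = snake_layout({"r1": ["a", "b", "c"], "r2": ["d", "e", "f", "g"]}, height=4)
```

In this example rack `r2` starts in the last cell of the first column and continues from the bottom of the second column, so its hosts remain neighbours on screen even though the rack spans two columns.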

“Claspin allows us to visualise a ridiculous amount of information at once, in a way that makes it easy to spot problems and patterns,” Lynch said. “On a 30″ screen we could easily fit 10,000 hosts at the same time, with 30 or more stats contributing to their colour, updated in real time – usually in a matter of seconds or minutes.”

Open Source?

Although Claspin is geared towards Facebook’s own internal hardware configurations, the social network giant is reportedly considering releasing the tool as open source. It has open sourced other internal tools in the past.

Facebook also launched its Open Compute Project in April 2011, after it built a highly efficient data centre in Prineville, Oregon. The project first published its original Open Rack specification (version 0.5) in December 2011, and earlier this week it released the second version, dubbed the v1.0 Open Rack specification.
