Google App Engine, Bobby Tables, And #bugsthatuwontfind
Google App Engine went into extended downtime last Thursday. The Tweetosphere was up in arms, and there were a lot of pissed off people roaming the interwebs. Understandable. Downtime sucks for everyone, whether it’s your app, or you’re the user and you just need that damn duck jambalaya that you saved on that spiffy new app you discovered last week. The cause of the downtime? A single malformed file handle sent by a single client unintentionally triggered a bug in the GFS Master server that the Data Store depends on. More from the official Google post:
The root cause of the outage was a bug in the GFS Master server caused by another client in the data center sending it an improperly formed file handle which had not been safely sanitized on the server side, and thus caused a stack overflow on the Master when processed.
I’m oversimplifying. That single fault actually caused a cascade of unforeseen circumstances that took a not-insignificant amount of time for Google’s engineers to patch up. The details are in the extended downtime information that Google provided and I won’t go into them here.
What I find both comforting and disturbing about this catastrophe is that it uncovered a who-knows-how-old bug in GFS. The problem had been seen a week earlier, but given how critical the App Engine DataStore is, I have to think that if the engineers had known what caused it, it would have been patched sooner rather than later:
8:00 AM — The cause of the GFS Master failures has not yet been identified. However, a similar-looking issue that had been seen in a different data center the week prior had been resolved by an upgrade to a newer version of the GFS software.
It sounds to me like this bug was accidentally fixed, rather than discovered, recreated, and intentionally patched. It’s difficult, however, to do much more than guess at Google’s internal software upgrade schedule, so take all of this as speculation. Either way, it goes to show that you can never be too paranoid about what your users are doing on your platform. I’m reminded of the XKCD comic: Exploits of a Mom
![Exploits of a Mom [via XKCD]](http://imgs.xkcd.com/comics/exploits_of_a_mom.png)
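
For anyone who wants the punchline in code, here's roughly what "sanitize your inputs" looks like in the comic's scenario. This is only a minimal sketch, assuming Python's built-in `sqlite3` module; the `students` table and the `add_student` helper are made up for illustration, and none of it says anything about how GFS actually validates file handles. The idea is to hand user input to the database as a bound parameter instead of pasting it into the SQL string.

```python
import sqlite3


def add_student(conn, name):
    # Hypothetical helper: the "?" placeholder makes the driver treat
    # `name` strictly as data, never as part of the SQL statement.
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))


# In-memory database with an illustrative table (an assumption, not from the post).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

# Little Bobby Tables, as named in the comic.
bobby = "Robert'); DROP TABLE students;--"
add_student(conn, bobby)

# The table still exists and holds the odd-looking name as an ordinary string.
rows = conn.execute("SELECT name FROM students").fetchall()
print(rows)
```

The name goes in as one ugly string and the table survives. The same principle applies to any input crossing a trust boundary, file handles included: validate or bind it on the server side before anything tries to interpret it.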