I think another fun story about scalability is our usernames launch…So, this was a product launch like many at Facebook, that was quite controversial, as we talked about doing it. So we decided that, you know, kind of a lot of sites allow you to have a short name, and a lot of people have their association with Facebook as their personal identity, so it seems to make sense that instead of being facebook.com/uid, you know, blah blah blah, I could actually have a nice, easy-to-read username. So, this was all well and good, except that when most people do this, they have zero million users — and at the time, we had a little over 200 million users. So the problem was, how exactly do you get 200 million users to pick a username, and do this in a fair, and kind of balanced manner? So pretty much every way of allocating a scarce resource was debated, standard auctions like eBay, VCG auctions, for you auction nerds out there, which give you optimal results, but are confusing as hell to explain to someone what’s going on — we actually user-tested that one and people had no idea what they were doing — so, it’s like, telling them that it was the fairest way to allocate wasn’t going to make it any better — so we finally said, look, let’s just do this the simple way, everyone’s used to standing in line, first-come, first-served.
The problem with standing in line first-come first served, is that it basically means we’re asking 200 million people to show up at our website at exactly the same second on one evening. When that happens, most people call that something like a denial of service attack. For us, it’s like, a product launch. — Mike Schroepfer
Last Wednesday, Facebook brought their Seattle Engineering Roadshow to EMP. The event featured a presentation by Mike Schroepfer, Facebook’s Vice President of Engineering, on the challenges of scale.
Mike began by throwing out some impressive stats: Facebook boasts over 300 million active users, and serves around five billion external API calls per day. With the help of aggressive in-memory (memcached) and in-network (CDN-based) caching, Facebook serves 1.2 million photos/second to its users. The site is built entirely on open-source infrastructure — mostly LAMP, but a few non-PHP languages as well. (Facebook's chat server is written in Erlang.) The production database is a replicated group of MySQL machines.
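The in-memory caching Mike described follows the classic cache-aside pattern: check memcached first, and only fall back to the database on a miss. Here's a minimal Python sketch of that flow; plain dicts stand in for the memcached cluster and the MySQL database, and `fetch_photo_url` and its key scheme are hypothetical, not Facebook's actual code.

```python
cache = {}  # stands in for a memcached cluster
database = {"photo:42": "https://cdn.example.com/42.jpg"}  # stands in for MySQL

def fetch_photo_url(photo_id):
    key = f"photo:{photo_id}"
    url = cache.get(key)      # 1. try the in-memory cache first
    if url is None:
        url = database[key]   # 2. on a miss, fall back to the database
        cache[key] = url      # 3. populate the cache for the next reader
    return url
```

The payoff is that at Facebook's read volumes, the overwhelming majority of lookups are served from step 1 and never touch MySQL at all.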
For most sites, scaling up is a well-understood problem: simply add more hardware, and partition the workload among the hardware using load-balancers. When Facebook was introduced, each network (at the time, US universities) had its own separate instance of the service, with minimal connectivity between the instances. As the site evolved toward its current, more unified experience, figuring out how to load-balance the site was very tough. From a computer science perspective, partitioning queries onto Facebook’s core data structure — the friend graph — is not an easy problem.
Over time, the site has grown from a simple PHP web app, to a set of distributed services running across many machines. Facebook’s highly optimized services perform the meat-and-potatoes work of the site: friend graph intersection, profile lookups, etc. Rendering a single page might involve queries to dozens of these services.
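Friend-graph intersection — finding the friends two users share — is the kind of meat-and-potatoes operation those services expose. At its core it's a set intersection, as in this toy Python sketch; a real service would hold adjacency lists in memory and answer over the network, and the names and data here are purely illustrative.

```python
# Toy adjacency lists: each user maps to the set of their friends.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "carol", "erin"},
}

def mutual_friends(a, b):
    # Mutual friends are simply the intersection of the two friend sets.
    return friends[a] & friends[b]
```

A page like a profile view might issue dozens of such calls — mutual friends, profile lookups, and so on — across the service fleet before it can render.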
After his statistical tour de force, Mike touched on the importance of culture. Facebook encourages people to experiment and break things. Scaling is all about removing bottlenecks; when something gets in the way, a team of 2-3 engineers spends a few weeks inventing a solution to the problem. In addition to being the largest contributor to memcached, Facebook has open-sourced many of their in-house innovations, including Cassandra and Scribe.
Overall, I was extremely impressed with the talk and the organization Mike described. Throughout, Mike's intensely optimistic, can-do attitude shone through, without an iota of pretense or stuffiness. If you're looking to get a job in the Valley, definitely give these guys some thought.