Facebook Doesn’t Have Big Data. It Has Ginormous Data.

2/14/13

One thing that makes Facebook different from most other consumer Internet services is the vast scale of the data it must manage. Start with a billion-plus members, each with 140 friends, on average. Add some 240 billion photos, with 350 million more being uploaded every day. Mix in messages, status updates, check-ins, and targeted ads, and make sure every page is personalized and continuously updated for every user, and you’ve definitely got a first-class big data challenge.

But the data itself is not what makes Facebook (NASDAQ: FB) unique or successful. After all, there are organizations that own larger databases—think Google, Amazon, the CIA, the NSA, and the major telecom companies. But none of them can claim to keep customers fixated on their sites for an astounding 7 hours per month. To do that, you have to understand your users and what they want. And the true source of innovation at Facebook, as I’ve been learning lately, is the data it has about all the data it has.

Every move you make on Facebook leaves a digital trail. When you log in, log out, “like” a friend’s photo, click on an ad, visit the fan page for a band or a TV show, or try a new feature, Facebook takes note. It adds these behavioral tidbits to its activity logs, which take up hundreds of petabytes and are stored in giant back-end databases for analytics purposes. (To be specific, the logs live in giant custom-built data centers, on clusters of servers running the open-source Hadoop distributed computing framework.)
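
What does one of those behavioral tidbits actually look like? Facebook’s logging pipeline isn’t public, but the basic idea can be sketched in a few lines of Python: each action becomes a structured record appended to an ever-growing log, ready to be shipped off to the analytics clusters. (The field names and file below are hypothetical, not Facebook’s schema.)

    import json
    import time

    def log_event(log_file, user_id, action, target=None):
        """Append one structured activity record to an append-only log.
        (Illustrative only; the real pipeline and schema are not public.)"""
        record = {
            "ts": time.time(),    # when the action happened
            "user_id": user_id,   # who did it
            "action": action,     # e.g. "like", "click_ad", "login"
            "target": target,     # e.g. a photo or fan-page identifier
        }
        log_file.write(json.dumps(record) + "\n")

    # Example: record a user liking a friend's photo.
    with open("activity.log", "a") as f:
        log_event(f, user_id=12345, action="like", target="photo:98765")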

This back end is separate from, but just as important as, the front-end systems that store your personal data and generate Facebook’s public user interface. At least 1,000 of Facebook’s 4,600 employees use the back end every day, mainly to monitor and understand the results of the tens of thousands of tests that are being run on the site at all times.

Which leads to a larger point: there is no single “Facebook.”

As a product, Facebook is about as Protean as a non-mythological entity can get. It takes a constantly shifting form depending on what new features or designs Facebook’s engineers are trying out at any given hour, in any given geography around the world. “You and your friends are seeing subtly different Facebook pages, you just don’t know it,” says Santosh Janardhan, Facebook’s manager of database administration and storage systems.

Analytics, and the infrastructure that supports it, are the key to Facebook’s constant self-optimization. In conversations with Janardhan and other top Facebook engineers, I’ve been getting an introduction to the company’s analytics back end, which is arguably the most complex and sophisticated on the consumer Web. Indeed, if you want a glimpse of how other consumer-facing tech companies may be managing and exploiting big data in the future, it would be smart to look first to Facebook, which logs so much information on user behavior that it’s had to build its own storage hardware and data management software to handle it all.

Jay Parikh, Facebook's vice president of infrastructure engineering.

“Everything we do here is a big data problem,” says Jay Parikh, Facebook’s vice president of infrastructure engineering. “There is nothing here that is small. It’s either big or ginormous. Everything is at an order of magnitude where there is not a packaged solution that exists out there in the world, generally speaking.”

Unlike Google, which stays mostly mum about the details of its infrastructure, Facebook shares much of what it’s learning about managing its ginormous data stores. It has released many of its custom database and analytics tools to the open source community, and through the Open Compute Project, it’s even giving away its hardware tricks, sharing the specifications of its home-grown servers, storage devices, and data centers.

There’s probably an aspect of karma to this open source strategy; Facebook shares what it learns “in hopes of accelerating innovation across all these different levels of the stack,” Parikh says. Of course, the sharing also bolsters Facebook’s image as a cool place for engineers to work, even post-IPO. (Parikh’s team has been busy on the media circuit lately, appearing in a Wired magazine feature as well as this article.)

But whatever the company’s motivations for opening up, the world should be watching and learning. Facebook is the new definition of a data-driven company. How big data is actually used to shape a big business is a question still shrouded in mystery for most observers. At Facebook, an answer is emerging: it involves using detailed data on user behavior to guide product decisions, and—just as telling—building a lot of new software and hardware to store and handle that data. The engineers who oversee that process hold the keys to the company’s growth.

Making the Machine Do the Work

If Facebook were to show up on a cable-TV reality series, it would probably be Hoarders. The starting point of the company’s philosophy about analytics is “keep everything.”

That’s different from the historical norm in the analytics or business intelligence sectors. Because old or “offline” data is expensive to store and difficult to retrieve, IT departments generally either throw it out, archive it on tape, or filter and reduce it into data warehouses designed to speed up certain kinds of queries. If your CEO wants a regular report on sales by region, for example, you set up your whole data store around those statistics.

Facebook doesn’t think this way. “If you only expect certain types of questions to be asked, you only store that data,” says Sameet Agarwal, a director of engineering at Facebook who came to the company in 2011 after 16 years at Microsoft. And that practice, in turn, “tends to bias you toward the answers you were already expecting.”

Facebook’s approach “has always been to store all the data first, before you know what questions you want to ask, because that will lead you to more interesting questions,” Agarwal says.

Indeed, some of the activity data stored in Facebook’s back end dates all the way back to 2005, when the company was just two years old, Agarwal says. He says his team only deletes data when it is required to do so for security, privacy, or regulatory reasons.

But that commitment to storing everything comes with considerable headaches. To keep all of its analytics data accessible, Facebook operates several Hadoop clusters, each consisting of several thousand servers. The largest cluster stores more than 100 petabytes. (A petabyte is a thousand terabytes, or a million gigabytes.)

To support both its back end and its front end, the company has built or is building numerous data centers in far-flung locations, from Oregon to North Carolina to Sweden. According to the IPO registration documents it filed in 2012, Facebook spent $606 million on its data center infrastructure in 2011. Each new data center will have a capacity of roughly 3 exabytes (3,000 petabytes).

Facebook director of engineering Sameet Agarwal

The offline analytics clusters are the biggest of the big databases at Facebook, because they can’t easily be divided or “sharded” into geographic subsets the way the front-end databases can be, according to Agarwal. To answer the most interesting questions, “you want all the data for all the users in one place,” he explains.

Keeping these huge clusters running is the job of Parikh’s team. It might sound like a low-pressure task compared to maintaining the front end, but Agarwal says there’s little room for downtime, for at least two reasons. First, roughly a quarter of the people inside Facebook depend on the analytics platform to do their daily jobs. Second, the back end actually powers many of the features users see on the front end, such as “People You Might Know,” a list of friend suggestions based on an analysis of each Facebook user’s social network and personal history.
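
Facebook hasn’t published the algorithm behind “People You Might Know,” but a bare-bones version of the idea, ranking friends-of-friends by how many mutual friends they share with you, might look like this toy Python sketch (the graph and scoring are illustrative only):

    from collections import Counter

    def suggest_friends(user, friends_of):
        """Rank non-friends by mutual-friend count.
        friends_of maps each user to the set of their friends.
        (A toy sketch; the real feature weighs many more signals.)"""
        my_friends = friends_of[user]
        candidates = Counter()
        for friend in my_friends:
            for fof in friends_of[friend]:
                if fof != user and fof not in my_friends:
                    candidates[fof] += 1   # one more mutual friend
        return candidates.most_common()

    graph = {
        "alice": {"bob", "carol"},
        "bob":   {"alice", "dave"},
        "carol": {"alice", "dave", "erin"},
        "dave":  {"bob", "carol"},
        "erin":  {"carol"},
    }
    print(suggest_friends("alice", graph))  # [('dave', 2), ('erin', 1)]

The real feature weighs many more signals, of course; the point of the sketch is the shape of the computation, a sweep across the friendship graph rather than a lookup of one user’s row.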

“The longer this is down, the more your user experience will degrade over time,” says Janardhan. “The ads we serve might not be as relevant, the people and the suggestions for doing things that we serve might be less relevant, and so forth.”

To keep such large collections of machines running with close to 100 percent availability, Facebook turns to extreme automation. There are other ways to run a large company network, of course—most corporations with a distributed system of data centers would have a network operations center or NOC staffed by scores of people around the clock. Again, that’s not the Facebook way.

The heavy engineering bias at Facebook means there’s a culture of “impatience or intolerance for trivial work,” in Janardhan’s words. The first time a data center engineer gets awakened at 2:00 am to fix a faulty server or network switch, they might apply a temporary band-aid, Janardhan says. “But if they are woken up twice in a week for the same problem, they will ensure they fix it with some automation to ensure it never happens again.”

What does that mean in practice? For one thing, it means Facebook puts a lot of thought into handling server failures (which is a corollary of its decision to use cheap, lowest-common-denominator hardware in its data centers; more on that below). In Facebook’s Hadoop clusters, there are always three copies of every file. Copy A and Copy B usually live within a single rack (a rack consists of 20 to 40 servers, each housing 18 to 36 terabytes of data). Copy C always lives in another rack.

A mini-database called the NameNode keeps track of the locations of these files. If the rack holding A and B, or the switch controlling that rack, fails for any reason, the NameNode automatically reroutes incoming data requests to Copy C and creates new A and B copies on a third rack.
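
Here is a deliberately stripped-down Python sketch of that bookkeeping, a toy model rather than Hadoop’s real code, showing how a name node can fall back to the surviving replica when a rack disappears (the class, file names, and layout are invented for illustration):

    class ToyNameNode:
        """Toy model of replica tracking; real HDFS is far more involved."""
        def __init__(self):
            self.replicas = {}   # file -> list of (rack, server) locations

        def add_file(self, name, locations):
            self.replicas[name] = list(locations)

        def mark_rack_down(self, rack):
            # Drop replicas on the failed rack; a real system would also
            # schedule re-replication onto a healthy rack.
            for name, locs in self.replicas.items():
                self.replicas[name] = [l for l in locs if l[0] != rack]

        def locate(self, name):
            # Route the read to any surviving replica.
            locs = self.replicas.get(name, [])
            return locs[0] if locs else None

    nn = ToyNameNode()
    # Copies A and B on rack 1, copy C on rack 2.
    nn.add_file("block_00042", [("rack1", "srv07"), ("rack1", "srv09"), ("rack2", "srv31")])
    nn.mark_rack_down("rack1")
    print(nn.locate("block_00042"))   # ('rack2', 'srv31') -- reads keep flowing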

That’s all standard in Hadoop—but as you might guess from this explanation, the machine running the NameNode is the single point of failure in any Hadoop cluster. If it goes down, the whole cluster goes offline. To cover that contingency, Facebook invented yet another mechanism. It’s called AvatarNode, and it’s already been given back to the Hadoop community as an open-source tool. (Check out this post on Facebook’s “Under the Hood” engineering blog if you’re dying to know the details.)

The big idea at Facebook, Janardhan says, is to “make the machine do the work.” The company’s largest photo-storage cluster is “north of 100 petabytes,” he says. At most companies, it would take hundreds of people to maintain a database even half that size. Facebook has exactly five.

The Red Button or the Blue Button?

Facebook’s infrastructure engineers may obsess about storage, but they aren’t pack rats. The whole point of maintaining such an elaborate data back end is to allow continuous analysis and experimentation on the front end.

“There are literally tens of thousands of experiments running in production across our billion users at any given time,” says Parikh. “Some are subtle and some are very significant changes. One critical part of this testing is being able to measure the response.”

Thanks to off-the-shelf tools like Optimizely or Adobe’s Test&Target, virtually any company with a website can do what’s called A/B testing or multivariate testing. The idea is to create two or more variations of your website, split up the incoming traffic so that visitors see one or the other, then measure which one performs better in terms of click-through rates, purchases, or what have you. Big Web companies like Google A/B test absolutely everything; in one notorious case Marissa Mayer, now CEO of Yahoo, asked her team to test 41 shades of blue for the Google toolbar to see which one elicited the most clicks.
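
The mechanics are simple enough to sketch in a few lines of Python. This toy version (no particular vendor’s product) hashes each user into a variant so the split stays stable, then compares click-through rates at the end:

    import hashlib

    def assign_variant(user_id, variants=("A", "B")):
        """Deterministically bucket a user so they always see the same variant."""
        digest = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        return variants[digest % len(variants)]

    # Tally impressions and clicks per variant as traffic comes in.
    stats = {"A": {"shown": 0, "clicked": 0}, "B": {"shown": 0, "clicked": 0}}

    def record_impression(user_id, clicked):
        v = assign_variant(user_id)
        stats[v]["shown"] += 1
        stats[v]["clicked"] += int(clicked)

    def ctr(variant):
        """Click-through rate for one variant."""
        s = stats[variant]
        return s["clicked"] / s["shown"] if s["shown"] else 0.0

    # After enough traffic, compare ctr("A") against ctr("B") and ship the winner.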

Every Facebook user has been an unwitting participant in an A/B test at one time or another. In one small example from my own experience, I’m sometimes offered the ability to edit my status updates and comments after I post them, but other times the Edit option is missing. (I’d like to have the option all the time, but apparently the jury is still out on that one.) Facebook runs so many A/B tests that it has a system called Gatekeeper to make sure that simultaneous tests don’t “collide” and yield meaningless data, according to Andrew Bosworth, a director of engineering at the company. (He’s the guy who invented the News Feed.)
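
Facebook hasn’t described Gatekeeper’s internals in detail, but the collision problem itself is easy to illustrate: experiments that touch the same part of the product should draw their users from non-overlapping buckets, so nobody is exposed to two conflicting versions at once. A hypothetical sketch (the layer name, bucket ranges, and experiment names below are made up):

    import hashlib

    def bucket(user_id, layer, n_buckets=1000):
        """Hash a user into one of n_buckets within a named experiment layer."""
        digest = hashlib.sha1(f"{layer}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % n_buckets

    # Experiments that would collide share one layer, and their bucket
    # ranges never overlap: a user falls into one of them or neither.
    EXPERIMENTS = {
        "edit_button_test": range(0, 100),     # buckets 0-99
        "new_composer_ui":  range(100, 200),   # buckets 100-199
    }

    def active_experiment(user_id):
        b = bucket(user_id, layer="composer")
        for name, buckets in EXPERIMENTS.items():
            if b in buckets:
                return name
        return None   # user sees the control experience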

Sometimes the questions Facebook is testing are mundane—should a new button be red or blue? But other experiments are a great deal more complex, and are intended to suss out “what kinds of content, relationships, and advertisements are important to people,” Agarwal says. The company can break down its answers by country, by age group, and by cohort (how long they’ve been members of Facebook).

But the back end isn’t used purely to store data from active experiments. It’s also a rich mine for what medical clinicians might call retrospective studies. “If the amount of time somebody spent on the site changed from one month to another, or from one day to another, why?” says Agarwal. “What was the underlying reason for the change of behavior? At what step in the process did someone decide to pursue or not to pursue a new feature? That is the kind of deep understanding we can get at.”

The leaders of Facebook's Data Infrastructure team. Left to right: Santosh Janardhan, Sameet Agarwal, and Jay Parikh.

Sometimes, whether out of pure curiosity or for more practical reasons, Facebook’s data scientists even investigate social-science questions, such as whether people’s overall happiness levels correlate with the amount of time they spend on the site. (At least one outside researcher at Stanford has found that there may be a negative correlation: the more upbeat posts you see from your friends, the sadder you feel about your own life.) There’s no question that the back end is mainly designed to support rapid product experimentation. But it’s not solely about “what designs work better and what don’t,” in Agarwal’s words—it’s also about “Facebook as a new medium by which people communicate, and how our social behaviors and social norms are changing.”

Overall, Parikh is certain that Facebook’s ability to experiment, measure, re-optimize, and experiment again—what he calls “A/B testing on super steroids”—is one of its key competitive advantages. It’s “the long pole in the tent,” he says. “It’s critical for running our business.”

But there’s one more type of analysis that may be just as critical. Facebook has enough servers to populate a small country, and it’s constantly collecting data on their performance. By instrumenting every server, rack, switch, and node, and then analyzing the data, the company can identify slowdowns, choke points, “hot spots,” and “problems our users haven’t even reported yet,” says Parikh.

His team recently designed one Web-based tool called Scuba to make it easier to analyze statistics about Facebook’s internal systems, such as how long it’s taking for machines in various countries to serve up requested files. Another program called Claspin shows engineers heat maps representing the health of individual servers in a cluster.
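
Scuba and Claspin are internal tools, but the underlying idea, rolling up per-server measurements and flagging the outliers, can be sketched with a crude heuristic (the server names, numbers, and threshold below are made up):

    from statistics import median

    def find_hot_spots(latencies_ms, factor=3.0):
        """Flag servers whose average request latency is far above
        the cluster-wide median -- a rough 'hot spot' heuristic."""
        typical = median(latencies_ms.values())
        return {srv: ms for srv, ms in latencies_ms.items()
                if ms > factor * typical}

    cluster = {"web01": 42, "web02": 45, "web03": 44, "web04": 41, "web05": 310}
    print(find_hot_spots(cluster))   # {'web05': 310}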

That’s the kind of thing Facebook’s infrastructure team usually has to build for itself, because “there is nothing commercially available that can handle our scale,” Parikh says. “So, analytics is something we use not only for product insight but also operational insight.”

Big Data Meets Cheap, Open Hardware

Whatever money Facebook can save by building infrastructure on the cheap goes directly to its bottom line. That’s why the company has, over the past three years, adopted another kind of build-your-own philosophy, one embraced in the past only by the largest of Internet companies—i.e., Google and Amazon. By designing its own servers and storage devices and sending the specifications directly to custom manufacturers in Asia, Facebook can now avoid shelling out for name-brand computing hardware.

In the hopes of further lowering its big data costs, the company is now trying to kindle a wider industry movement around what it calls the Open Compute Project (OCP). Announced in early 2011 and recently spun off as a non-profit corporation, OCP is dedicated to spreading Facebook’s designs for servers, high-density storage devices, power supplies, rack mounting systems, and even whole data centers. The idea is to convince manufacturers and major buyers of data center equipment to adopt the specifications as a common standard, so that everyone will be able to mix and match hardware to meet their own needs, saving money in the process.

So far, companies like Intel, AMD, Dell, Arista, and Rackspace have lent their tentative support. Name-brand makers of servers, storage, and networking devices like Oracle, IBM, EMC, and Cisco have not, for obvious reasons; in a world of disaggregated, commodity data center components, they’ll have an even harder time charging a premium.

Which is exactly the point. In the past, “Either you were so big that you could afford to build this at your own scale, or you were at the mercy of the vendors,” Janardhan says. “Now that we are publishing these specs—even the circuit diagrams for some of the machines we have—you can go to an ODM [original design manufacturer] in Taiwan and get, in some cases, 80 percent off the sticker price.”

Janardhan says he’s been stunned so far by the reaction to the Open Compute Project. “At almost any hardware industry conference I have gone to, that is the only thing people want to talk about,” he says. In a conservative business where both vendors and customers have been slow to adopt new ideas such as open source software, seeing Facebook switch to generic, low-cost hardware has changed the conversation. “People say, ‘If Facebook can run it, why can’t we run it?’”

To Parikh, the Open Compute Project is a natural extension of Facebook’s immersion in the open source software community. “We started off many years ago doing this with things like Hadoop and Hive, but there are many other pieces of the infrastructure that we have open-sourced,” he says. (Hive is a distributed data warehouse system developed at Facebook and now overseen by the Apache Foundation; for more details on that and other tools that Facebook has contributed to the open source community, see our Facebook Big Data Glossary.)

At a January summit hosted by the Open Compute Project, Parikh said companies would have to work together to meet the challenges of big data—especially storing the 40 zettabytes, or 40,000 exabytes, of data expected to be generated worldwide by 2020. “I don’t think we are going to keep up if we don’t work together,” Parikh said.

Move Fast, Break Things

Facebook’s own ever-growing storage needs are never far from Parikh’s mind. Every month, Facebook must put another 7 petabytes toward photo storage alone, Parikh said at the OCP Summit. “The problem here is that we can’t lose any of those photos,” he said. “Users expect us to keep them for decades as they accumulate a lifetime of memories and experiences. So we can’t just put them on tape. ‘That Halloween picture from five years ago? We’ll send it to you in a week?’ That doesn’t work for us.”

Santosh Janardhan, Facebook’s manager of database administration and storage systems

At the same time, though, Facebook’s analytics data shows that 90 percent of the traffic a photo will ever get comes in the first four months after it’s been posted. Storing older, less frequently viewed photos indefinitely on the same servers with the newer, hotter photos is simply inefficient, Parikh says.

The company’s solution may be something Parikh calls “cold storage.” It would mean putting older photos into customized racks of hard drives optimized for high storage density and low power consumption, rather than quick retrieval. The less often the photos are needed, in other words, the less speedily they’ll appear. Eventually, Facebook will probably share the specs for its cold storage racks as part of the Open Compute Project.
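
The tiering decision itself is easy to picture. Here is an illustrative Python sketch with a cutoff loosely based on the four-month access pattern described above; the threshold and function are hypothetical, not Facebook’s actual policy:

    from datetime import datetime, timedelta

    HOT_WINDOW = timedelta(days=120)   # roughly four months, when traffic tails off

    def storage_tier(upload_date, now=None):
        """Decide where a photo should live: fast 'hot' storage while it is
        new and frequently viewed, dense low-power 'cold' storage afterward."""
        now = now or datetime.utcnow()
        return "hot" if now - upload_date < HOT_WINDOW else "cold"

    print(storage_tier(datetime(2012, 11, 1), now=datetime(2013, 2, 14)))   # hot
    print(storage_tier(datetime(2008, 10, 31), now=datetime(2013, 2, 14)))  # cold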

Unfortunately, no similar tradeoffs are feasible when it comes to the activity logs that are the centerpiece of the back end infrastructure. All of that data needs to be accessible fast, and over the years that has meant the back end outgrowing one data center after another, with each changeover necessitating a costly migration.

But in the last few months, Parikh’s team has been perfecting an improvement on Hadoop called Prism that could help sidestep that problem. The idea is to provide the illusion that the entire analytics back end is living in one data center, even if it’s distributed across two or more. It’s a no-more-band-aids moment. Prism will “allow people to do arbitrary analyses of arbitrarily large data sets, and prevent us from running out of capacity in a single data center,” explains Janardhan.
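
Prism’s internals haven’t been published, but the illusion it aims for, a single logical namespace spread across several physical clusters, can be sketched as a thin routing layer. The class, cluster names, and API below are invented for illustration:

    class FederatedStore:
        """Toy federation layer: presents several clusters as one namespace.
        (A sketch of the general idea only; Prism itself is not public.)"""
        def __init__(self, clusters):
            self.clusters = clusters    # cluster name -> dict standing in for a cluster
            self.directory = {}         # dataset name -> cluster name

        def put(self, dataset, rows, cluster_name):
            self.clusters[cluster_name][dataset] = rows
            self.directory[dataset] = cluster_name

        def get(self, dataset):
            # Callers never need to know which data center holds the data.
            return self.clusters[self.directory[dataset]][dataset]

    store = FederatedStore({"oregon": {}, "north_carolina": {}})
    store.put("activity_2012_q4", ["log rows..."], cluster_name="oregon")
    store.put("activity_2013_q1", ["log rows..."], cluster_name="north_carolina")
    print(store.get("activity_2012_q4"))   # fetched transparently from the Oregon cluster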

Facebook is one of the only big-data users that’s both building such solutions and talking about them in public. Prism means product managers at Facebook, and maybe other companies in the future, will get to ask even bigger questions and run even bigger experiments. “Every time we make [the infrastructure] faster by 10x we get 10x as much usage,” says Janardhan. “People find new things to do with the performance that we had not even thought of.”

One example: a political app within Facebook that allowed users, on Election Day last November, to say whether they’d voted. “You could see how many people were voting in what states and towns, and whether they were male or female, all the demographics, in real time, as soon as it happened,” says Janardhan. That’s the kind of information the TV networks pay exit pollsters good money for—but Facebook didn’t charge a cent. It was just a demonstration of Facebook’s big data chops.

Most organizations innovate more slowly as they get bigger. Thanks in large part to the work of the infrastructure team, Facebook hopes to move in the opposite direction. The company recently started pushing new releases of its front-end user interface twice a day, up from once a day before. “Our top priority, beyond keeping the site up and running and fast, is enabling our product teams to move at lightning speed,” says Parikh.

Sometimes that means breaking things. In a blog post last fall, Andrew Bosworth related a story about a tiny change to the chat interface, affecting the way users scroll through the list of friends available to chat, that led to a catastrophic 9 percent drop in the number of chats initiated. “In a system processing billions of chats per day, that’s a significant drop-off and a very bad product,” he wrote. But with the analytics data in hand, Bosworth’s team was able to fix the problem within days; the new version was 4 percent better than the original.

The key to moving fast, Parikh says, is to “make sure the right guardrails are in place…so when you make a mistake, you mitigate and protect yourself from the fault. We never claim to move fast and never make mistakes.”

Having the right data on hand, in other words, enables Facebook to take greater risks—but also helps it pull back when necessary. And that, in the end, may be the biggest reason for any business to care about big data.

Continue to: Big Data at Facebook—A Glossary

Wade Roush is a contributing editor at Xconomy.
