Things Fall Apart: Amazon’s Epic Cloud Failure Reveals Shortsightedness by Some Other Well-Known Tech Companies

As this week’s massive failure of Amazon Web Services cloud-computing infrastructure continued to roil the Web today, a few things were sadly clear. Perhaps most striking of all: Major service providers and websites—companies with enough money and talent to avoid the problem—didn’t spend nearly enough energy planning for the inevitability of a breakdown.

Yes, it’s probably asking too much for many small startups to double up their cloud-computing spending to prepare for what has been, up to now, a very rare outage from one of the biggest players in IT infrastructure. But what about the service providers that harvest money from that long tail of little companies?

Exhibit A is San Francisco-based Heroku, the hugely popular development and hosting platform that relies on Amazon’s service. When Amazon went down, Heroku went with it, taking along more startups than it would have if the middleman had hedged its bets more effectively. It’s possible that Heroku had plans to move in that direction eventually, but that obviously hasn’t happened fast enough to avoid a devastating outage.

That’s not to excuse Amazon’s meltdown or spotty communication with affected parties, which has magnified the problem. But remember that Heroku is no scrappy little startup—the company was purchased just a few months ago by Salesforce.com (NYSE: CRM) for more than $200 million.

“For someone like Heroku, which literally hundreds of startups use—Heroku should start thinking about, ‘OK, what can we do to spread the risk?'” says entrepreneur Shyam Subramanyan, whose San Francisco Bay Area startup list.ly was shut down by the outage. “A lot of people are paying them money.”

Scott Sanchez from CloudNod.com pointed to Los Gatos, CA-based Netflix as a good example of a large company making sure it had redundant protection. “They’re charging half the world $9.95 a month. It’s important for them to stay available, and they invested in the proper architecture. And they didn’t have to point any fingers,” Sanchez says.

On the other side of the coin were companies like popular content-aggregation website Reddit, which was still in a bare-bones mode Friday afternoon because of Amazon’s cloud problems. Reddit, based in San Francisco, sells ads and paid subscriptions, and is owned by media giant Conde Nast, which reportedly has been talking to investors about buying a stake at a potential $200 million valuation. “They can’t blame it on the money,” Sanchez says.

“People are quick to point fingers at Amazon or whoever their cloud provider is—‘They’re down, and they took us with them.’ You see that on Twitter, if you look, from dozens and dozens of brand-name sites,” Sanchez says. “At the end of the day, that’s true. You’re down because of Amazon. But in reality, your whole business relies on being up. And there’s 100 different ways that they could have avoided having to say, ‘We’re down because of Amazon.'”

Whenever Amazon gets its service back together, it’s a reasonable bet that alternate providers will see some new customers. At the very least, Amazon-dependent companies will probably switch to or add computing power in other areas of Amazon’s network.

But it’s unlikely that Amazon’s position in the cloud-computing sector will weaken in a major way—even companies that were crippled for hours by the outage were pretty deferential, like Palo Alto, CA-based Quora’s error-page message that said “We’d point fingers, but we wouldn’t be where we are today without EC2,” the shorthand for Amazon’s Elastic Compute Cloud.

As Palo Alto entrepreneur Semil Shah noted on Twitter, “Perhaps the company most vital to Silicon Valley startups isn’t even located here—it’s in Seattle.”

“In a way, it’s kind of amazing that people trust EC2 so much, and EC2’s record of being available for so long without any major failure,” Subramanyan says. “I think it’s actually pretty amazing that these companies have failed to diversify their vendors.”


  • Two things elevate this incident past a simple outage. One is the number of sites affected. Since so many sites use Amazon Web Services, a host of them were down, and many major high-traffic sites were down for hours, possibly days. Have you seen any estimates of the potential dollars lost because of the failure? It had to be many millions of dollars in lost revenue.

  • Here’s one you didn’t see, and it makes me uneasy to see anything in healthcare that has a tiny pulse getting funded. Is there enough due diligence by investors today, or are the startups disclosing enough information?

    As you can read from the forum here, an SOS was put out for help because Amazon was nowhere to be found and hundreds of home cardiac patients were not being monitored. A failure of this kind is a big deal. Amazon is the secondary problem; first of all, the company had no failover plan, which was mistake number one.

    http://ducknetweb.blogspot.com/2011/04/what-happens-when-cloud-server-goes.html

    Again, this leads me to the two questions above. Health services will grow and rely on cloud structures and servers, that’s a fact, so how did this happen here?

    As of yesterday I didn’t see any relief, and when a customer puts out an SOS complete with account numbers and so forth, we have a problem. I sent it off to the FDA, since they just somewhat relaxed their rules for Class 1 devices, which in essence are the devices bringing the data into the cloud. We don’t know here who the ultimate recipient of the information is: family, a doctor, medical records, or whatever. This is one to think about seriously.