Data Domain Founder, Kai Li, on EMC Acquisition and the Future of Data Storage

Now I know why venture capitalists walk the halls at the University of Washington—you never know who you might run into. My timing was impeccable yesterday as I sat down with Kai Li, the co-founder and chief scientist of Data Domain (NASDAQ: DDUP), the Santa Clara, CA-based data storage company that just got bought by EMC (NYSE: EMC) for $2.1 billion in cash.

Li, who is a computer science professor at Princeton University (he has been visiting the UW for the past year and has some strong Seattle connections), made time for me despite his busy schedule. The deal with EMC has been in the works since June 1, when the Hopkinton, MA-based data storage and management giant launched its bid to acquire Data Domain despite a pending acquisition attempt by rival NetApp (NASDAQ: NTAP) initiated in May. Many twists and turns ensued, culminating in yesterday’s announcement by NetApp that it had taken itself out of the running, clearing the way for EMC’s takeover, at a bid of $33.50 per share.

Data Domain’s story is a compelling one. Li co-founded the company in 2001, together with Brian Biles (currently vice president of product management) and Ben Zhu (former chief research officer), with the idea of developing advanced “deduplication” software to get rid of redundant data before it gets stored, thereby saving companies storage space, time, and money. Li served as chief technology officer and CEO in the early days of the company, but since 2002 has been a consulting chief scientist and director. Over the next few years, Data Domain gained traction in the data backup and disaster recovery market and went public in June 2007, raising more than $110 million in an IPO.

In a wide-ranging interview, Li talked about Data Domain’s technical approach, its market strategy, a little bit about the EMC deal, and the broader future of data storage. Here is an edited account:

Xconomy: So how does the EMC acquisition affect you?

Kai Li: I don’t know yet. EMC has been the leader in storage systems in general. They’re bigger than the other players in the storage market, compared with NetApp, IBM, HP, Dell, and Sun (now part of Oracle). EMC is the premier storage vendor for data centers. We haven’t been communicating with EMC because of the definitive agreement with NetApp, so I haven’t talked to EMC yet.

X: How does the deal affect Data Domain’s operations?

KL: EMC has written a letter to Data Domain employees. They said they’ll keep Data Domain as an independent business unit. I think it’s a good strategy.

X: Tell me about Data Domain’s key technology. Where did it come from?

KL: In IT, there are three big pieces interacting together in data centers. One is servers for information transformation. This is about computing—we can make faster and faster computers every year to help us transform information. Whether it’s mathematical models, simulations, or games, all of those involve transformation; we always need more compute power. The second piece is moving information—this is communication. This industry is quite big, led by Cisco. We need faster, safer ways to move information. The third thing is storage. We have to store and protect data. Our society is so related to data now. Almost all important information is digitized. Think of how we run businesses today. Even each person has lots of data generated from multiple devices and computers. Storing and protecting information are becoming very important.

The data growth rate has been really high, roughly on the curve of Moore’s Law. This poses a lot of issues. As the data keeps growing, we need to update our storage systems. We have to back up data to provide data protection to satisfy users’ requirements. So one thing I was thinking about in early 2001 was to address this issue. The key question is to figure out the most painful things people are doing in data centers.

X: So you got into disaster recovery for data centers?

KL: Yes. Probably the most painful thing we were dealing with was backups. Data centers had been using tape libraries for decades. Many things are bad about using tape. After writing data to tape, you may not be able to read it back in reliable form. And to satisfy requirements for disaster recovery, you need to ship data to a remote site. After California enacted legislation requiring a company that loses customer records to make that information public, we’ve seen a lot of news about banks and big organizations losing customer data while transporting tape offsite. They need to do this every day. Since humans are handling this, there’s a non-zero probability something will go wrong.

I started thinking about whether we could develop a new kind of product that can replace tape forever for data protection purposes. That’s when my co-founders and I invented deduplication storage systems. The system we invented is able to achieve lossless compression of roughly 20-to-1. If you can shrink the footprint of data by that much using a disk-based storage system, you can compete in price with tape library solutions. When the cost is roughly equal, the value propositions of fast and reliable recovery, and automatic disaster recovery, become very appealing to customers.

What separates Data Domain from many startups is how we invented the technology. We identified a very painful problem in data centers first, and invented technology to solve the problem, instead of inventing a technology and then looking for a market. Because of this, we are essentially executing the same business plan as when we formed, during the last downturn, one month after September 11 [2001]. That’s one of the main reasons Data Domain was able to lead the market, and that’s why deduplication is becoming such an important technology.

X: How does deduplication work, and how is it different from regular data compression (like WinZip)?

KL: Data compression has been used since the late ’70s. The main observation we had was that the previous “local compression” had been encoding data within a small window of bytes, such as 100 kilobytes. This method achieves roughly 2-to-1 compression [half the data] on average. Deduplication is fundamentally different. Instead of looking at a 100-kilobyte window, you make the window really large, as large as the entire storage system, or a network of storage systems. By doing so, we can reduce the data footprint by an order of magnitude. Then the challenge is how to keep track of the data segments, and how to find the duplicates at high speed.
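To make the "very large window" idea concrete, here is a minimal Python sketch of segment-level deduplication. It is an illustration only and not Data Domain's implementation: the fixed 4 KB segment size, the choice of SHA-256 as the fingerprint, and the names `DedupStore` and `make_dataset` are all assumptions made for this example. The store keeps one copy of each distinct segment no matter when or where it was written, which is what separates it from a compressor that only looks within a small local window.

```python
import hashlib

SEGMENT_SIZE = 4096  # hypothetical fixed segment size, chosen for illustration


class DedupStore:
    """Toy store that keeps each unique segment once, keyed by its fingerprint."""

    def __init__(self):
        self.segments = {}      # fingerprint -> segment bytes (the store-wide "window")
        self.logical_bytes = 0  # total bytes clients have written

    def write(self, data):
        """Store data and return its recipe: the ordered list of fingerprints."""
        recipe = []
        for start in range(0, len(data), SEGMENT_SIZE):
            segment = data[start:start + SEGMENT_SIZE]
            fingerprint = hashlib.sha256(segment).hexdigest()
            # Only a segment never seen anywhere in the store consumes new space.
            self.segments.setdefault(fingerprint, segment)
            recipe.append(fingerprint)
        self.logical_bytes += len(data)
        return recipe

    def read(self, recipe):
        """Rebuild the original bytes from a recipe (lossless)."""
        return b"".join(self.segments[fp] for fp in recipe)

    def stored_bytes(self):
        return sum(len(s) for s in self.segments.values())


def make_dataset(num_segments=256):
    """Build deterministic test data whose segments all differ from one another."""
    return b"".join(
        hashlib.sha256(str(i).encode()).digest() * (SEGMENT_SIZE // 32)
        for i in range(num_segments)
    )


store = DedupStore()
dataset = make_dataset()
# Twenty identical "full backups" of the same data set.
recipes = [store.write(dataset) for _ in range(20)]
ratio = store.logical_bytes / store.stored_bytes()
print(f"logical={store.logical_bytes} stored={store.stored_bytes()} ratio={ratio:.0f}-to-1")
```

The 20-to-1 figure this toy prints is an artifact of writing twenty identical backups; real ratios depend on how much the data changes between backups. The dictionary lookup also hides the hard part Li mentions, which is keeping track of segments and finding duplicates at high speed when the index no longer fits in memory.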

X: Why is this such a big deal for companies?

KL: This is a classical disruptive technology in IT. By disruption, I mean replacing the existing infrastructure, as opposed to incremental improvements. We disrupted tape libraries and disk-based storage for backup, near-online, and archival use cases. Because deduplication reduces the data footprint by an order of magnitude, it brings substantial value to large data centers.

Deduplication solved three problems. The first is to get rid of tape infrastructure for backups. When you tell data center customers you’ll replace backup libraries, that’s [an easy sell]. The second is to move data offsite easily. When you compress data, you can also move your data over a wide area network more easily. Especially for corporate intranets, the cost for bandwidth has not been reduced much in the past 10 years. If you use a T3 [communication] line to move uncompressed data, it’s not feasible. T3 moves about half a terabyte per day and costs about $72,000 a year. In the case of an Oracle database, you limit your database to half a terabyte if you want to do a full backup every day. If you translate that to dollars per gigabyte, it’s $300 per gigabyte for two years; that’s more expensive, by more than an order of magnitude, than primary storage. The situation gets worse since the number of hours in a day does not increase, while data volume keeps increasing. But with deduplication, it’s the same cost as moving tape by physical transportation. And the third problem it solves is to store near-online data, which is infrequently accessed but makes up the majority of data. For “nearline” data, we can provide customers with a very economical storage system. This is especially helpful during an economic downturn.
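The T3 figures above can be checked with a few lines of arithmetic. The sketch below assumes the nominal T3 line rate of 44.736 megabits per second, a line kept fully busy around the clock, and no protocol overhead; the $72,000 annual cost and the two-year horizon come from the interview. The function names are made up for this example.

```python
T3_MEGABITS_PER_SECOND = 44.736  # nominal T3 line rate (assumes zero protocol overhead)
SECONDS_PER_DAY = 24 * 60 * 60
ANNUAL_LINE_COST = 72_000        # dollars per year, figure quoted in the interview
YEARS = 2                        # horizon quoted in the interview


def t3_gigabytes_per_day():
    """Gigabytes a saturated T3 line can move in 24 hours."""
    bytes_per_second = T3_MEGABITS_PER_SECOND * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1_000_000_000


def dollars_per_gigabyte():
    """Line cost over the horizon per gigabyte of data fully backed up daily."""
    return ANNUAL_LINE_COST * YEARS / t3_gigabytes_per_day()


print(f"T3 capacity: about {t3_gigabytes_per_day():.0f} GB per day")
print(f"Cost: about ${dollars_per_gigabyte():.0f} per GB over {YEARS} years")
```

Under these assumptions the line moves a little under half a terabyte per day and the cost works out to just under $300 per gigabyte over two years, consistent with the numbers Li quotes.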

X: So what stage is deduplication at with big data centers? Is it becoming mainstream?

KL: Deduplication is still at a relatively early stage. If you look at the tape library market, it’s about $3 billion. The consensus is that data deduplication will become a multibillion-dollar market. Data Domain did $274 million in revenue last year. This year, the guidance is in the ballpark of $360 million. Data Domain is arguably the leader in the deduplication storage market. You can see the market will soon grow to a billion dollars a year.

X: But the field is crowded with competitors, including your new parent company.

KL: There are many players in deduplication storage. Their go-to-market strategies are different. Avamar, acquired by EMC [in 2006], was one of the early competitors. What they have been doing is applying deduplication technology to backup software. That’s what EMC has been selling to data center customers in remote office situations, where you reduce the amount of data you have to move from your source to a data center.

Symantec and CommVault recently introduced deduplication technology to their backup software products. Several players are making deduplication storage in a virtual tape library system. Their idea is to make a disk system look like tape; you can roll a new storage system in easily. Diligent [acquired by IBM last year] is one. Another player is Quantum. Their deduplication storage systems started shipping two or three years ago. Data Domain started selling products in 2003. We were the first company to sell a deduplication storage system. Meanwhile, HP has developed their own deduplication product for low-end, remote offices. And Dell is also planning to sell deduplication storage systems.

X: What is Data Domain’s—and now EMC’s—main competitive advantage in this space?

KL: Data Domain has been very customer focused; we make a product that’s very easy to use. And Data Domain’s technology has been superior to competitors’. One of the reasons our technology is better is that our software architecture was designed with parallelism built in from Day One. We were betting on multicore CPUs, rather than betting on many disks, many spindles, to achieve scalable throughput. As long as Intel and others keep making progress on CPUs, our technology can translate the increasing CPU power into increasing deduplication throughput. In the product line at Data Domain, you can see that. The current product runs 700 megabytes per second. The previous one, from the year before, runs 350 megabytes per second, and so on. It’s essentially on a Moore’s Law curve.

X: So, faster, cheaper, and more efficient storage and backup. How will this affect the data storage industry more broadly?

KL: Deduplication is going to reshape the storage industry. If you look at storage media, we currently see a hierarchy, where the bottom is probably tape. With deduplication, tape use will be substantially reduced; maybe in time it will disappear. The next one is high-density disks such as SATA disks, which are 1.5 terabytes for $200. Then, Fibre Channel disks: they spin at high rotations per minute and don’t have a lot of density, but you can use the disks to run a higher number of transactions per second with a database system. Then, solid state disks. Between those four kinds of storage media, the cost factor is 3-5 between each level.

High-density magnetic disks will stay because they’re inexpensive, and we have a lot of data to put there. But if deduplication storage technology can be applied to solid state disks, and compress the data by a factor of 3-5, we can reduce the cost to that of Fibre Channel disks. When this happens, Fibre Channel disks may disappear. So, deduplication is impacting the storage community in multiple directions. It may be going into primary storage systems. But a lot of work needs to be done.
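The cost argument can be sketched numerically. The only hard number in the interview is the SATA price ($200 for 1.5 terabytes); the sketch below assumes, purely for illustration, that the same multiplier applies between SATA and Fibre Channel and between Fibre Channel and solid state, and that deduplication on solid state achieves a reduction equal to that multiplier. The function names are hypothetical.

```python
SATA_DOLLARS_PER_GB = 200 / 1500  # 1.5 TB for $200, as quoted in the interview


def tier_costs(factor):
    """Raw dollars per GB for each tier, assuming one uniform factor between levels."""
    fibre_channel = SATA_DOLLARS_PER_GB * factor
    solid_state = fibre_channel * factor
    return {"sata": SATA_DOLLARS_PER_GB, "fibre_channel": fibre_channel, "solid_state": solid_state}


def effective_cost(raw_dollars_per_gb, reduction_ratio):
    """Dollars per GB of logical data once the footprint shrinks by reduction_ratio."""
    return raw_dollars_per_gb / reduction_ratio


for factor in (3, 4, 5):  # the 3-5 range quoted in the interview
    costs = tier_costs(factor)
    deduped_ssd = effective_cost(costs["solid_state"], factor)
    print(
        f"factor {factor}: Fibre Channel ${costs['fibre_channel']:.2f}/GB, "
        f"solid state ${costs['solid_state']:.2f}/GB raw, "
        f"${deduped_ssd:.2f}/GB after {factor}-to-1 reduction"
    )
```

In each case the deduplicated solid state cost lands on the raw Fibre Channel cost, which is the crossover Li describes. Actual prices and reduction ratios would of course vary by product and workload.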

X: What is Data Domain’s biggest challenge going forward?

KL: The general question is how to apply deduplication to other use cases. Currently, the primary use case has been backup and disaster recovery. Data Domain has moved into nearline and archival storage use cases. There’s still a lot of work to be done, though. How to attack those markets, and others, is the general question. The value proposition is very clear. It’s essentially translating computing power into storage and bandwidth reduction. Computing power is getting cheaper and better. Can we translate that into less storage, and less network storage needed? We are on the roadmap to do better, year after year.

Gregory T. Huang is Xconomy's Deputy Editor, National IT Editor, and Editor of Xconomy Boston.

  • Bill Ghormley

    Thanks Greg — tape backup has always been an achilles heel — this is very promising technology — and its clear that EMC needed to be a primary player in this space. BG