Microsoft Rolls Out Tools to Help Scientists (and Eventually Companies) Manage Data Deluge
From the seas to the stars, Microsoft Research is trying to increase its impact. The Redmond, WA-based computer science research organization is releasing new software tools aimed at helping scientists manage and visualize huge amounts of information, and make discoveries in fields as diverse as astronomy and oceanography. The announcement of the free tools, called Project Trident, is being made today at the 10th annual Microsoft Research Faculty Summit in Redmond.
Everyone knows information overload is a huge issue. Just try being a scientist these days. With increasing amounts of data available from the Internet, satellites, telescopes, cameras, gene sequencers, and networked sensors, researchers—and organizations in general—are looking for ways to cut through the deluge and focus faster on doing the analysis and getting results, rather than sorting through data.
It’s also a problem faced by big companies, financial analysts, and medical institutions. So, ultimately, Project Trident is not aimed at spearing purely scientific research problems—it’s software that also could yield big results for business down the road. “If we look back at the challenges faced in business, scientists were facing them years if not decades before,” says Roger Barga, a Microsoft researcher and principal architect on Project Trident. “We’re getting an early look at what our business customers will expect in their products in 3-5 years. It’s pushing another Microsoft [Windows] platform into new areas.”
Project Trident started around 2006, when Barga began collaborating with legendary Microsoft researcher Jim Gray (who was lost at sea in January 2007) on tools to help oceanographers make sense of volumes of data on things like temperature, salinity, and the physics of seafloor hydrothermal vents. “There’s a clear understanding of the science and how to put instruments in the ocean, but there’s a gap in how to convert data streaming in from the ocean to useful analysis,” Barga says. “Jim had this vision of an oceanographer’s workbench. So we said, ‘Let’s get involved in this. Let’s do a proof of concept.’”
Barga got a couple of interns from the University of Washington to work on the project, and by the summer of 2007, they had a working demo. The idea of the software is to help people manage the workflow between data collection and analysis—coordinating a sequence of steps to be taken with the data. It’s not a revolutionary algorithm, but it’s a way to break the process into manageable chunks that can be reused and recombined, so you don’t have to start from scratch or hire a programmer every time you want to manipulate your data in a new way. The tools are built on top of Microsoft’s Windows Workflow Foundation (making use of Microsoft SQL Server and Windows HPC Server cluster technologies), and they include advanced gaming graphics tools to display what’s going on in your data.
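To make the idea of reusable, recombinable steps concrete, here is a minimal sketch in Python. It is not Project Trident code (Trident is built on Windows Workflow Foundation), and every name in it, along with the sample readings and temperature bounds, is a hypothetical stand-in chosen for illustration.

```python
from typing import Callable, List

# A workflow step is any function that takes a list of readings
# and returns a new list of readings.
Step = Callable[[List[float]], List[float]]


def drop_outliers(readings: List[float]) -> List[float]:
    """Keep readings inside a plausible range (hypothetical bounds, in deg C)."""
    return [r for r in readings if -2.0 <= r <= 40.0]


def moving_average(readings: List[float]) -> List[float]:
    """Smooth the series with a 3-point moving average."""
    window = 3
    return [
        sum(readings[i:i + window]) / window
        for i in range(len(readings) - window + 1)
    ]


def compose(*steps: Step) -> Step:
    """Chain steps into one workflow that runs them in order."""
    def run(data: List[float]) -> List[float]:
        for step in steps:
            data = step(data)
        return data
    return run


# Hypothetical seafloor temperature readings with one sensor glitch.
raw_temperatures = [4.0, 4.2, 99.9, 4.4, 4.6]

# The same steps can be recombined into different workflows.
clean_and_smooth = compose(drop_outliers, moving_average)
clean_only = compose(drop_outliers)

smoothed = clean_and_smooth(raw_temperatures)
cleaned = clean_only(raw_temperatures)
```

Because each step has the same shape (a list in, a list out), a researcher could swap in a new step or reorder existing ones without rewriting the rest of the workflow, which is the kind of reuse the article describes.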
The reception in academia has been very positive, says Barga, a 13-year Microsoft veteran whose group now totals seven people. All told, Microsoft has put well over $1 million into Project Trident, counting the researchers’ time. The next step is to get more scientists to use the tools, and to share their work. Currently, ocean researchers from the UW, Monterey Bay Aquarium Research Institute, and other institutions are using Trident tools as part of the Neptune oceanographic project for networking the seafloor, funded by the National Science Foundation. And astronomers at Johns Hopkins University are using the software as part of their Pan-STARRS project to detect objects in the solar system that could pose a threat to Earth.
There are plenty of other efforts to build scientific data management tools, of course. Some examples are Taverna, a UK-based project specialized for bioinformatics, and California-based workflow software projects Kepler and Pegasus, developed by academics. “We collaborate with them,” says Barga. “We wanted to show you don’t have to build these systems from the ground up.”
Barga says other organizations are getting interested in all this too. “We have medical research groups and financial analyst groups talking to us about it,” he says. But many challenges remain when it comes to dealing with data. For example, Barga says, some tasks are just too big for the computer on your desk to handle. You might need to manipulate a whole data center, say. That’s why Microsoft is also announcing a new programming system, called Dryad, which is specialized for doing high-performance computing across parallel and distributed systems. It could come in handy for large-scale studies that involve searching, filtering, and aggregating data on topics like social networks or broad economic trends.
As researchers and businesses think increasingly globally when it comes to data, you can bet there will be a big role for companies like Microsoft in providing the key tools of the trade. “Our ability to collect data will outpace our ability to analyze it,” Barga says. “Computer science is going to be driven and challenged to visualize and analyze [more] data. It’s a big enabler.”