Genomic Advances of the 2000s Will Demand an Informatics Revolution in the 2010s
We have witnessed some of the most striking technological and scientific innovations in human history during the first decade of the new millennium. While such claims may seem cliché in an age where the media constantly report on new findings that do not warrant our full attention, several discoveries and innovations in the recent history of genomics were truly groundbreaking and will have long-lasting implications.
Over the coming decades, it will become clear how the expanding applications of genomic technology can help us better understand the causes and treatments of common human diseases, as well as global warming and hunger. The innovations most impressive to me in the past decade were those that have begun to shake many of the foundations upon which the life sciences and biomedical research have been built. Here are the four I consider most impressive:
1) The discovery that environmental stress can induce heritable DNA-based changes.
2) The maturation of highly parallel sequencing and genotyping technologies that have revolutionized our ability to associate changes in DNA with disease.
3) The discovery of whole new classes of RNA that do not carry out instructions from genes, yet are still critical to cellular and higher order biological processes.
4) The development of third-generation DNA sequencing that will lead to greater insights about underlying biology.
Our ability to capture data from entire genomes is increasing exponentially, and that is creating a huge software and computing challenge. Life sciences and biomedical researchers will need novel solutions, which amount to a fifth innovation that is yet to come:
5) The translation of the deluge of data coming from the new discoveries and technologies into actionable results that can impact human wellbeing.
This will be a big trend to watch in the coming decade, but more on that later. First, I want to explain a little about why I’m singling out these four particular discoveries and technologies as groundbreaking:
1. Environmental stresses can induce heritable DNA-based changes.
In 2005 Michael Skinner, a professor at Washington State University, published a paper in Science demonstrating that, in response to exposure to an endocrine disruptor (a common environmental toxin), DNA can be chemically modified in certain locations. These modifications can affect the ability of the biological machinery within cells in every bodily organ to read the modified DNA. Reading DNA is a necessary first step for cells to manufacture the proteins needed to drive normal biological processes. Chemical modifications of DNA induced by environmental toxins have been shown to influence many of the common human diseases that are of significant public health concern today, such as type 2 diabetes and cancer.
While this finding on its own was not so surprising, the astonishing observation was that these chemical modifications to DNA can be transmitted to subsequent generations, even after exposure to the agents inducing the changes was stopped. Skinner later demonstrated that these types of environmentally induced changes could affect fundamental behaviors like mate selection, pointing to a potentially more rapid evolutionary selection mechanism that does not require mutations in the actual DNA sequence.
A recent related discovery published in Nature by Decode Genetics, an Icelandic company that has helped lead the way in establishing how changes in DNA associate with disease, demonstrated that mutations in the sequence of DNA that are inherited from, say, a mother, can have very different consequences relating to disease risk and progression than the very same mutations inherited from the father.
2. Highly parallel sequencing and genotyping technologies have revolutionized our ability to associate changes in DNA with disease.
The maturation of second-generation, highly parallel DNA sequencing and genotyping technologies, along with the completion of the sequencing of the human genome, has enabled an astonishing wave of discoveries about how the specific forms of DNA inherited from our parents can cause disease or differences in our response to treatments. While hundreds of examples of rare, single-gene mutations in our DNA that cause disease have been discovered over the past 30+ years, finding common changes in DNA that affect our risk of disease turned out to be incredibly difficult. Before this past decade, only a handful of examples of genetic risk factors existed for common human diseases. However, technologies able to fully characterize all of the common DNA variation in the human genome at lower cost have dramatically increased the number of causal genes identified. Scientists have now catalogued nearly a thousand genes in which common DNA changes affect the population risk of more than one hundred different disease-associated phenotypes, including those associated with type 2 diabetes, heart disease, multiple different types of cancer, arthritis, Crohn’s disease, schizophrenia, and Alzheimer’s disease, as well as other human traits like height, eye color, and hair color.
While this wave of discovery has been truly impressive, few of the DNA changes were found to directly affect the function of proteins implicated in diseases like Alzheimer’s. Instead, most changes in DNA associated with common human diseases appear to affect the rate at which genes are transcribed into RNA and then translated into proteins. Further, although these DNA variations were associated with disease, they turned out to explain very little of the overall disease variation in the human population. This has prompted a new search in the life sciences for the “missing heritability” of human disease. Given the low percentage of variation explained by common, simple variations in DNA, the hunt is on for other types of variation (including the environmentally induced changes mentioned above) that had not been thought to play a key role in disease, but that may turn out to be among its significant explanations.
3. Whole new classes of RNA discovered to be critical to cellular and higher order biological processes.
Emerging from recent genetics research is a greater appreciation that in order to understand and treat disease, we will need to fully characterize the role that whole new classes of non-coding RNA discovered over the last 10 years play in biological processes. While non-coding RNAs like ribosomal RNA were discovered long ago and shown to be responsible for translating protein-coding RNAs into protein, completely new classes of non-coding RNA have been discovered that are widespread and have been shown to have regulatory roles for entire networks of genes associated with disease. In fact, one particular class of non-coding RNA known as microRNA has not only been well demonstrated to affect processes that cause disease, but is now being pursued as a way to treat it as well. Despite hundreds of thousands of copies of some microRNAs existing in our cells, it was not until this past decade that we discovered these molecules and their effect on critical biological processes.
4. Third-generation DNA sequencing will enable greater insights about underlying biology.
Technologies brought to market in the last decade have enabled amazing discoveries, but they have also shed light on how much we still don’t know and need to learn in order to develop more effective strategies for preventing and treating disease. To truly improve patient care, scientists need access to fast, accurate and comprehensive snapshots of the underlying biology of living systems. One of the more impressive technologies developed this past decade toward this end was single-molecule, real-time (SMRT) sequencing.
SMRT sequencing was invented by a group of scientists at Cornell University and is now being developed and commercialized by Pacific Biosciences (a biotechnology company formed by Stephen Turner and some of his colleagues from Cornell University, which I joined this year as chief scientific officer). The technology uses waveguides operating below their cutoff (zero-mode waveguides) to directly observe the activity of DNA polymerase as it sequences DNA. This advance enables the observation of nature’s own amazing sequencing engine as it very rapidly sequences DNA. Observing DNA polymerase at work stands in contrast to the heavily engineered second-generation systems, which have relied on brute-force approaches to sequencing rather than nature’s own highly evolved and efficient approach.
Over the next decade, SMRT sequencing will enable sequencing of an individual’s complete DNA sequence very quickly and at low cost. For example, current technologies take roughly one hour to sequence a single letter from a fragment of DNA, whereas SMRT sequencing can sequence roughly 20,000 letters of the fragment in the same period of time. The system has been designed to observe many of these DNA polymerase molecules at the same time, sequencing many fragments simultaneously, which will ultimately enable the observation of hundreds of gigabases of DNA per hour. This unprecedented level of speed and efficiency in genome sequencing is expected to finally make personalized medicine a reality.
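To make the throughput figures above concrete, here is a back-of-envelope sketch in Python. The 20,000 letters per hour comes from the text. Everything else is an assumption made purely for illustration: the 100 gigabases per hour target is one reading of "hundreds of gigabases," the 3.2 billion letter genome size is an approximate figure for the human genome, and the 30-fold coverage is a hypothetical choice. The function names are likewise invented for this example and do not describe any vendor's software.

```python
# Back-of-envelope throughput arithmetic using the figures quoted in the text.
BASES_PER_HOUR_PER_POLYMERASE = 20_000  # from the text: ~20,000 letters per hour
TARGET_BASES_PER_HOUR = 100e9           # assumption: low end of "hundreds of gigabases"
HUMAN_GENOME_BASES = 3.2e9              # assumption: approximate human genome size
COVERAGE = 30                           # assumption: hypothetical number of times each letter is read


def polymerases_needed(target_rate, per_molecule_rate):
    """Polymerases that must be observed in parallel to reach target_rate."""
    return target_rate / per_molecule_rate


def hours_for_genome(genome_bases, coverage, total_rate):
    """Hours of sequencing needed to read a genome `coverage` times over."""
    return genome_bases * coverage / total_rate


if __name__ == "__main__":
    n = polymerases_needed(TARGET_BASES_PER_HOUR, BASES_PER_HOUR_PER_POLYMERASE)
    t = hours_for_genome(HUMAN_GENOME_BASES, COVERAGE, TARGET_BASES_PER_HOUR)
    print(f"Polymerases observed in parallel: {n:,.0f}")
    print(f"Hours for one genome at {COVERAGE}x coverage: {t:.2f}")
```

Under these assumptions, reaching 100 gigabases per hour would require watching about five million polymerases at once, and a single genome read 30 times over would take on the order of an hour. The point is only the scale of parallelism involved, not a prediction about any particular instrument.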
5. Needed: Informatics innovation to translate the data deluge.
Third-generation technologies will enable sequencing every individual in large populations, and that will create unprecedented amounts of data, rivaling all other areas of science in quantity and complexity. So the real challenge in the next decade will be informatics-based. How will petabytes of complex data be managed and integrated so that predictive models of disease can be constructed and routinely applied? While companies like Google routinely work with petabyte-scale data sets, the problem they have solved is far simpler than understanding how all DNA variations, RNA levels and isoforms, metabolites, and proteins interrelate across all of the different environments that give rise to life.
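A rough sketch of why population-scale sequencing lands in petabyte territory follows. All of the numbers are assumptions chosen for illustration, not figures from any sequencing project: an approximate genome size of 3.2 billion letters, a hypothetical 30-fold coverage, and the most compact possible encoding of two bits per letter with no quality scores or metadata. Real data sets would be considerably larger.

```python
# Rough storage estimate for sequencing a population (illustrative assumptions only).
BASES_PER_GENOME = 3.2e9  # assumption: approximate human genome size
BITS_PER_BASE = 2         # assumption: A/C/G/T packed into 2 bits, nothing else stored
COVERAGE = 30             # assumption: hypothetical number of times each letter is read
BYTES_PER_PETABYTE = 1e15


def raw_read_petabytes(individuals, coverage=COVERAGE):
    """Petabytes needed to store only the raw letters read for a population."""
    total_bits = individuals * BASES_PER_GENOME * coverage * BITS_PER_BASE
    return total_bits / 8 / BYTES_PER_PETABYTE


if __name__ == "__main__":
    for people in (1_000, 100_000, 1_000_000):
        print(f"{people:>9,} individuals: {raw_read_petabytes(people):.3f} PB")
```

Even with this deliberately optimistic encoding, a million individuals works out to tens of petabytes of DNA sequence alone, before adding the RNA, metabolite, and protein measurements that would need to be integrated with it.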
Only by marrying information technology to the life sciences and biotechnology will we realize the astonishing potential of the vast amounts of biological data we will be capable of generating. Such data, if properly integrated and analyzed, will enable personalized medicine strategies that help every one of us make better choices about not only how we treat disease, but how we prevent it altogether.
[Editor's Note: This is part of a series of posts from Xconomists and other technology leaders from around the country who are weighing in with the top innovations they've seen in their respective fields the past 10 years, or the top disruptive technologies that will impact the next decade.]