One Giant Leap for Human Genomics Science and Business
[Editor's Note: This post was co-authored by Becky Drees, Mark Minie, and Richard Gayle.]
Several months back, Spiral Genetics CEO Adina Mangubat lamented the difficulty of getting actionable information from now-abundant human DNA sequence data in an Xconomy post (“First Comes The $1,000 genome, Then Comes The $10,000 Analysis”). With the simultaneous publication of over 30 research papers and the activation of a novel, publicly available web publication and analysis tool by the National Human Genome Research Institute’s (NHGRI’s) Encyclopedia of DNA Elements (ENCODE) Consortium this month, the nature of the game has changed.
The Space Age could be said to have started in 1913 when Robert Goddard described the first multistage rocket. It would take almost 45 years before the first manmade object, Sputnik, orbited the Earth in 1957. A little over a decade later, on July 20, 1969, mankind landed on the moon, part of a flurry of research activity that continues to this day, driving mankind further and further into the reaches of space. We are seeing a similar historic arc of progress in biology, from 1953, when Watson and Crick discovered the structure of DNA, to the initial draft of the Human Genome Project in 2000, to the new findings about DNA functionality published this year by the ENCODE Consortium. We now stand on the threshold of scientific exploration that may very well be as transformative for mankind as the first steps on the moon. And this wave of understanding will have implications far beyond the halls of academia: the new understanding of human genomics will pave the way for many useful applications in biology-based businesses.
We will discuss some immediate impacts of the ENCODE Consortium: a better definition of functional DNA; new approaches toward examining the human genome; and insights into some novel management practices used by the ENCODE Consortium.
The biochemical and computational analysis done by the ENCODE Consortium changes our view of the genomic universe and resolves the most serious roadblock to understanding human genomics by formally re-defining the gene of a multi-cellular organism as a simple, easily studied biomolecular unit—the RNA transcript of the DNA sequence.
The classic definition – that genes are “heritable units” – provides very little insight into the question of DNA function in the genome. Over the years, many attempts have been made to clarify what we mean by a gene and what we mean by function.
For example, take the vast amounts of DNA that are transcribed into RNA but removed by splicing before any protein is made. Are these parts of the genome “functional?” Are they part of the gene if they are not part of the gene’s ultimate products, the proteins, which are responsible for most (but not all) of the structural features and enzymatic functions of the cell?
The new ENCODE Consortium data and research approach eliminate this problem by formally defining the genes of higher cells (as opposed to bacterial cells) as the RNA transcript of the DNA sequence and the regions controlling that transcription.
By doing so, the ENCODE Consortium has folded such complexities as alternative splicing, RNA editing, the fact that only 1.5 percent of human DNA encodes proteins, the dominance of non-coding RNAs, and epigenetics into a comprehensible and usable model. It also heightens the functional importance of RNA, perhaps even over that of protein.
From thousands of genome-scale data sets, we now see millions of distinct features: RNA transcripts, transcription factor binding sites, and other functional elements. The structure of the “3-D code,” also known as the “epigenetic code” that specifies how 2 meters of DNA is crammed into a nucleus only a few microns wide, is revealed. The majority of DNA sequences—quite possibly an astounding 80 percent of the genome (!)—can now be linked to a molecular function thanks to the new ENCODE Consortium synthesis. Scientists have argued for years about the functional importance of the non-coding regions of the genome – the so-called “junk DNA.” Whether the molecular function identified by the ENCODE Consortium is directly important, indirectly important or serves no purpose at all to the cell will be the focus of years of research.
The novelty and power of these important new approaches are described beautifully and clearly by one of the ENCODE Consortium’s leading researchers, Dr. John A. Stamatoyannopoulos, from the Departments of Genome Sciences and Medicine, University of Washington School of Medicine, in an open access paper published along with the others. These approaches in turn make it possible to explore the basic biology of the human genome rapidly and relatively cheaply, and to translate this new information into novel tools for biology-based businesses.
The ENCODE Consortium project will change the scope of DNA sequence analysis in research, medicine, and new areas of bioscience. To get actionable information from an individual sequence, we must first accurately detect sequence variation.
Then come the big questions: Which DNA variants have functional effects in the cell? Which are linked to disease or disease risk? How do we connect a change in DNA sequence to an effect on biochemical function, such as a misfolded protein that leads to cystic fibrosis? To be effective, we must be able to assess the cumulative impact of both coding and non-coding variants across the genome.
Complete genome sequencing is rapidly replacing the protein-focused exome sequencing now widely used in medical research, while quickly moving into clinical practice as a diagnostic tool for cancer and heritable disorders.
Exome sequencing, which targets only the protein-coding regions, is currently a favored approach. It is less expensive than sequencing the whole genome, and our ignorance of genome function makes it difficult, if not impossible, to assess the impact of non-coding variants.
The ENCODE Consortium’s annotation of functional elements will make exome sequencing largely a thing of the past. It has created a foundational dataset for assessment and interpretation of sequence variation in the majority of the genome, which goes far beyond just the protein coding regions. These new data are one of several factors that will tip the balance from exome sequencing toward whole genome sequencing. The rest of the genome—the non-coding majority—is full of functional sequence elements: binding sites for regulatory proteins, genes for functional RNAs, and organizational elements that “open” and “close” large regions of the genome.
It does not make sense to ignore non-coding sequences anymore, especially as the cost of whole genome sequencing drops and as instruments improve in their ability to process large volumes of DNA, RNA and proteins. As a consequence, the “data deluge” gets bigger, a challenge and an opportunity for bioinformatics. We will keep pace with fast analysis software run on cloud computing platforms to analyze whole genome datasets quickly and cost-effectively.
Most importantly, we have taken a giant step towards actionable interpretation of human genome variation. Until now, newly discovered variants have been evaluated primarily by their predicted protein-coding effects, relying on annotation of protein-coding genes and amino acid substitution models.
The ENCODE Consortium data reveal previously hidden connections between DNA sequence and biochemical function that will be invaluable in evaluating functional variant effects and links to disease. This is powerfully demonstrated by a study published in a recent issue of Science that re-evaluated non-coding, disease-associated sequence variants identified in Genome-Wide Association Studies (GWAS) in light of the ENCODE Consortium’s new information. The study found new leads to the biological mechanisms affecting disease risks, which is particularly relevant to ongoing research into Crohn’s disease and multiple sclerosis. The new ENCODE Consortium synthesis also opens up important new non-medical applications, as shown in a recently published PLoS Genetics paper demonstrating that GWAS data could potentially be used to predict facial features and eye and hair color from DNA sequences collected at crime scenes, for use in forensics.
There are big opportunities in this dataset for software companies and high-performance computing operations. The ENCODE Consortium mapped thousands of new biochemical functions onto the genome. As a result, we can more completely annotate genome sequences and develop fast, cloud-based analysis algorithms trained on high-confidence data for more accurate, actionable interpretation of individual genomes.
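To make the idea of annotating a genome against functional elements concrete, here is a minimal sketch in Python. It is not the ENCODE Consortium's software or data format: the element list, the variant list, the labels, and the function names `build_index` and `annotate` are all invented for illustration, and the coordinates are made up. It simply asks, for each variant, which annotated intervals contain it.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical annotation intervals: (chromosome, start, end, label).
# Coordinates are 0-based, half-open, and invented for illustration.
ELEMENTS = [
    ("chr1", 100, 200, "protein_coding_exon"),
    ("chr1", 500, 650, "transcription_factor_binding_site"),
    ("chr1", 600, 900, "open_chromatin"),
    ("chr2", 50, 120, "noncoding_rna"),
]

# Hypothetical variants: (chromosome, position).
VARIANTS = [("chr1", 150), ("chr1", 620), ("chr1", 950), ("chr2", 60), ("chr3", 10)]


def build_index(elements):
    """Group elements by chromosome and sort them by start coordinate."""
    index = defaultdict(list)
    for chrom, start, end, label in elements:
        index[chrom].append((start, end, label))
    for chrom in index:
        index[chrom].sort()
    return index


def annotate(variant, index):
    """Return the sorted labels of every element whose interval contains the variant."""
    chrom, pos = variant
    intervals = index.get(chrom, [])
    starts = [start for start, _, _ in intervals]
    # Only elements that start at or before the position can contain it.
    upper = bisect_right(starts, pos)
    return sorted(
        label for start, end, label in intervals[:upper] if start <= pos < end
    )


index = build_index(ELEMENTS)
annotations = {variant: annotate(variant, index) for variant in VARIANTS}
for variant, labels in annotations.items():
    print(variant, labels or ["no annotated element"])
```

In this toy example, a variant that falls outside every exon can still land in one or more non-coding elements, which is exactly the kind of hit an exome-only analysis would never see. A real pipeline would use dedicated interval tools and the published annotation files rather than hand-built lists.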
One important consequence of the ENCODE Consortium’s work for biomedical science-based businesses is the clear implication that RNA, not protein as many scientists believed for decades, is the major player in human biology. With many drug pipelines failing or drying up, this may indicate that the reason for such failures is a lack of understanding of basic biology and not marketing or investment strategies. It may also suggest that biopharma companies should be actively pursuing RNA-based research into treatments and diagnostics rather than retreating from RNA work.
The ENCODE Consortium’s new synthesis also heralds the final demise of the old and broken Central Dogma of molecular biology and its replacement with a more accurate, robust, networked “GPS” meme proposed by Drs Eric Schadt and Rui Chang in a recent Science article.
Another major impact of the ENCODE Consortium’s work will come from the management model it used to generate these results. Just as the Space Race required a large and far-flung group of scientific collaborators, the ENCODE Consortium had to find ways to manage over 440 strong-willed individuals in a highly collaborative endeavor—more people than the vast majority of biomedical organizations have.
How did they accomplish this?
We can see hints in one of the key results reported by the ENCODE Consortium: that 80 percent of the genome has a specific biochemical function. One of the reasons for emphasizing this number is very pragmatic and very human.
Dr. Ewan Birney, associate director of the European Bioinformatics Institute (EBI) and one of the project’s members, wrote in a blog post, “we choose 80% because (a) it is inclusive of all the ENCODE experiments (and we did not want to leave any of the sub-projects out) and (b) 80% best conveys the difference between a genome made mostly of dead wood and one that is alive with activity.”
By choosing an inclusive figure, the ENCODE Consortium could recognize all the work and all the collaborators involved in the entire project, not just those working on the ‘important’ parts. And no one felt that helping anyone else was going to hurt their own chances of getting published.
Scientific research is often a zero-sum game full of competition that works against collaboration. The search for non-zero-sum solutions not only helped solve the complex conundrum of managing over 440 people; it is also visible in one of the truly innovative aspects of the entire project: the very open nature of the simultaneously published work.
It will take some time to digest this huge amount of work. Although almost all of the documents are free to read (an important innovation of the digital age), we are talking about several hundred pages of dense reading.
But Nature has a very nice explorer that lets us all see which papers touch which topics.
Never in scientific history has so much data from so many papers with so many collaborators in so many locations been published simultaneously in so many journals. And it is available to all, not just a scientific elite.
Large collaborative research projects may never be the same. Small research projects can now leverage the huge amount of data from the ENCODE Consortium to explore new areas. Novel ideas can be explored by non-scientists, who can now access this tremendous hoard of information for free.
Fifty years ago this week, President John F. Kennedy committed the US to putting a man on the Moon by the end of the decade. The resulting Apollo program accomplished that goal both ahead of schedule and under budget.
At its peak that public-private partnership not only changed our fundamental understanding of the universe, it employed over 400,000 people at all levels of our economy (from the seamstresses at the Playtex corporation who made the spacesuits to the engineers, designers and marketers at GM and Boeing who created the amazing lunar rover). By one estimate it turned a $100 billion investment of tax dollars into $1.5 trillion in new wealth, setting the stage for the IT revolution that powered our economy for decades after Apollo ended.
We now stand on the verge of an abundant future, fueled by the technologies driving SpaceX, Planetary Resources, Blue Origin and others into novel explorations of space, all based on the foundation created by NASA more than 50 years ago.
The results of the Human Genome Project are biology’s Sputnik. The ENCODE Consortium’s data and tools are the biological equivalent of Neil Armstrong’s first steps on the Moon.
There can be no doubt that these data and tools will also lead to a newer and deeper understanding of life on Earth, powering a new cycle of wealth creation in ways now unimaginable.
The data and analysis tools are now free for anyone to examine and use. Got a bright idea? Get to work.