Big data in modern biology

Posted 22 Apr 2013

There is now no question that genomics, the study of the genomes of organisms and a field that includes intensive efforts to determine the entire DNA sequence of organisms, has joined the big data club. The development of prolific new DNA sequencing technologies is forcing biologists to embrace the dizzying terms of terabytes, petabytes and, looming on the horizon, exabytes.

The exabyte (derived from the SI prefix exa-) is a unit of information or computer storage equal to one quintillion bytes (short scale). The unit symbol for the exabyte is EB. —from Wikipedia

The resulting cultural shift has brought a wave of immigrants into the land of pipettes and PCR tubes: computer scientists, physicists and mathematicians, carrying their exotic and diverse analytical expertise with them. They also brought a tradition of adherence to the open source paradigm for software, one the genomics world has recently adopted because funding agencies and scientific journals now require that data and software be shared freely.

The success of initiatives like R/Bioconductor as a resource involving peer review and publication of software has provided a great incentive for young researchers in genomics to develop creative new solutions, the kind of productive hacking that needs to be encouraged in any big data field.

We currently seem to be undergoing one of those episodes of punctuated equilibrium in the evolution of genomics software. The shift is occurring from the prior focus on individual packages addressing specific analytical tasks to more encompassing systems that combine individual packages as workflows, while managing data and processing resources and capturing metadata.

While many separate initiatives are addressing components, the field still lacks the ideal, complete system, and as a result we have not been able to foster the irreverent and innovative approaches to the integrated system that a clandestine, off-the-books skunkworks can nurture. In short, there is a desperate need for an informatics ecosystem in genomics. The requirements for this complete system, or ecosystem, quickly enter the realm of grandiosity. Not only must it address genomics challenges, it must also interoperate with other areas of digital biology, such as proteomics, metabolomics and imaging, so that they too can be accommodated.

We also have to face the issue that as genomics sequencing moves from the large centers and core facilities into the hands of individual researchers, the heterogeneity of hardware resources serving these analyses becomes extreme. Can an open source project address all of these sprawling requirements?

The Wasp System software project is an audacious attempt to create a foundational software ecosystem for modern molecular biology. The design is conceptually simple, based on a kernel which mediates between diverse user, processing, and data resources, and a diversity of plugin components interfacing via a common API. Engineered in the Spring Framework, the Wasp System’s architecture both modularizes and abstracts many of the operational functionalities unique to Spring, to the benefit of each functional component of a given genomics workflow, or indeed any workflow. The intent is a system modular enough to address all genomics needs and then extend to the other areas of big data generation in biology.
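The kernel-and-plugin design described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not Wasp's actual Java/Spring API: the names `Kernel`, `AlignerPlugin`, `PeakCallerPlugin`, `handles` and `run` are all hypothetical, chosen only to show how a kernel might route work to plugins that share one interface.

```python
from typing import Protocol


class Plugin(Protocol):
    """Hypothetical common API: every plugin exposes the same two methods."""

    name: str

    def handles(self, task: str) -> bool: ...

    def run(self, task: str, payload: dict) -> dict: ...


class AlignerPlugin:
    """Illustrative plugin that claims the 'align' task."""

    name = "aligner"

    def handles(self, task: str) -> bool:
        return task == "align"

    def run(self, task: str, payload: dict) -> dict:
        return {"plugin": self.name, "sample": payload["sample"], "task": task}


class PeakCallerPlugin:
    """Illustrative plugin that claims the 'call_peaks' task."""

    name = "peak_caller"

    def handles(self, task: str) -> bool:
        return task == "call_peaks"

    def run(self, task: str, payload: dict) -> dict:
        return {"plugin": self.name, "sample": payload["sample"], "task": task}


class Kernel:
    """Mediates between callers and plugins; knows only the common API."""

    def __init__(self) -> None:
        self._plugins: list[Plugin] = []

    def register(self, plugin: Plugin) -> None:
        self._plugins.append(plugin)

    def dispatch(self, task: str, payload: dict) -> dict:
        # The first registered plugin that claims the task handles it.
        for plugin in self._plugins:
            if plugin.handles(task):
                return plugin.run(task, payload)
        raise LookupError(f"no plugin registered for task {task!r}")


kernel = Kernel()
kernel.register(AlignerPlugin())
kernel.register(PeakCallerPlugin())
print(kernel.dispatch("align", {"sample": "S1"}))
```

Because the kernel depends only on the shared interface, a new analysis step can be added by registering another plugin, without changing the kernel itself.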

Leading the project at the Albert Einstein College of Medicine in New York is Andy McLellan, who is unusual for combining a University of Cambridge doctorate in Molecular Biochemistry with a subsequent master's degree in software development and design. He made the strategic decision to develop the Wasp System software using the Spring framework for Java, with open source development in mind. The development team also includes the head of Einstein’s Computational Genomics core facility, Brent Calder, who has migrated genomic software tools into the original Wasp environment to automate analyses for molecular biology colleagues. Their prototype, a Perl-based system focused on genomics, has been functional for almost four years, managing and processing almost a petabyte of sequencing data over this time and creating a foundation of experience that few others can match in this young field.

Now that Wasp has been rewritten in Spring/Java, the goal of the developers is to give it away to as many people as possible. Initial test partners have included the Memorial Sloan-Kettering Cancer Center in New York, the University of California San Diego, and the Australian Genome Research Facility. Once software is 'out in the wild,' the open source approach typically relies at least partially on volunteers to act as curators, but the maintenance plan for the Wasp System is more overtly structured, following what the Einstein group calls a nurtured open source model. Component plugins are developed by community users to suit local institutional needs, but are tested by the Einstein group for forward compatibility with the latest Wasp software versions, available on GitHub.

In return for having the plugin made forward compatible with Wasp, the participating developer makes their component available to the entire Wasp user community, a model designed to expand the capabilities of the system as a whole quickly.

Wasp also addresses the challenge of increasingly diverse hardware resources available to process the data generated, again taking advantage of the fundamental design envisaged by Andy McLellan and Brent Calder. The processing scheduling component of Wasp is essentially agnostic with regard to hardware implementation and, as such, anticipates future trends towards cloud or grid-based computing. By taking advantage of Spring’s capabilities, it is possible to build the basis for a distributed peer network, where instances share and provision data intuitively and respond to data generation by launching appropriate analytical pipelines on HPC resources, while reacting appropriately to computational and data-related errors. In this regard, the Wasp System encompasses a real-time messaging system, a responsive workflow and pipeline system, and an interface to HPC assets, designed around best practices for genomics analysis and the challenges of big data in the wider digitized life sciences.
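To make the idea of hardware-agnostic, message-driven processing more concrete, here is a small sketch under stated assumptions. It is Python, not Wasp's Java/Spring code, and every name in it (`Backend`, `LocalBackend`, `RecordingClusterBackend`, `MessageBus`, `make_launcher`, the `run_complete` topic, the `count_reads` step) is hypothetical. The cluster backend is a stand-in that only records job names and runs the work in-process, so nothing here talks to a real scheduler.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Backend(ABC):
    """Hypothetical execution target; the pipeline never sees its details."""

    @abstractmethod
    def submit(self, job_name, func, *args):
        ...


class LocalBackend(Backend):
    """Runs the job directly in the current process."""

    def submit(self, job_name, func, *args):
        return func(*args)


class RecordingClusterBackend(Backend):
    """Stand-in for a grid or cloud target: records the job, runs it in-process."""

    def __init__(self):
        self.submitted = []

    def submit(self, job_name, func, *args):
        self.submitted.append(job_name)
        return func(*args)


class MessageBus:
    """Minimal publish/subscribe bus keyed by topic name."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, message):
        return [handler(message) for handler in self._handlers[topic]]


def count_reads(reads):
    """Toy analysis step; an empty run is treated as a data error."""
    if not reads:
        raise ValueError("no reads in run")
    return len(reads)


def make_launcher(backend):
    """Build a handler that launches the pipeline when new data is announced."""

    def on_run_complete(message):
        run_id = message["run_id"]
        try:
            n = backend.submit(f"count-{run_id}", count_reads, message["reads"])
            return {"run_id": run_id, "status": "ok", "read_count": n}
        except ValueError as err:
            # React to a data-related error instead of crashing the bus.
            return {"run_id": run_id, "status": "failed", "error": str(err)}

    return on_run_complete


bus = MessageBus()
bus.subscribe("run_complete", make_launcher(LocalBackend()))
print(bus.publish("run_complete", {"run_id": "r1", "reads": ["ACGT", "TTGA"]}))
```

Swapping `LocalBackend()` for another `Backend` changes where the work runs without touching the handler or the analysis step, which is the sense in which the scheduling layer in this sketch is hardware agnostic.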

With a powerful and flexible foundational software system like Wasp in place, what should emerge as a consequence is a sandbox for genomics hackers, where pipeline components are juggled, and innovative visualization tools are developed and implemented. A major emphasis for the Einstein group is the encouragement of Wasp’s use to host this kind of skunkworks.

The goal is to use Wasp to nurture an irreverent disregard for convention that is not currently possible with systems that are more rigid or of more limited scope. In this way we can hope to overcome big data challenges and flourish in the current era of genomics discovery unlocked by DNA sequencing technologies of unprecedented power.

Trained as a physicist (B.A.), computational scientist (M.Sc.) and astrophysicist (Ph.D.). Prior to joining the Einstein faculty, ran concurrent research programs in astrophysics (pulsars, brown dwarfs, virtual observatories), bioinformatics (data mining, genomics) and high performance computing (time series analysis, light scattering from particulates). Namesake is also known as minor planet 11451.
