Open source bioinformatics data platform gets help from student hackers

Bio4j was selected to take part in Google Summer of Code 2014, and the work that began this summer has recently culminated in great success after months of effort by the Era7 Bioinformatics team.

Era7 Bioinformatics is a company specializing in sequence analysis, knowledge management, and sequencing data interpretation. Our mission is to help our customers obtain the maximum value from their next-generation sequencing projects.

Bio4j is our high-performance, cloud-enabled, graph-based, open source bioinformatics data platform. It integrates the data available in the most representative open data sources for protein information: UniProtKB (Swiss-Prot + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), RefSeq, NCBI Taxonomy, and the ExPASy Enzyme DB. The current version has more than 2,000,000,000 relationships, 400,000,000 nodes, and 1,000,000,000 properties.

Bio4j provides a completely new and powerful framework for querying and managing protein-related information. Because it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure. In contrast, traditional relational databases must flatten the data they represent into tables, creating artificial IDs to connect the different tuples, which in some cases leads to domain models that have almost nothing to do with the actual structure of the data.
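The difference between the two storage styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the accessions, GO identifiers, the `ANNOTATED_WITH` relationship name, and the helper functions are placeholders invented for the example, not Bio4j's actual schema or API.

```python
# Illustrative sketch only: identifiers, the relationship name, and the helpers
# are hypothetical and do not reflect Bio4j's real schema or API.
from collections import defaultdict

# Relational view: rows plus a join table keyed by artificial integer ids.
proteins = {1: "P12345", 2: "Q67890"}
go_terms = {10: "GO:0005524", 11: "GO:0016301"}
protein_go = [(1, 10), (1, 11), (2, 10)]  # (protein_id, go_term_id)

def go_terms_relational(accession):
    """Resolve annotations by finding the artificial id, then scanning the join table."""
    protein_id = next(pid for pid, acc in proteins.items() if acc == accession)
    return sorted(go_terms[gid] for pid, gid in protein_go if pid == protein_id)

# Graph view: nodes are the domain objects themselves, edges are named relationships.
edges = defaultdict(list)  # (source node, relationship type) -> target nodes

def add_edge(source, rel_type, target):
    edges[(source, rel_type)].append(target)

add_edge("P12345", "ANNOTATED_WITH", "GO:0005524")
add_edge("P12345", "ANNOTATED_WITH", "GO:0016301")
add_edge("Q67890", "ANNOTATED_WITH", "GO:0005524")

def go_terms_graph(accession):
    """Follow the relationship directly from the protein node."""
    return sorted(edges[(accession, "ANNOTATED_WITH")])
```

Both functions return the same annotations, but the graph version never touches an artificial id: the query reads the same way the domain is described.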

If you aren't familiar with the successful and popular Google Summer of Code (GSoC) program, it is a 10-year-old global program that offers funding to leading open source projects from various fields. Funding is given directly to students to help them create new features or improvements for the selected open source projects. To celebrate the success of the program this year, Google organised a meeting at its headquarters from October 23 to 26 and invited delegates from each successfully participating organisation to meet and collaborate. Two Era7 Bioinformatics delegates attended the event at Google’s Mountain View offices and took an active part in the activities organised by Google.

"This project has been a great opportunity to make our Bio4j platform an even more useful and valuable tool that we use under the hood of many of our pipelines and services, like BG7 and Genome7," said Eduardo Pareja, CEO of Era7 Bioinformatics. "Based, in part, on these improvements, we can now offer tailored Bio4j-based services to be used by other parties in their bioinformatics solutions," added Dr. Pareja.

This was Bio4j’s first year as a GSoC organization, and the team mentored three students who worked on these projects:

Dynamograph, a simple graph database based on DynamoDB that offers the possibility of persisting and retrieving data organized in graph structures.

Bio4j GraphML/GraphSON exporter, a plugin for TinkerPop3's Gremlin Console that provides traversal steps implemented in Bio4j's domain-specific language (DSL) and the :bio4j console command. The :bio4j command allows you to export queries expressed in the Gremlin graph query language or in the Bio4j DSL to GraphSON or GraphML formats.

GSoC 2014 el-grafo project, the first development of an interactive web-based tool that allows users to intuitively explore the abstract domain model of the Bio4j open source bioinformatics data platform.
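For readers who have not seen GraphML, one of the export formats mentioned for the exporter project, the following is a minimal sketch of what serializing a tiny graph to that format can look like. It uses only Python's standard library and is not the exporter's code; the `to_graphml` helper, the node ids, and the `label` edge attribute are assumptions made for illustration.

```python
# Minimal sketch, not the Bio4j exporter: to_graphml and the "label" attribute
# are hypothetical names chosen for this example.
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def to_graphml(nodes, edges):
    """Serialize node ids and (source, target, label) edges as a GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare one edge attribute that carries the relationship label.
    ET.SubElement(root, "key", {"id": "label", "for": "edge",
                                "attr.name": "label", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for node_id in nodes:
        ET.SubElement(graph, "node", id=node_id)
    for index, (source, target, label) in enumerate(edges):
        edge = ET.SubElement(graph, "edge", id=f"e{index}",
                             source=source, target=target)
        data = ET.SubElement(edge, "data", key="label")
        data.text = label
    return ET.tostring(root, encoding="unicode")
```

Calling `to_graphml(["P12345", "GO:0005524"], [("P12345", "GO:0005524", "ANNOTATED_WITH")])` yields an XML document that graph tools supporting GraphML should be able to read.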