Big data. It's one of the most pervasive buzzwords in today's technology world. But it's impossible to deny how deeply data touches all aspects of not just our lives but also business and industry. The amount of data collected about everything is staggering—a typical transatlantic flight generates 30 terabytes of data about the engines alone!
At this year's Great Wide Open conference, Milind Bhandarkar, chief scientist at Pivotal, gave a talk focusing on the evolution of how enterprises collect and store data. The crux of his talk was that open source software, in the form of Hadoop (a framework for storing and processing large data sets on clusters of hardware), has had a huge impact on that process, and will continue to do so.
Bhandarkar kicked off his talk by explaining how Web 2.0 and mobile changed the way people look at data. Websites collect information about users and what they're interested in. Mobile devices constantly transmit data about their usage. Social media generates a social graph, which:
became the most interesting type of data to analyze.
All of that led to big data, about which Bhandarkar said:
I always refer to "big data" in quotes because no one really knows what it is.
When big data started to become a concern in the early 2000s, the software landscape was dominated by a single, expensive database system. But, Bhandarkar said, open source software helped change the database and analytics landscape.
As a Hadoop midwife, Bhandarkar has a unique perspective on how the framework became so entrenched in the world of big data and how effective it is. Initially developed at Yahoo! and based on a number of open source tools, Hadoop was the basis of the project to index the entire web in one week. The framework cut its teeth indexing and analyzing the Internet Archive which, at the time, weighed in at 20 terabytes.
Since 2007, the use of Hadoop (in Bhandarkar's words) has exploded. The companies using and building on Hadoop include firms offering analytics infrastructure, operational infrastructure, storage, and more. According to Bhandarkar:
Hadoop has become a very promiscuous ecosystem.
The widespread uptake of Hadoop is partly because organizations can use it to quickly glue several disparate software solutions—like storage and databases—together. Hadoop's flexibility enables an enterprise to continue using the software and infrastructure it already has in place, while enhancing and extending that infrastructure.
Hadoop has also helped change the way data scientists perform analytics. In the past, analysis required many steps. A lot of data was tossed away because it didn't fit the structure of the analytical process. With Hadoop, Bhandarkar said, a lot of analytical grunt work can be done before saving data to the database. Only data that an enterprise can act upon is saved. This speeds up analysis and reduces storage costs.
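As a rough illustration of doing the grunt work before anything is saved, the sketch below filters raw events down to the ones worth acting on and then aggregates them, in the map-then-reduce style that Hadoop's MapReduce model is built around. It is plain Python, not Hadoop's API, and the event records, the "purchase" filter, and the function names are all hypothetical and not taken from the talk.

```python
from collections import defaultdict

# Hypothetical raw events: (user_id, action, value). Not from the talk.
RAW_EVENTS = [
    ("u1", "view", 1),
    ("u1", "purchase", 30),
    ("u2", "view", 1),
    ("u2", "view", 1),
    ("u3", "purchase", 12),
    ("u1", "purchase", 8),
]


def map_phase(events):
    """Emit (key, value) pairs, keeping only the events worth acting on."""
    for user_id, action, value in events:
        if action == "purchase":  # assumed definition of "actionable"
            yield user_id, value


def reduce_phase(pairs):
    """Sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def summarize(events):
    """Filter and aggregate; only this summary would go to the database."""
    return reduce_phase(map_phase(events))


if __name__ == "__main__":
    print(summarize(RAW_EVENTS))
```

Six raw events shrink to one total per purchasing user, which is the kind of reduction that cuts both analysis time and storage cost.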
Bhandarkar concluded his talk by illustrating how more and more analytics tools are moving to Hadoop, and how Hadoop is rapidly becoming an analytics platform that supports multiple types of data and multiple scales of data.
His final thought: because of its power and flexibility, Bhandarkar sees Hadoop becoming the single analytics platform of the future.
Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system, and has been contributing to and working with Hadoop since version 0.1.0. He started the Yahoo! Grid solutions team focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years, and were his area of specialization for his PhD (Computer Science) from the University of Illinois at Urbana-Champaign. He has worked at the Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, and LinkedIn. Currently, he is the Chief Scientist at Pivotal (formerly Greenplum, a division of EMC).