Hadoop, the open source software framework with the funny-sounding name, has been a game-changer for organizations, allowing them to store, manage, and analyze massive amounts of data for actionable insights and competitive advantage.
But this wasn't always the case.
Initially, Hadoop implementation required skilled teams of engineers and data scientists, making Hadoop too costly and cumbersome for many organizations. Now, thanks to a number of open source projects, big data analytics with Hadoop has become much more affordable and mainstream.
An early problem with Hadoop was that while it was great for storing and managing massive data volumes, analyzing that data for insights was difficult. Only skilled data scientists trained in writing complex MapReduce jobs in Java could unleash Hadoop's analytics capabilities. To solve that problem, two data scientists at Facebook, Ashish Thusoo and Joydeep Sen Sarma, created Apache Hive in 2008. They later went on to found Qubole, a cloud-based Hadoop big data analytics service.
Hive capitalizes on the ease of use of Structured Query Language (SQL), a language that requires relatively little training and is widely used by data engineers. It provides a SQL-like language called HiveQL and automatically translates HiveQL queries into MapReduce jobs that are executed on Hadoop. Because SQL is the preferred data language taught in schools and used in the industry, Hive, which put SQL on top of Hadoop, transformed Hadoop by making its formidable analytics power more readily available to people and organizations, not just developers. Hive is best used for summarizing, querying, and analyzing large sets of structured data where time is not of the essence.
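To make the translation step concrete, here is a minimal sketch in plain Python of how a simple SQL-style aggregation can be broken into map, shuffle, and reduce phases. It is an illustration only, not Hive's actual compiler or execution plan. The `page_views` table, its `country` and `views` columns, and every function name below are hypothetical and chosen just for this example.

```python
from collections import defaultdict

# Illustrative query (hypothetical table and columns):
#   SELECT country, SUM(views) FROM page_views GROUP BY country;

# Hypothetical sample rows standing in for the page_views table.
PAGE_VIEWS = [
    {"country": "US", "views": 3},
    {"country": "IN", "views": 5},
    {"country": "US", "views": 4},
    {"country": "BR", "views": 2},
    {"country": "IN", "views": 1},
]


def map_phase(rows):
    """Emit one (key, value) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["country"], row["views"]


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Apply the aggregate function (SUM) to each key's list of values."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Chain the three phases to answer the GROUP BY query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    totals = run_query(PAGE_VIEWS)
    for country in sorted(totals):
        print(country, totals[country])
```

On a real cluster the map and reduce phases would run in parallel across many machines and the shuffle would move data over the network. The sketch only shows why a one-line query is so much easier to write than the equivalent hand-coded job.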
While Hive on MapReduce is very effective for summarizing, querying, and analyzing large sets of structured data, the computations that MapReduce enables on Hadoop are slow and limited, which is where Spark comes in. Developed at UC Berkeley's AMPLab in 2009 and open sourced in 2010, Apache Spark is a powerful data processing engine for Hadoop designed to handle both batch and streaming workloads quickly. In fact, on Apache Hadoop 2.0, Spark is reported to run programs up to 100 times faster than MapReduce in memory and up to 10 times faster on disk.
The advantage for users is that Spark not only supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms, but also allows these capabilities to be combined seamlessly in a single workflow. In addition, Spark is compatible with the Hadoop Distributed File System (HDFS), HBase, and any other Hadoop storage system, which means that all of an organization's existing data is immediately usable in Spark. And because Spark unifies big data analytics, organizations have less need to build separate processing systems for their various computational needs.
Faced with the task of performing fast interactive analysis on a massive data warehouse of over 250 petabytes and counting, engineers at Facebook developed their own query engine called Presto. Unlike Spark, which runs programs both in memory and on disk, Presto runs in memory only. This design allows Presto to run simple queries on Hadoop in just a few hundred milliseconds, with more complex queries taking only a few minutes. In contrast, scanning an entire dataset using Hive, which relies on MapReduce, can take anywhere from several minutes to several hours. Presto has also been shown to be up to seven times more CPU-efficient than Hive. Presto can also combine data from multiple sources in a single query, allowing for analytics across an entire organization.
Today Presto is available as an open source distributed SQL query solution that organizations can use to run interactive analytic queries on data sources ranging from gigabytes to petabytes. With the ability to scale to the size of organizations as big as Facebook, Presto is a powerful query engine that has transformed the Hadoop ecosystem and could be transformative for organizations and entire industries as well.
Big data is getting bigger every day. As organizations look for new and better ways to leverage valuable data, they will rely less on Hadoop and MapReduce for batch processing and more on open source tools such as Hive, Spark, and Presto to meet the big data demands of the future.
This article is part of the Apache Quill column coordinated by Jason Hibbets.