The three open source projects that transformed Hadoop

Hadoop, the open source software framework with the funny-sounding name, has been a game changer for organizations, allowing them to store, manage, and analyze massive amounts of data for actionable insights and competitive advantage.

But this wasn't always the case.

Initially, implementing Hadoop required skilled teams of engineers and data scientists, making it too costly and cumbersome for many organizations. Now, thanks to a number of open source projects, big data analytics with Hadoop has become much more affordable and mainstream.

Here's a look at how three open source projects—Hive, Spark, and Presto—have transformed the Hadoop ecosystem.

Hive

An early problem with Hadoop was that while it was great for storing and managing massive data volumes, analyzing that data for insights was difficult. Only skilled data scientists trained in writing complex Java MapReduce jobs could unleash Hadoop's analytics capabilities. To solve that problem, two data scientists at Facebook, Ashish Thusoo and Joydeep Sen Sarma (who later went on to found the cloud-based Hadoop big data analytics service Qubole), created Apache Hive in 2008.

Hive capitalizes on the ease of use of Structured Query Language (SQL), a language that requires relatively little training and is widely used by data engineers. Its query language, HiveQL, automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop. Because SQL is the data language most commonly taught in schools and used in industry, Hive, which put SQL on top of Hadoop, transformed Hadoop by making its formidable analytics power readily available to people and organizations, not just developers. Hive is best used for summarizing, querying, and analyzing large sets of structured data where time is not of the essence.
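
For a sense of what this looks like in practice, here is a minimal Python sketch that submits a HiveQL aggregation through the community PyHive client. The hostname, table, and column names are hypothetical placeholders, not part of any real deployment.

    from pyhive import hive  # pip install 'pyhive[hive]'

    # Connect to a HiveServer2 endpoint (host and port are placeholders).
    conn = hive.Connection(host="hive.example.com", port=10000, database="default")
    cursor = conn.cursor()

    # An ordinary SQL-style aggregation; Hive compiles it into MapReduce jobs.
    cursor.execute("""
        SELECT country, COUNT(*) AS visits
        FROM page_views
        GROUP BY country
        ORDER BY visits DESC
        LIMIT 10
    """)

    for country, visits in cursor.fetchall():
        print(country, visits)

Nothing in the query itself hints at MapReduce; that translation is exactly the abstraction Hive added.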

Spark

While Hive on MapReduce is very effective for summarizing, querying, and analyzing large sets of structured data, the computations MapReduce can express are slow and limited, which is where Spark comes in. Developed at UC Berkeley's AMPLab in 2009 and open sourced in 2010, Apache Spark is a powerful Hadoop data processing engine designed to handle both batch and streaming workloads in record time. On Apache Hadoop 2.0, Spark runs programs up to 100 times faster than MapReduce in memory and 10 times faster on disk.

The advantage for users is that Spark not only supports operations such as SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms, but also allows these capabilities to be combined seamlessly into a single workflow. In addition, Spark is fully compatible with the Hadoop Distributed File System (HDFS), HBase, and any other Hadoop storage system, which means that all of an organization's existing data is immediately usable in Spark. And Spark's ability to unify big data analytics reduces the need for organizations to build separate processing systems for their various computational needs.
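
As a rough illustration of that unified workflow, the following PySpark sketch (using the DataFrame API from later Spark releases) reads data straight out of HDFS and queries it with SQL in the same program; the HDFS path and column names are hypothetical.

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session.
    spark = SparkSession.builder.appName("unified-workflow").getOrCreate()

    # Read existing data directly from HDFS -- no separate ingestion step.
    events = spark.read.parquet("hdfs:///data/events")
    events.createOrReplaceTempView("events")

    # Query the same data with SQL, in the same program, on the same engine.
    top_users = spark.sql("""
        SELECT user_id, COUNT(*) AS event_count
        FROM events
        GROUP BY user_id
        ORDER BY event_count DESC
        LIMIT 10
    """)
    top_users.show()

The same events DataFrame could just as easily feed a machine learning pipeline or a streaming job without leaving Spark, which is the point of the unified design.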

Presto

Faced with the task of performing fast interactive analysis on a massive data warehouse of over 250 petabytes and counting, engineers at Facebook developed their own query engine, called Presto. Unlike Spark, which runs programs both in memory and on disk, Presto runs in memory only. That design lets Presto run simple queries on Hadoop in just a few hundred milliseconds, with more complex queries taking only a few minutes. By contrast, scanning an entire dataset with Hive, which relies on MapReduce, can take anywhere from several minutes to several hours. Presto has also been shown to be up to seven times more efficient on the CPU than Hive. In addition, Presto can combine data from multiple sources in a single query, enabling analytics across an entire organization.
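
For flavor, here is a minimal Python sketch of an interactive Presto query, again through the community PyHive client; the coordinator hostname, catalog, and table names are hypothetical placeholders.

    from pyhive import presto  # pip install 'pyhive[presto]'

    # Connect to a Presto coordinator (host, catalog, and schema are placeholders).
    conn = presto.connect(host="presto.example.com", port=8080,
                          catalog="hive", schema="default")
    cursor = conn.cursor()

    # Presto plans and executes this entirely in memory across its workers.
    cursor.execute("SELECT status, COUNT(*) FROM requests GROUP BY status")
    print(cursor.fetchall())

Because the catalog is pluggable, the same connection style can point a query at Hive tables, Cassandra, MySQL, or other sources, which is how the cross-source analytics described above work.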

Today Presto is available as an open source distributed SQL query engine that organizations can use to run interactive analytic queries against data sources ranging from gigabytes to petabytes. With the ability to scale to organizations as large as Facebook, Presto has transformed the Hadoop ecosystem and could be transformative for organizations and entire industries as well.

Big data is getting bigger every day. As organizations look for new and better ways to leverage valuable data, they will rely less on Hadoop and MapReduce for batch processing and more on open source tools such as Hive, Spark, and Presto to meet the big data demands of the future.


This article is part of the Apache Quill column coordinated by Jason Hibbets. Share your success stories and open source updates within projects at Apache Software Foundation by contacting us at open@opensource.com.

Jonathan Buckley is the Interim Chief Marketing Officer at Qubole.

3 Comments

Jonathan, very informative article. One alternative to Hadoop is HPCC Systems which offers a proven open source Big Data platform designed by data scientists that provides a complete integrated solution from data ingestion and data processing to data delivery. More at http://hpccsystems.com

Given that this is part of the Apache Quill series, does anyone else notice that mentions of "Apache" before the various project names here are few and far between?

The best practice is to use the full Apache Hadoop form of the name in the first and most prominent uses on any page.

Content-wise, good stuff. "Hadoop" is such a misunderstood term that it's important to get people to think about *how* they're actually going to implement big data, and these are some of the other key projects that help you with that.

Thanks for your post! Hive is fundamentally an operational data store that's also suitable for analyzing large, relatively static data sets where query time is not important. Hive makes an excellent addition to an existing data warehouse, but it is not a replacement. Instead, using Hive to augment a data warehouse is a great way to leverage existing investments while keeping up with the data deluge. More at www.youtube.com/watch?v=1jMR4cHBwZE

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.