An introduction to big data


Big data: everyone seems to be talking about it, but what is big data really? How is it changing the way researchers at companies, nonprofits, governments, institutions, and other organizations are learning about the world around them? Where is this data coming from, how is it being processed, and how are the results being used? And why is open source so important to answering these questions?

In this short primer, learn all about big data and what it means for the changing world we live in.

What is big data?

There is no hard and fast rule about exactly what size a database needs to be for the data inside of it to be considered "big." Instead, what typically defines big data is the need for new techniques and tools to process it. To use big data, you need programs that span multiple physical and/or virtual machines working in concert to process all of the data in a reasonable span of time.

Getting programs on multiple machines to work together in an efficient way so that each program knows which components of the data to process, and then being able to put the results from all the machines together to make sense of a large pool of data, takes special programming techniques. Since it is typically much faster for programs to access data stored locally instead of over a network, the distribution of data across a cluster and how those machines are networked together are also important considerations when thinking about big data problems.
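
To make the idea of splitting data across machines concrete, here is a minimal sketch in Python. It is not how any particular framework does it; the function names, the sample readings, and the use of a CRC32 hash to pick a worker are all assumptions made for illustration. Each "worker" only ever touches the records assigned to it, and the partial results are combined at the end.

```python
from collections import defaultdict
import zlib

def assign_worker(key: str, num_workers: int) -> int:
    """Pick a worker for a key using a stable hash (CRC32), so every
    machine agrees on where a given key lives."""
    return zlib.crc32(key.encode("utf-8")) % num_workers

def partition(records, num_workers):
    """Group (key, value) records by the worker that should hold them."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[assign_worker(key, num_workers)].append((key, value))
    return partitions

def local_total(partition_records):
    """Work done on one worker: sum only the values stored there."""
    return sum(value for _, value in partition_records)

# Hypothetical sensor readings: (sensor id, measured value).
readings = [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 4), ("sensor-c", 1)]

partitions = partition(readings, num_workers=3)

# Each worker computes a partial result from its local data only...
partial_totals = [local_total(recs) for recs in partitions.values()]

# ...and the partial results are combined into the final answer.
grand_total = sum(partial_totals)
print(grand_total)  # 13
```

Because the hash is stable, all readings from the same sensor land on the same worker, which is what lets each worker compute per-sensor results without asking other machines for data.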

What kinds of datasets are considered big data?

The uses of big data are almost as varied as they are large. Prominent examples you're probably already familiar with include social media networks analyzing their members' data to learn more about them and connect them with content and advertising relevant to their interests, and search engines looking at the relationship between queries and results to give better answers to users' questions.

But the potential uses go much further! Two of the largest sources of big data are transactional data, including everything from stock prices to bank data to individual merchants' purchase histories; and sensor data, much of it coming from what is commonly referred to as the Internet of Things (IoT). This sensor data might be anything from measurements taken from robots on an automaker's manufacturing line, to location data on a cellphone network, to instantaneous electrical usage data in homes and businesses, to passenger boarding information taken on a transit system.

By analyzing this data, organizations can identify trends in what they are measuring, as well as in the behavior of the people generating the data. The hope is that this analysis will enable more customized service and greater efficiency in whichever industry the data is collected from.

How is big data analyzed?

One of the best-known methods for turning raw data into useful information is MapReduce. MapReduce is a method for taking a large data set and performing computations on it across multiple computers, in parallel. The term names a programming model, and it is often also used to refer to a specific implementation of that model.

In essence, MapReduce consists of two parts. The Map function filters and sorts the data, placing it into categories (typically as key-value pairs) so that it can be analyzed. The Reduce function then summarizes the data by combining the values in each category. While largely credited to research that took place at Google, MapReduce is now a generic term and refers to a general model used by many technologies.
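
The classic illustration of this model is counting words. The sketch below runs on a single machine in plain Python and only mimics the shape of a MapReduce job; the function names (map_phase, shuffle, reduce_phase) and the sample documents are assumptions for illustration, not part of any framework's API. In a real system, the map calls would run on many machines at once, and the framework would handle the grouping step in between.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Group all emitted values by key, as a framework would do
    between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single summary."""
    return (key, sum(values))

# Two tiny stand-in "documents"; real inputs would be far larger.
documents = ["big data is big", "data about data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Note that each call to map_phase looks at only one document, and each call to reduce_phase looks at only one word. That independence is what makes the work easy to spread across many machines.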

What tools are used to analyze big data?

Perhaps the most influential and established tool for analyzing big data is known as Apache Hadoop. Apache Hadoop is a framework for storing and processing data at a large scale, and it is completely open source. Hadoop can run on commodity hardware, making it easy to use with an existing data center, or even to conduct analysis in the cloud. Hadoop is broken into four main parts:

  • The Hadoop Distributed File System (HDFS), which is a distributed file system designed for very high aggregate bandwidth;
  • YARN, a platform for managing Hadoop's resources and scheduling programs that will run on the Hadoop infrastructure;
  • MapReduce, as described above, a model for large-scale data processing;
  • And Hadoop Common, a set of libraries and utilities for the other modules to use.

To learn more about Hadoop, see our Introduction to Apache Hadoop for big data.

Other tools are out there too. One that receives a lot of attention is Apache Spark. The main selling point of Spark is that it stores much of the data for processing in memory, as opposed to on disk, which for certain kinds of analysis can be much faster. Depending on the operation, analysts may see results a hundred times faster or more. Spark can use HDFS, but it is also capable of working with other data stores, like Apache Cassandra or OpenStack Swift. It's also fairly easy to run Spark on a single local machine, making testing and development easier.

For more on Apache Spark, see our collection of articles on the topic.

Other big data tools

Of course, these aren't the only big data tools out there. There are countless open source solutions for working with big data, many of them specialized for providing optimal features and performance for a specific niche or for specific hardware configurations.

The Apache Software Foundation (ASF) supports many of these big data projects. Here are some that you may find useful.

  • Apache Beam is "a unified model for defining both batch and streaming data-parallel processing pipelines." It allows developers to write code that works across multiple processing engines.
  • Apache Hive is a data warehouse built on Hadoop. A top-level Apache project, it "facilitates reading, writing, and managing large datasets … using SQL."
  • Apache Impala is an SQL query engine that runs on Hadoop. A top-level Apache project since 2017, it is touted for improving SQL query performance while offering a familiar interface.
  • Apache Kafka allows users to publish and subscribe to real-time data feeds. It aims to bring the reliability of other messaging systems to streaming data.
  • Apache Lucene is a full-text indexing and search software library that can be used for recommendation engines. It's also the basis for many other search projects, including Solr and Elasticsearch.
  • Apache Pig is a platform for analyzing large datasets that runs on Hadoop. Yahoo, which developed it to do MapReduce jobs on large datasets, contributed it to the ASF in 2007.
  • Apache Solr is an enterprise search platform built upon Lucene.
  • Apache Zeppelin is a web-based notebook that enables interactive data analytics with SQL and other programming languages.
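
To give a feel for what the full-text indexing mentioned for Lucene involves, here is a toy inverted index in Python. This is only a sketch of the general technique, not Lucene's API or its actual implementation; the function names and sample documents are made up for illustration. The index maps each term to the set of documents that contain it, so a search never has to scan the documents themselves.

```python
from collections import defaultdict

def build_index(docs: dict) -> dict:
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query: str) -> set:
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's documents, then narrow with each further term.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents keyed by id.
docs = {
    1: "open source big data tools",
    2: "big data search",
    3: "open source search",
}

index = build_index(docs)
print(sorted(search(index, "big data")))     # [1, 2]
print(sorted(search(index, "open search")))  # [3]
```

Production search libraries add much more on top of this idea, such as relevance ranking and text analysis, but the term-to-documents lookup is the core of why searching a large collection can be fast.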

Other open source big data tools you may want to investigate include:

  • Elasticsearch is another enterprise search engine based on Lucene. It's part of the Elastic stack (formerly known as the ELK stack for its components: Elasticsearch, Logstash, and Kibana) that generates insights from structured and unstructured data.
  • Cruise Control was developed by LinkedIn to run Apache Kafka clusters at large scale.
  • TensorFlow is a software library for machine learning that has grown rapidly since Google open sourced it in late 2015. It's been praised for "democratizing" machine learning because of its ease of use.

As big data continues to grow in size and importance, the list of open source tools for working with it will certainly continue to grow as well.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.