What is big data?

An introduction to big data

Big data: everyone seems to be talking about it, but what is big data really? How is it changing the way researchers at companies, non-profits, governments, institutions, and other organizations are learning about the world around them? Where is this data coming from, how is it being processed, and how are the results being used? And why is open source so important to answering these questions?

In this short primer, learn all about big data and what it means for the changing world we live in.

What is big data?

There is no hard and fast rule about exactly how large a database needs to be for the data inside it to be considered "big." Instead, what typically defines big data is the need for new techniques and tools to process it. To use big data, you need programs that span multiple physical or virtual machines, working in concert to process all of the data in a reasonable span of time.

Getting programs on multiple machines to work together efficiently takes special programming techniques: each program must know which portion of the data to process, and the results from all of the machines must then be combined to make sense of the large pool of data. Since it is typically much faster for programs to access data stored locally than over a network, how the data is distributed across a cluster and how those machines are networked together are also important considerations when thinking about big data problems.
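As a rough illustration of that split-process-combine pattern, here is a minimal Python sketch. It is not how any particular big data framework works: threads stand in for separate machines, and the function names (partition, process_chunk, distributed_mean) are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_workers):
    """Split records into num_workers chunks, dealing them out round-robin."""
    return [records[i::num_workers] for i in range(num_workers)]


def process_chunk(chunk):
    """Work done independently by one worker: a partial sum and a count."""
    return sum(chunk), len(chunk)


def distributed_mean(records, num_workers=4):
    """Compute a mean by combining partial results from every worker."""
    chunks = partition(records, num_workers)
    # Threads are only a stand-in here for programs on separate machines.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


readings = list(range(1, 101))
print(distributed_mean(readings))
```

Each worker only ever sees its own chunk, and only the small partial results (a sum and a count) travel back to be combined, which is the property that makes this style of computation practical when the data itself is too large to move around.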

What kind of datasets are considered big data?

The uses of big data are almost as varied as the datasets are large. Prominent examples you're probably already familiar with include social media networks analyzing their members' data to learn more about them and connect them with content and advertising relevant to their interests, and search engines looking at the relationship between queries and results to give better answers to users' questions.

But the potential uses go much further! Two of the largest sources of high-volume data are transactional data, including everything from stock prices to bank data to individual merchants' purchase histories; and sensor data, much of it coming from what is commonly referred to as the Internet of Things (IoT). This sensor data might be anything from measurements taken from robots on the manufacturing line of an automaker, to location data on a cell phone network, to instantaneous electrical usage in homes and businesses, to passenger boarding information taken on a transit system.

By analyzing this data, organizations are able to learn about trends in what they are measuring, as well as about the people generating the data. The hope for this big data analysis is to provide more customized service and increased efficiency in whatever industry the data is collected from.

How is big data analyzed?

One of the best-known methods for turning raw data into useful information is MapReduce. MapReduce is a method for taking a large data set and performing computations on it across multiple computers in parallel. It serves as a programming model, and the name is also often used to refer to an actual implementation of that model.

In essence, MapReduce consists of two parts. The Map function does sorting and filtering, taking data and placing it into categories so that it can be analyzed. The Reduce function provides a summary of this data by combining it all together. While largely credited to research that took place at Google, MapReduce is now a generic term and refers to a general model used by many technologies.
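To make the two parts concrete, here is a small single-machine sketch of the MapReduce model applied to counting words. It is an illustration only, not the Hadoop API; the function names and the explicit shuffle step (which real frameworks perform between Map and Reduce) are written out by hand for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (category, value) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


docs = ["big data is big", "data is everywhere"]
print(word_count(docs))
```

Because map_phase looks at one document at a time and reduce_phase looks at one key at a time, both steps could in principle be spread across many machines, with only the grouping step needing to move data between them.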

What tools are used to analyze big data?

Perhaps the most influential and established tool for analyzing big data is Apache Hadoop. Apache Hadoop is a framework for storing and processing data at a large scale, and it is completely open source. Hadoop can run on commodity hardware, making it easy to use with an existing data center, or even to conduct analysis in the cloud. Hadoop is broken into four main parts:

  • The Hadoop Distributed File System (HDFS), which is a distributed file system designed for very high aggregate bandwidth;
  • YARN, a platform for managing Hadoop's resources and scheduling programs which will run on the Hadoop infrastructure;
  • MapReduce, as described above, a model for doing big data processing;
  • And a common set of libraries for other modules to use.

To learn more about Hadoop, see our Introduction to Apache Hadoop for big data.

Other tools are out there too. One that has been receiving a lot of attention recently is Apache Spark. The main selling point of Spark is that it stores much of the data for processing in memory, as opposed to on disk, which for certain kinds of analysis can be much faster. Depending on the operation, analysts may see results a hundred times faster or more. Spark can use the Hadoop Distributed File System, but it is also capable of working with other data stores, like Apache Cassandra or OpenStack Swift. It's also fairly easy to run Spark on a single local machine, making testing and development easier.
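The idea behind keeping data in memory can be shown with a toy Python sketch. This is not Spark code and does not use Spark's API; it simply counts how often a small temporary file is read when each analysis step goes back to disk, compared with loading the data once and reusing it. All names here are hypothetical.

```python
import os
import tempfile


def read_numbers(path, counter):
    """Load every number from the file, recording that a disk read happened."""
    counter["disk_reads"] += 1
    with open(path) as f:
        return [float(line) for line in f]


def analyze_from_disk(path, counter):
    """Each analysis step goes back to the file on disk."""
    total = sum(read_numbers(path, counter))
    largest = max(read_numbers(path, counter))
    smallest = min(read_numbers(path, counter))
    return total, largest, smallest


def analyze_in_memory(path, counter):
    """Load the data once, then run every step against the in-memory copy."""
    data = read_numbers(path, counter)
    return sum(data), max(data), min(data)


# Write a small sample file to stand in for a dataset on disk.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(str(n) for n in range(1, 11)))
    path = f.name

disk_counter = {"disk_reads": 0}
mem_counter = {"disk_reads": 0}
disk_result = analyze_from_disk(path, disk_counter)
mem_result = analyze_in_memory(path, mem_counter)
os.remove(path)

print(disk_result == mem_result, disk_counter, mem_counter)
```

Both approaches produce the same answers, but the in-memory version touches the disk once instead of once per step. The more passes an analysis makes over the same data, the more that difference matters.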

For more on Apache Spark, see our collection of articles on the topic.

Of course, these aren't the only two tools out there. There are countless open source solutions for working with big data, many of them specialized for providing optimal features and performance for a specific niche or for specific hardware configurations. And as big data continues to grow in size and importance, the list of open source tools for working with it will certainly continue to grow alongside.