Teaching big data processing with open source software



The continuing growth of massive and diverse data volumes, and of data-intensive applications, has created a need for effective data management across all sectors. According to a recent report, businesses face a huge skill gap in the management of big data, with the gap growing from 400 in 2007 to 4,000 in 2012 in the United Kingdom alone. In addition, students generally lack an understanding of current data analytics processes, which are becoming extremely important as the Internet of Things (IoT) and real-time data continue to grow.

As a computer scientist who studies and builds modeling and simulation applications, I was initially perplexed by the attraction of the term big data. Businesses seem to focus on Hadoop-related software for data analytics, and having Hadoop-related projects on your resume can be a bonus. As a teacher of cloud computing and software engineering, I decided to assign two students Hadoop-related projects for big data management with a "smart cities" focus, and interviewed them about their learning objectives to see what they thought about the technologies.

As a prerequisite, the students were given full freedom to examine the topic of Hadoop big data processing, and were asked to explore whichever tools they wanted in this area. Hadoop is a set of tools that supports running big data applications as multiple jobs, so that massive amounts of data can be processed quickly. It is an environment for running MapReduce jobs, which are usually executed in batches. Hadoop has become one of the most important tools in science projects that require analyzing data. Some of the Hadoop-related tools my students investigated included:

  • Apache Ambari: A framework for managing and monitoring Hadoop clusters.
  • Apache Pig: A platform for running code that analyzes large data sets.
  • Apache Sqoop: A tool used for moving data between Hadoop and other data stores.
  • Apache ZooKeeper: A tool used for providing synchronization and maintaining configuration information.
  • Apache Spark: A newer tool used to run analysis faster on some types of data.
  • Apache Flume: A system that gathers information that is later stored in HDFS.
  • Apache Hive: A tool that allows users to use a SQL-like language to analyze data.
  • Apache Oozie: A tool that starts analysis jobs that have been broken into different parts, in the correct sequence.
  • Hadoop Distributed File System (HDFS): A distributed file system that divides data between the nodes of a cluster.
  • HCatalog: A tool used to upload tables and manage data, which enables users to analyze data with different processing tools like Pig, Hive, and MapReduce.
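For readers who have not seen the MapReduce model that Hadoop runs, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are made up for illustration. It only shows the three steps the framework spreads across a cluster: map each record to key/value pairs, group the pairs by key, and reduce each group to one result.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input records, for illustration only.
    sample = ["smart city data", "city traffic data", "data"]
    print(run_job(sample))
```

On a real cluster the map and reduce calls would run in parallel on different nodes, with the input split across HDFS; the sketch keeps everything in one process so the data flow is easy to follow.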

After the students successfully finished their final year dissertations, I asked them some questions to understand what they learned from the experience. Here are the responses from both of my students, Saudamini Sonalker and Rafiat Olubodun Kadiri, who were doing independent experiments with Hadoop.

Why did you want to learn Hadoop? Just to learn something new, or were you influenced by industry interest in the project?

Saudamini: I was primarily motivated to work on this topic after having read a book about big data by Viktor Mayer-Schönberger and Kenneth Cukier: Big Data: A Revolution That Will Transform How We Live, Work, and Think. The predictive nature of tools that assist big data processing is what drew me to learning more about it. Concentrating on smart city data was also an interesting element of this project. I want to learn and understand more about how city data can be utilized to make cities efficient, green, and smart.

Rafiat: I chose the topic of Hadoop because it is a new area; it is a buzzword, and recently it has been dominating the market. Different businesses make use of it, including social media websites such as Twitter and Facebook, which use Hadoop to mine data for different purposes and to make reasonable business decisions.

What do companies use big data for? What kind of questions are they using it to ask?

Saudamini: Companies use big data for numerous purposes. Amazon utilizes it for recommendations, Skyscanner and Kayak for adjusting flight prices by monitoring an individual's past searches, and Google uses it to determine the order of search results. An interesting use of big data was Amsterdam's Energy Atlas project. It used energy consumption data from within the city to promote renewable energy by making its citizens aware of their own usage.

Rafiat: Different companies have different uses for big data. How a company uses big data depends on what type of service it provides to the public. Businesses like eBay and Amazon use big data to predict what customers may want, based on their previous purchase history and similar purchases by other customers.

What problems did you have when installing Hadoop while setting up the sandbox environment? What led you to choose Hortonworks Sandbox for your experiments?

Saudamini: I explored a couple of options before deciding on Hortonworks Data Platform. The major reason for choosing it was that it is open source and free. Competitors like MapR, Amazon Web Services, and Cloudera, however good their platforms, were expensive. However, there were strict hardware requirements for setting up the sandbox. A 64-bit processor was necessary to access the sandbox via a virtual machine, and it required at least 4GB of RAM. This slowed the process down for me, and the platform has no flexibility in terms of requirements.

Rafiat: There are quite a number of public Hadoop clusters that have been designed for storing and analyzing large amounts of unstructured data in a computing environment. They are available on cloud infrastructures such as Heroku, Hortonworks Sandbox, Azure, and others.

After a few searches, I decided to use Hortonworks Data Platform, an open source Apache Hadoop data platform. The system requirements included a Windows or Mac operating system, at least 4GB of RAM, a virtual machine environment, and a 64-bit chip that supports virtualization.

The first step was to download a virtual machine, then download the sandbox from the Hortonworks website. After this I connected to the sandbox with the given IP address.

There were some negative aspects to using the Hortonworks sandbox for research, which I still face. First, I was unable to access the sandbox with the given IP address for a while, but after multiple trials, it worked. Second, the virtual machine slowed down my computer the moment it was switched on, and it took a long time for a query to load.

Further, when my machine switches itself off before I can shut down the virtual machine myself, the virtual machine comes up with configuration errors the next time I switch it on, which prevents me from accessing the sandbox. Another issue is that I sometimes cannot access some of the tools, which slows down my research.

How does the Hortonworks Data Platform work?

Saudamini: The platform can be divided into three layers: the data access layer, cluster resource management, and HDFS. The data access layer is where the user uploads, catalogues, and manages data; one uses this layer to enter their Hive/Pig jobs for the system to perform. Cluster resource management (YARN) is an architectural hub for data processing engines so multiple applications can be run on the HDFS. This layer essentially works as a translator for the other two. Finally, HDFS is where the MapReduce jobs are run in parallel between the master and slave nodes.

Ambari is a web-based GUI that can speak to the underlying machinery and allows users to set up and manage a Hadoop cluster.

Rafiat: When accessing the sandbox, I was directed to a page where I had access to different tools like Hive, File browser, Pig, Job browser, and others. I could upload different types of files (zip, csv, xml), then create tables with tools like Hive, Pig, and HCatalog from the file that had been uploaded through the file browser icon. I could then write queries to produce different tables, with different criteria to fit a requirement.

Ambari can be used to monitor and manage Hadoop clusters. It monitors the outcome of the queries that have been carried out, and shows their effect on CPU usage, memory usage, network usage, etc.

What tools did you explore, and what were the new things you learned in the process?

Saudamini: Initially, I planned on exploring Pig and Hive, but I had issues running the Pig script on Hortonworks Sandbox and hence stuck with Hive. Hive Query Language is very similar to SQL, therefore if someone is proficient in the latter, then they shouldn't have an issue working with the tool. On Hortonworks Sandbox, Hive has a graphical user interface called Beeswax. Hive converts queries you write into MapReduce jobs. Whether or not one needs multiple options to process data depends on the skill sets of the users working on a large project. Hive diminishes the need to train or hire external resources in order to fill in the gap. The flexibility is useful in scenarios like those.

Rafiat: I used Hive, which uses an SQL-like scripting language known as HiveQL. It is suitable for users who are familiar with structured query language. Additionally, I used Pig for data analysis; it is a high-level processing layer on Hadoop, and it has its own language called Pig Latin.
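To give a feel for the SQL-like style both students describe, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in. It is not Hive and does not run on Hadoop; the `journeys` table, its columns, and the sample rows are hypothetical. The point is only that an aggregate query of this shape (`SELECT ... GROUP BY ... ORDER BY`) should read much the same in HiveQL, where Hive would turn it into jobs on the cluster instead of running it locally.

```python
import sqlite3


def journeys_per_borough(rows):
    """Sum trips per borough with a SQL query, busiest borough first.

    `rows` is a list of (borough, trips) tuples. SQLite is used here
    only as a local stand-in for a SQL-like engine such as Hive.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE journeys (borough TEXT, trips INTEGER)")
        conn.executemany("INSERT INTO journeys VALUES (?, ?)", rows)
        query = """
            SELECT borough, SUM(trips) AS total_trips
            FROM journeys
            GROUP BY borough
            ORDER BY total_trips DESC
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical rows, not taken from the students' data sets.
    sample_rows = [("Borough A", 3), ("Borough B", 10), ("Borough A", 4)]
    for borough, total in journeys_per_borough(sample_rows):
        print(borough, total)
```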

What kind of files did you process? Smart city datasets?

Saudamini: I concentrated on smart city data, specifically London traffic and social data.

Rafiat: Smart city data were used for these experiments. Most of the data was retrieved from the ITU data statistics website and the London data store website.

What were the goals of the experiments? What did you achieve?

Saudamini: The goal was to observe performance of the underlying machinery and cluster loads. After processing different big data files I compared results of CPU performance, cluster loads, memory usage, and network usage.

Transport and social data were processed on the platform to check the feasibility of implementing smart offices within London to reduce traffic and save people's time. The hypothesis was that there would be a correlation between high-traffic boroughs and the boroughs with the most work destinations. Although that held up in most cases, these boroughs were not in central London as initially imagined.
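One common way to check a hypothesis like this is a Pearson correlation coefficient between two per-borough measures. The sketch below is an assumption about how such a check could be done, not the method used in the dissertation, and the numbers in it are placeholders invented for illustration, not London data.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values per borough (hypothetical, for illustration only).
    traffic_volume = [120, 340, 560, 210, 480]
    work_destinations = [15, 40, 70, 30, 50]
    print(f"correlation: {pearson(traffic_volume, work_destinations):.2f}")
```

A value close to 1 would support the hypothesis that busy boroughs are also popular work destinations; a value near 0 would not.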

Rafiat: The goal of the experiment was to analyze a set of data retrieved from different sources like the ITU (International Telecommunication Union) website, the London data store, public data sets on Amazon Web Services, etc. The aim was to use volume as one of the criteria to consider while analyzing the data. By doing this, the experiment would be able to show how long it takes for the data to be processed.
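The idea of measuring processing time against data volume can be sketched on a single machine as follows. This is only an illustration under stated assumptions: `process` is a stand-in workload, the input is synthetic, and the sizes are arbitrary. On the sandbox the same measurement would be made on real Hive or Pig jobs rather than on a Python loop.

```python
import time


def process(records):
    """Stand-in workload: count how often each key occurs."""
    counts = {}
    for key in records:
        counts[key] = counts.get(key, 0) + 1
    return counts


def time_by_volume(sizes):
    """Return (size, seconds) pairs for processing synthetic inputs of each size."""
    timings = []
    for size in sizes:
        records = [i % 100 for i in range(size)]  # synthetic data, 100 distinct keys
        start = time.perf_counter()
        process(records)
        elapsed = time.perf_counter() - start
        timings.append((size, elapsed))
    return timings


if __name__ == "__main__":
    for size, seconds in time_by_volume([10_000, 100_000, 1_000_000]):
        print(f"{size:>9} records: {seconds:.4f} s")
```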

If you were given a project now for big data processing, how would you approach it?

Saudamini: If time is not a concern and price is an issue, then I would recommend using Hortonworks Sandbox, as its flexibility in types of data source, its choice of data processing tools, and its Ambari environment give a well-rounded data management experience. However, if time is of the essence and money is not a factor, then it would be beneficial to look at other options that provide a similar user experience in the cloud.

Rafiat: I would use the Hortonworks Data Platform on a separate machine dedicated to the platform, as my own machine was not very high spec.

As a computer science student, do you think for data management we should always use tools like these?

Saudamini: If the dataset you are working with is large, then I think it is advisable to use big data tools like these. Their flexibility and quick processing make them ideal to be deployed as solutions to smart city issues. However, I am not convinced that we should always use them. We could actually try to avoid using these tools if the dataset doesn't demand it. A lot of the analytical functions can be done by other BI tools. Big data tools can have a steep learning curve, and training users should be factored in when deploying systems that utilize them.

Rafiat: Data management is a very important topic. There are different advantages to managing data effectively as a student, individual, or organization. These include preventing data duplication, which saves memory space, and allowing results to be validated if need be. Data management also supports a proper understanding of data and the use of queries to retrieve the specific information needed, so the data can be understood easily.

In conclusion, we got mixed results on the use of tools to process big data applications. An open Hadoop data platform seemed like the obvious choice at the time. As previously described, MapReduce is at the core of Hadoop. Hortonworks Sandbox is equipped with YARN, the second generation of MapReduce. It separates two important tasks, resource management and job scheduling, which makes the process more efficient. YARN supports batch as well as real-time processing projects. The Hortonworks Data Platform can adapt to the user’s existing data architecture, which is a huge plus. In addition to being cost-free, efficient, and adaptable, the platform has an extensive list of tutorials and user guides on the services it provides.

There are a lot of big data processing platforms available, since big data is the current buzzword. Most services, such as Amazon Web Services, Cloudera, and MapR, charge users according to the traffic and the amount of data they process. Cloudera’s website claims, "The company’s enterprise data hub (EDH) software platform empowers organizations to store, process and analyze all enterprise data, of whatever type, in any volume—creating remarkable cost-efficiencies as well as enabling business transformation."

The current move towards open data is generating massive amounts of data, which needs real-time processing and intelligent solutions to handle it. Having more open source tools can fuel further open data research, with an impact not only on computing but also on the social sciences, where economists and governments can make use of big data as well.


This article is part of the Back to School series focused on open source projects and tools for students of all levels.

A software engineer interested in various activities involving software design, development and delivery working with various platforms and tools.


This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.