Three lessons from a summer of data wrangling


Data is the new black.

And not without reason—it's keeping companies in the black. It helps them understand consumer interaction with their products, make tailored recommendations, improve their services, and optimize everything from supply chains to talent acquisition. Simply put, we can use data to identify problems and engineer solutions. Statistics enable businesses to create data-driven systems like Red Hat's Access Insights and many others.

As more and more people go online, an increasing number of businesses are utilizing cloud computing, mobility, social networking, and big data to become smarter, more immediate, and more relevant. The digital universe is doubling in size every two years, and by 2020 will contain 44 zettabytes—nearly as many digital bits as there are stars in the universe. So the ability to take raw data, process it, and extract insight from it is becoming an extremely valuable skill. Red Hat is pioneering several new forays into the world of data science, and I've had the chance to use big data technology and work on exciting projects ranging from customer engagement analysis to cross-sell prediction. Here are a few things I've learned as an intern at Red Hat.

Clean your data

As a data scientist, you can expect to spend a large share of your time, by a commonly cited estimate as much as 80 percent, on "data wrangling": extracting, cleaning, aggregating, and merging your data. At some point, a slight change in the scope of your project may require you to repeat the entire process. Although this can be frustrating, it's best to be meticulous here. Clean data will make your life much simpler down the line, and it's a crucial step before you can get to the good stuff.
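To make the wrangling steps concrete, here is a minimal sketch in plain Python that cleans, aggregates, and merges a few records. The field names (`account`, `page_views`, `region`) and the sample values are invented for illustration and are not taken from any real Red Hat data set.

```python
from collections import defaultdict

# Hypothetical raw page-view records; field names are illustrative only.
raw_visits = [
    {"account": " A100 ", "page_views": "12"},
    {"account": "a100", "page_views": "3"},
    {"account": "B200", "page_views": ""},   # missing value, dropped below
    {"account": "B200", "page_views": "7"},
]

# Hypothetical account attributes to merge with the aggregated visits.
accounts = {"A100": {"region": "EMEA"}, "B200": {"region": "APAC"}}


def clean(record):
    """Normalize the account ID and parse the count; return None if unusable."""
    account = record["account"].strip().upper()
    views = record["page_views"].strip()
    if not account or not views.isdigit():
        return None
    return {"account": account, "page_views": int(views)}


def wrangle(visits, account_table):
    """Clean each record, aggregate per account, then merge with attributes."""
    totals = defaultdict(int)
    for record in visits:
        cleaned = clean(record)
        if cleaned is not None:
            totals[cleaned["account"]] += cleaned["page_views"]
    # Inner join: keep only accounts present in both sources.
    return {
        account: {"page_views": views, **account_table[account]}
        for account, views in totals.items()
        if account in account_table
    }


merged = wrangle(raw_visits, accounts)
```

In practice a library such as pandas would handle most of this, but the shape of the work is the same: normalize, drop or repair bad rows, aggregate, and join.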

Understand the problem

Take time to understand the nature of the underlying problem you're trying to solve before jumping into an algorithmic implementation. What are you optimizing? What characteristics would a good solution have? This prevents you from zeroing in on one solution when better options may be available.

In my cross-sell project, for example, I'm focusing on predicting which customers will buy an emerging product using their online behavior and previous assets. I quickly discovered that only 2 percent of customer accounts actually bought such a product. This means that if I had simply predicted that no customers would buy such a product, I'd get an outstanding 98 percent accuracy! However, such a solution gives us no information whatsoever.
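The arithmetic behind that 98 percent figure can be reproduced with a small sketch. The 1,000 accounts below are made up for illustration; only the 2 percent buyer rate comes from the project described above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Hypothetical population: 1,000 accounts, 2 percent of which are buyers (label 1).
y_true = [1] * 20 + [0] * 980

# A "model" that predicts that nobody buys.
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
buyers_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {baseline_accuracy:.2f}, buyers found: {buyers_found}")
```

The accuracy is high only because the majority class dominates the count; the model finds none of the 20 buyers.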

Understanding that 2 percent was the purpose behind this project, so I needed to reframe the question to account for the inherent imbalance in the data. If I examined all the customers a model predicted to be emerging buyers, I'd want as many as possible to actually become emerging buyers. In other words, I wanted to measure precision: the number of correctly identified emerging purchasers divided by the total number of customers the model identified as emerging purchasers. So an optimal solution would give me high precision while still detecting a large percentage of the actual emerging product customers (that is, high recall).
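As a rough sketch of these two measures: precision is the share of flagged customers who really are buyers, and recall is the share of real buyers who were flagged. The predictions below are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means 'buyer'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical population: 20 buyers followed by 980 non-buyers.
y_true = [1] * 20 + [0] * 980

# Hypothetical model output: flags 15 real buyers and 10 non-buyers.
y_pred = [1] * 15 + [0] * 5 + [1] * 10 + [0] * 970

precision, recall = precision_recall(y_true, y_pred)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

Here 15 of the 25 flagged accounts are real buyers (precision 0.6), and 15 of the 20 real buyers were flagged (recall 0.75). The all-negative baseline would score zero on both.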

This example illustrates a classic lesson in machine learning, but the general point applies to any statistical problem. Take the case of predicting a numerical value. What's the best way to handle the noise in my data? Are my features linearly related to the response, or should I use something like a generalized additive model to describe more complex relationships? Is a more computationally intensive model like boosting necessary? Do I need to worry about extraneous features? If so, how should I regularize my model to penalize unnecessary complexity? No matter the question you're answering, defining the parameters of the problem and knowing the strengths of each algorithm helps immensely when you're deciding how to improve your model.
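To illustrate what regularization does, here is a minimal sketch of ridge (L2) regression with one centered feature and no intercept. Ridge is just one possible choice of penalty, picked here as an assumption for illustration; the data points and the penalty value are invented.

```python
def ridge_slope(xs, ys, penalty):
    """Closed-form slope for one centered feature with an L2 penalty.

    Minimizes sum((y - b*x)^2) + penalty * b^2, which gives
    b = sum(x*y) / (sum(x*x) + penalty). A penalty of 0 is ordinary least squares.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Invented, already-centered data that roughly follows y = 2x plus noise.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

ols_slope = ridge_slope(xs, ys, penalty=0.0)
shrunk_slope = ridge_slope(xs, ys, penalty=10.0)
print(f"least squares: {ols_slope:.2f}, ridge: {shrunk_slope:.2f}")
```

The penalty pulls the coefficient toward zero. With many features, the same mechanism damps the influence of extraneous ones at the cost of some bias.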

Find the right tools

Use the tools and algorithms that best fit the problem. Don't limit yourself to the ones with which you're most familiar. And don't be afraid to add new skills to your toolbox. In fact, you should go out of your way to do so. Luckily, Red Hat is a great environment full of people willing to help, where I've been able to dive headfirst into tools ranging from Linux to Pig and from Python to Spark.

In addition to learning new technologies, I tried several methods to deal with the imbalanced cross-sell data I mentioned above, including balanced random forests and oversampling (generating synthetic observations derived from the 2 percent minority class). Next, I will be working on using a long short-term memory recurrent neural network to consider each customer's entire history. It’s all about learning!
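The idea of generating synthetic minority observations can be sketched as follows. This is a simplified, SMOTE-style interpolation between randomly chosen pairs of minority points. It is not necessarily the method used in the project, and unlike full SMOTE it does not restrict pairs to nearest neighbors. The feature vectors are invented.

```python
import random


def oversample_minority(minority, n_new, seed=0):
    """Create n_new synthetic points, each on the segment between two
    distinct, randomly chosen minority-class points (needs at least two)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical minority-class feature vectors (two numeric features each).
minority = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]

new_points = oversample_minority(minority, n_new=5)
balanced_minority = minority + new_points
```

Synthetic points should be generated from the training split only, so that no information from the held-out data leaks into the model.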

Despite all the hype, data science isn't as glamorous as it's often made out to be. Even retrieving data can be a hassle, and the work requires so much data cleaning you begin to wonder whether "data janitor" would be a more accurate job title. You'll have to dig deep to interpret your results and unearth insight. But after all is said and done, finding value within the data and sharing that narrative with others will have an immense impact. Remember: your goal is to reveal the story behind the data. Your variables are the characters, and your role lies in capturing the complex interactions between them.

Happy data science-ing!


This article is part of the series of Red Hat Intern Stories. These interns share their experiences about what it’s been like to work for an open organization, and more.

Sharon Xu likes telling stories with data. In the past, she worked at Red Hat as a data scientist intern, mining customer subscription and online behavioral data to build predictive models that help Red Hat engage their customers. She is now a graduate student at MIT.


This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.