How to become a data scientist

Data scientists are in high demand. This guide tells you what you need to know for a career in data science—and how to learn it.

Once upon a time, I wanted to be an evolutionary biologist. To make a long story short, I had a change of heart and dropped out of my PhD program to pursue a career in computer science. I'm now a senior software engineer at Red Hat, where I work on a variety of machine learning and data science projects (you can read more about my journey on my blog). Not long after joining Red Hat, many people—including three different University of Chicago grad students—asked me about transitioning to a career in data science, so I started looking into it.

The awesome thing about jumping into data science now is that everything (from the software, to the learning materials, to the discussion) is extremely open, so there's never been a better time to be an autodidact. In case it helps others considering a career in data science, here's what I've learned about making the leap.

Open discussion

As a warmup, I recommend the following links for background information on data science:

8 Skills You Need to Be a Data Scientist
What's the difference between a data architect, data analyst, data engineer, and data scientist? ("data analyst" will probably be less exciting than "data scientist" for people with a scientific background)
Advice from a data scientist at Quora
/r/MachineLearning is a great subreddit for keeping up-to-date with the latest happenings and research in the machine learning world
Other good subreddits to check out include /r/Statistics and /r/DataIsBeautiful (a data visualization subreddit)

In general, members of the data science community are quite open about sharing their diverse experiences and backgrounds, which can be super helpful when you're choosing what particular flavor of data science to pursue.

Open experience

If you're serious about pursuing a career in data science, getting experience is more important than anything else. I know this advice rings true for many other fields, but data science requires such a high level of mathematical and statistical maturity that, without relevant work experience, it can be difficult to signal to potential employers that you know how to apply these sophisticated techniques effectively.

If you're a student, your top priority should be landing an internship. It will make the eventual full-time job search much easier. Unfortunately, internships are also the least "open" aspect of the data scientist pursuit because they're usually only available to students. However, there are plenty of other open opportunities for gaining experience. For example, you can try out open competitions, like those on Kaggle.

There's also open source software development. Contributing to open source projects and/or putting your personal projects on GitHub (here's mine) is a great way of demonstrating your data science expertise. You can also consider pro bono ("open heart?") work. Have a favorite local restaurant? Ask its management if they'd be interested in a free data science consultation. (I know someone who actually did this!)

Finally, be sure to create a LinkedIn account and keep it updated (here's mine). LinkedIn has become an extremely valuable tool for recruiters, so it's important to be discoverable there.

Open education

Next, my favorite part: open education. Over the past few years, there has been an exciting trend toward massive open online courses (MOOCs), which are full courses (including homework and exams) offered by top institutions and firms (e.g., Stanford, Harvard, Google) on a wide variety of topics. Many companies and websites offer MOOCs; some of my favorites are Coursera, edX, Udacity, Saylor, and Khan Academy.

For guidance on which courses to take, I've put together a detailed data science curriculum and published my own full course history. Some subjects you'll definitely want to cover include:

Calculus, at least up to partial derivatives, which is typically Calculus III
Linear algebra
Statistics, including Bayesian and frequentist theory
Algorithms
Machine learning and its major algorithms; natural language processing is probably the most useful subfield to learn
Other topics include graph theory, game theory, and information theory

Open source software

Finally, the part most readers of Opensource.com will be familiar with: open source software. Open source software abounds in data science and, as with Linux, being free and open does not make it inferior to its proprietary counterparts. In fact, the open source solutions are typically the best in class.

Important open source software for data scientists to know includes:

Programming
- Almost all data scientist positions require cleansing and transforming data on a large scale, and Python is essential because it's typically the language of choice for this task. Important Python packages/libraries include: scikit-learn, NumPy, Keras, TensorFlow, Theano, SciPy, pandas, and StatsModels
- Know the R software for statistical computing
- Many data science tools have command line interfaces, so being comfortable with the *nix terminal can be a huge boon for productivity
- Understand the basics of Git specifically and version control in general
Databases
- The best way to learn databases is by working with them. Find a database and practice writing queries for it
- SQL knowledge is critical, and familiarity with NoSQL databases (e.g., MongoDB) is useful, too
Big Data tools
- Be familiar with Apache Hadoop, MapReduce, Apache Spark, Apache Pig, Apache Hive, Apache Mahout, Apache Solr, and Apache Lucene
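
As a minimal sketch of the "find a database and practice writing queries" advice above, the following uses Python's built-in sqlite3 module with an in-memory database. The table layout, the `revenue_by_dish` helper, and the sample restaurant orders are all made up for illustration; any small dataset you have on hand would work just as well.

```python
import sqlite3

# Hypothetical sample data: (customer, dish, price). All values are invented.
RAW_ORDERS = [
    ("alice", "ramen", 12.5),
    ("bob", "ramen", 12.5),
    ("alice", "gyoza", 6.0),
    ("carol", "ramen", 12.5),
    ("bob", "gyoza", 6.0),
    ("alice", "ramen", 12.5),
]


def revenue_by_dish(orders):
    """Load orders into an in-memory SQLite database and total revenue per dish."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, dish TEXT, price REAL)")
        # Parameterized inserts keep the data separate from the SQL text.
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
        rows = conn.execute(
            "SELECT dish, COUNT(*) AS n_orders, SUM(price) AS revenue "
            "FROM orders GROUP BY dish ORDER BY revenue DESC"
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    for dish, n_orders, revenue in revenue_by_dish(RAW_ORDERS):
        print(f"{dish}: {n_orders} orders, {revenue:.2f} revenue")
```

Because SQLite ships with Python, this runs without installing a database server, which makes it a convenient place to practice `GROUP BY`, joins, and filtering before moving on to larger systems.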

Get started

These guidelines should get you off on the right foot in your pursuit of a career in data science. If you know about any other helpful data science resources, be sure to share them in the comments.