10 articles to become more data science savvy

Boost your data science game in 2020 with Opensource.com's top 10 most-read articles on the topic from 2019.

When LinkedIn released its third annual Emerging Jobs report, engineers everywhere said, "Amen." More than half the list consists of engineering roles, with new fields like robotics appearing for the first time.

But data science had a strong showing as well. The role shows 37% annual growth, topping that aspect of the Emerging Jobs list for the third year in a row.

Looking at the core skills a data scientist needs—including R, Python, and Apache Spark—it's easy to find overlaps with open source. So, we're not surprised that data science was one of the most popular topics at Opensource.com in 2019.

We saw a need for knowledge about diverse data science topics. And our community of authors delivered answers.

For your reading pleasure, we've listed the top 10 data science articles of 2019. By "top," we mean the data science articles published in 2019 that earned the most page views. The list starts with the most popular.

Whether you want to use Kubernetes for batch jobs or query 10 years' worth of GitHub data, these articles will boost your data science game in 2020.

Why data scientists love Kubernetes

Kubernetes is having more than a moment. That's due in no small part to its versatility. You might already know that Kubernetes helps software developers and system operators deploy applications in Linux containers. But did you know how helpful it can be for data science as well?

In Why data scientists love Kubernetes, our most popular data science article in 2019, William Benton and Sophie Watson explain how Kubernetes supports the data science workflow. From repeatable batch jobs to debugging ML models, the article covers several ways for data scientists to put Kubernetes to work.

How to use Spark SQL: A hands-on tutorial

Wondering how to use a cloud service for big data analytics? How to use Spark SQL: A hands-on tutorial relies on Spark DataFrames to show how to use relational databases at scale. DJ Sarkar walks readers through Spark SQL step by step with a real-world dataset.

Rich with screenshots and code, Sarkar's tutorial is the ideal sequel to his first piece on this subject. He shares several ways that you can use Spark to manage structured data obtained from flat files or databases.

9 resources for data science projects

The growth of data science in open source—from machine learning to neural networks—has left many engineers wanting to learn more. In 9 resources for data science projects, Dan Barker shares the books, tools, and online courses he thinks are a must for any engineer who wants to get started.

Barker is especially keen on Cathy O'Neil's book Weapons of Math Destruction, which shares how bias creeps into data and how you can stop it. He also shares a range of websites for newbies to explore.

Getting started with data science using Python

Alongside the rise of data science techniques, Python's popularity has soared. It's now one of the most popular programming languages. When used with libraries like pandas and Seaborn, Python is an ideal entry point into data science.

In Getting started with data science using Python, a follow-up to his intro to Python article, Seth Kenlon shares how to create a Python virtual environment; install pandas and NumPy; create a sample dataset; and much more. This article is an especially good read if you want to learn more about data visualization.

How to analyze log data with Python and Apache Spark

Like many articles in our top 10 list, How to analyze log data with Python and Apache Spark is a sequel to an earlier article on using Python and Apache Spark to wrangle data. Once you've learned how to put your data into a clean, structured format, DJ Sarkar offers this piece to help you analyze that data.

Whether you want to see the top 10 error endpoints or content size statistics, Sarkar shows you how to analyze several types of log data in your DataFrame. The data that he uses isn't "big data" from a size or volume standpoint. But these techniques can scale for use with larger datasets.
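To give a feel for this kind of analysis, here is a minimal sketch in plain Python. It is a stand-in for illustration, not Sarkar's Spark code: the sample lines, the regular expression, the helper names, and the choice to treat any status of 400 or above as an error are all assumptions. It parses a few made-up lines in Common Log Format, ranks the endpoints that returned errors, and summarizes content sizes.

```python
import re
from collections import Counter
from statistics import mean

# Hypothetical sample lines in Common Log Format (not taken from a real dataset).
SAMPLE_LOGS = [
    'host1.example.com - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1024',
    'host2.example.com - - [01/Aug/1995:00:00:05 -0400] "GET /missing.gif HTTP/1.0" 404 -',
    'host1.example.com - - [01/Aug/1995:00:00:09 -0400] "GET /missing.gif HTTP/1.0" 404 -',
    'host3.example.com - - [01/Aug/1995:00:00:12 -0400] "GET /old-page.html HTTP/1.0" 404 -',
    'host2.example.com - - [01/Aug/1995:00:00:20 -0400] "GET /images/logo.gif HTTP/1.0" 200 2048',
]

# Assumed pattern: host, two ignored fields, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<endpoint>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no content was returned; count it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def top_error_endpoints(records, limit=10):
    """Rank endpoints by how often they returned a status of 400 or above."""
    counts = Counter(r["endpoint"] for r in records if r["status"] >= 400)
    return counts.most_common(limit)


def content_size_stats(records):
    """Summarize response sizes in bytes."""
    sizes = [r["size"] for r in records]
    return {"min": min(sizes), "max": max(sizes), "mean": mean(sizes)}


records = [r for r in map(parse_line, SAMPLE_LOGS) if r is not None]
errors = top_error_endpoints(records)
stats = content_size_stats(records)
print(errors)
print(stats)
```

The same two questions (which endpoints fail most often, and how large the responses are) map naturally onto grouped counts and aggregate functions in a DataFrame, which is the kind of approach that scales to larger datasets.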

How to wrangle log data with Python and Apache Spark

How to wrangle log data with Python and Apache Spark, DJ Sarkar's prequel to his piece on analyzing log data, also made our top 10 list. That's no surprise, since most organizations use a range of systems and infrastructure that run constantly. Data logs are an ideal way to make sure that everything keeps working effectively.

In this tutorial, Sarkar shows how to use Apache Spark on real-world production logs from NASA. He walks through the process of using Spark to do log analytics at scale on semi-structured log data. This ranges from setting up dependencies to data wrangling.

Querying 10 years of GitHub data with GHTorrent and Libraries.io

Did you know that you can use Kibana or the Elasticsearch API to turn Amazon S3 object-storage data into a searchable Elasticsearch-type cluster? Likewise, did you know about the project that aims to build an offline version of all data available through GitHub APIs?

In Querying 10 years of GitHub data with GHTorrent and Libraries.io, Pete Cheslock explores how to access and query GHTorrent data. The data is available in several formats, including CSV and Google BigQuery. Cheslock uses the latter to search indexed GHTorrent data and learn which programming languages and licenses are most popular among GitHub projects, and how quickly they are growing.

Predicting NFL play outcomes with Python and data science

Want to increase your machine learning skills in Python? With the NFL playoff season upon us, it's a great time to read Predicting NFL play outcomes with Python and data science, which shares some data science tips to predict plays.

Christa Hayes shows how to spot weird values, predict downs and play types, make regression plots, and train models. Once you've read her article on how to format data for training, this one is the ideal next step.

Analyzing the Stack Overflow Survey with Python and Pandas

Stack Overflow's annual developer survey is a tech behemoth. Nearly 90,000 developers took this year's 20-minute survey and left a lot of data in their wake.

To find certain results, Moshe Zadka used the pandas library to search the survey's anonymized results. If you want to filter Stack Overflow's dataset for certain details (like seeing how many developers use certain languages or contribute to open source projects), Moshe's Analyzing the Stack Overflow Survey with Python and Pandas tutorial shows you how.
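As a rough illustration of this kind of filtering, here is a minimal pandas sketch. It is not Zadka's code: the tiny DataFrame stands in for the real survey file, and the column names (Respondent, OpenSourcer, LanguageWorkedWith), the semicolon-delimited language field, and the uses_language helper are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-in for the survey results; rows and column names are illustrative.
survey = pd.DataFrame(
    {
        "Respondent": [1, 2, 3, 4],
        "OpenSourcer": [
            "Never",
            "Once a month or more often",
            "Never",
            "Less than once per year",
        ],
        "LanguageWorkedWith": ["Python;SQL", "Go;Python", "JavaScript", None],
    }
)


def uses_language(frame, language):
    """Return the rows whose semicolon-delimited language list includes `language`."""
    # Missing answers become an empty string so they never match.
    languages = frame["LanguageWorkedWith"].fillna("").str.split(";")
    return frame[languages.apply(lambda items: language in items)]


# Developers who listed Python among the languages they work with.
python_users = uses_language(survey, "Python")

# Developers who reported contributing to open source at least occasionally.
contributors = survey[survey["OpenSourcer"] != "Never"]

print(len(python_users), "of", len(survey), "respondents use Python")
print(len(contributors), "of", len(survey), "respondents contribute to open source")
```

Splitting the field before testing membership avoids false matches that a plain substring search would produce, such as "Java" matching "JavaScript".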

4 Python tools for getting started with astronomy

For readers with their heads in the clouds, NumFOCUS republished some of its blog posts on Opensource.com this year. In 4 Python tools for getting started with astronomy, Dr. Gina Helfrich shares how you can get involved in astronomy.

Intimidated? Don't be: Dr. Helfrich says Python packages are now so advanced that building data-reduction scripts is easier than ever before. If you want to play with astronomy imaging datasets, this piece will steer you in the right direction.

What do you want to know about data science?

Data science is an exciting field with countless things to explore. If there's something you want to know about data science, please tell us about it in the comments so we can try to cover it in 2020. Or, if you are so inclined, please share your knowledge with Opensource.com readers by submitting an article about your favorite data science topic.