Analyzing the Stack Overflow Survey with Python and Pandas

Do your own data science exploration and analysis on the annual developer survey's dataset.

Munch on open data with Python and Pandas

The Stack Overflow Survey Results for 2019 are in! The dataset is quite large; according to the description:

"Stack Overflow's annual Developer Survey is the largest and most comprehensive survey of people who code around the world. Each year, we field a survey covering everything from developers' favorite technologies to their job preferences. This year marks the ninth year we've published our annual Developer Survey results, and nearly 90,000 developers took the 20-minute survey earlier this year."

Some of Stack Overflow's analysis interests me, and some does not. Instead of scrolling through the website, I decided to dig into the data, which is available under the Open Database License (ODbL), and see what I can learn!

I'm using the popular Pandas library, which is a "BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools," according to the library's About page.

Nearly every tutorial reduces the amount of typing needed to use Pandas by importing it under a short alias and giving the data frame a short variable name, like so:

import pandas as pd

# Create a two-dimensional data-structure named df
df = pd.DataFrame([1,1])

In flagrant violation of every rule of data science, I will not be aliasing Pandas to pd, nor will I be aliasing my data frames to df. Science marches forward by taking such bold leaps of imagination.

Configuring Pandas for analysis

I'm going to explore this data interactively using IPython; installation instructions are in the IPython documentation. You can also follow along by opening the Python interpreter from the command line with python, starting a Jupyter Notebook, or using JupyterLab.

We will start simply by importing the needed library:

In [1]:  import pandas

Next, download the comma-separated value (CSV) file of results, available on Google Drive, to a local directory. After downloading and unzipping the data, take advantage of Pandas' native ability to read CSV.

In [2]:  data = pandas.read_csv("survey_results_public.csv")
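If load time or memory is a concern, one option is to read only the columns the analysis needs. The following is a minimal sketch using the usecols parameter of read_csv. The inline CSV text is a made-up stand-in for survey_results_public.csv, and its rows are not real survey data.

```python
import io
import pandas

# Made-up stand-in for survey_results_public.csv (illustrative rows only)
toy_csv = io.StringIO(
    "Respondent,Hobbyist,LanguageWorkedWith,OpenSourcer\n"
    "1,Yes,Python;SQL,Never\n"
    "2,No,C++,Once a month or more often\n"
)

# usecols tells read_csv to parse only the listed columns
wanted = ["Respondent", "LanguageWorkedWith", "OpenSourcer"]
subset = pandas.read_csv(toy_csv, usecols=wanted)

print(subset.shape)
print(list(subset.columns))
```

With the real file, the same call would take the file name in place of toy_csv. The rest of this article loads every column, so this step is optional.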

Now it's time to ask questions of the data.

Querying the number of respondents

The first interesting thing is to check the number of respondents to the survey. The easiest way to get that data is from the shape of the data frame. The first element will be the number of rows, or respondents, while the second one is the number of columns.

In  [3]:  data.shape
Out [3]: (88883, 85)

That's impressive: 88,883 individuals (represented as rows) provided 85 responses to questions (represented as columns).
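For readers new to Pandas, here is a small sketch of how shape relates to rows and columns. The frame below is made up for illustration and has nothing to do with the real survey values.

```python
import pandas

# Tiny made-up frame: 3 respondents (rows) and 2 columns
toy = pandas.DataFrame(
    {"Respondent": [1, 2, 3], "Hobbyist": ["Yes", "No", "Yes"]}
)

rows, columns = toy.shape   # shape is a (rows, columns) tuple
print(rows, columns)
print(len(toy))             # len() of a data frame is its row count
print(list(toy.columns))    # column names, handy for finding what to query
```

Listing the column names this way is one quick route to discovering which fields are available to filter on.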

Filtering for Pythonistas

As a Python programmer, I wonder what my peers are doing. I'll filter the people who have worked with Python. The exact way to do so is not so intuitive, but after inspecting the data source, I found the LanguageWorkedWith column to be something I can filter for Python developers:

In [4]:  pythonistas = data[data.LanguageWorkedWith.str.contains("Python", na=False)]
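To make the filter less mysterious, here is a sketch on a tiny made-up frame. It assumes, as in the survey file, that each answer is a semicolon-separated string and that some answers are missing. It also shows a caveat of substring matching: the pattern Java would match JavaScript too, so the last mask uses an anchored regular expression as one possible stricter alternative.

```python
import pandas

# Made-up answers; semicolon-separated, with one missing response
toy = pandas.DataFrame(
    {"LanguageWorkedWith": ["Python;JavaScript", "Java;SQL", None, "C;Python"]}
)
langs = toy.LanguageWorkedWith

# na=False counts a missing answer as "no match"
python_mask = langs.str.contains("Python", na=False)

# Substring matching is loose: "Java" also matches "JavaScript"
java_loose = langs.str.contains("Java", na=False)

# Stricter: require "Java" to be a whole item between separators
java_exact = langs.str.contains(r"(?:^|;)Java(?:;|$)", na=False)

print(python_mask.tolist())
print(java_loose.tolist())
print(java_exact.tolist())
```

For Python the plain substring match is probably fine, since no other language name in this toy list contains it, but the anchored form is worth knowing for names like Java or C.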

Now I can ask Python-specific questions, like: What percentage of responses are Pythonistas?

I can take the row count of each data frame from shape, and use f-string syntax to format the ratio to two significant digits:

In  [5]:  f"{pythonistas.shape[0] / data.shape[0]:.2}"
Out [5]: '0.41'

Wow. 41% of the people who responded to the survey use Python. How many people is that?

In  [6]:  pythonistas.Respondent.count()
Out [6]: 36443
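A note on the format spec used above: .2 means two significant digits, not two decimal places. The difference only shows up with smaller proportions, so this sketch uses a hypothetical value of 0.0523 to compare it with two other common specs.

```python
share = 0.0523  # hypothetical small proportion, chosen to show the difference

two_sig = f"{share:.2}"    # two significant digits
two_dec = f"{share:.2f}"   # two digits after the decimal point
percent = f"{share:.1%}"   # multiplied by 100, one decimal, with a percent sign

print(two_sig)   # 0.052
print(two_dec)   # 0.05
print(percent)   # 5.2%
```

Which one to use is a matter of taste; the percent form reads most naturally when the result is going to be quoted as a percentage.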

Open source Python developers

Now, how many of those 36,443 people who use Python are involved in open source?

There is an OpenSourcer column that has the data I'm looking for (not OpenSource, which answers a different question).

In [7]:  open_source = pythonistas['OpenSourcer'].value_counts()

Then I can display the counts by evaluating the variable:

In  [8]:  open_source
Out [8]:
    Never                                                 11310
    Less than once per year                               10374
    Less than once a month but more than once per year     9572
    Once a month or more often                             5187
    Name: OpenSourcer, dtype: int64

What does that tell us about the percentage of open source Python contributors?

In  [9]:  f"{open_source['Once a month or more often'] / pythonistas.shape[0]:.2}"
Out [9]: '0.14'

Only 14% of those 36,443 Python users contribute to open source with any kind of regular cadence. That may seem like a small percentage. Or is it? Is that more or less than the general population when considering all programming languages?

In [10]: general_opensource = data['OpenSourcer'].value_counts()

Since I'm asking about the general population, I'll look at the percentage of all respondents:

In  [11]:  f"{general_opensource['Once a month or more often'] / data.shape[0]:.2}"
Out [11]: '0.12'

Python developers seem to contribute slightly more to open source than the general population of survey respondents.
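As an alternative to dividing counts by hand, value_counts can return proportions directly. This is a sketch on made-up answers, not the survey data. Note that by default the proportions are taken over non-missing answers only, so they can differ slightly from a division by shape[0] when a column has missing values.

```python
import pandas

# Made-up answers, including one missing response
answers = pandas.Series(
    ["Never", "Never", "Once a month or more often", None, "Never"],
    name="OpenSourcer",
)

# normalize=True returns proportions of the non-missing answers
shares = answers.value_counts(normalize=True)
print(shares["Never"])          # 3 of the 4 non-missing answers

# dropna=False keeps missing answers in the denominator
shares_all = answers.value_counts(normalize=True, dropna=False)
print(shares_all["Never"])      # 3 of all 5 rows
```

The second form matches the approach used in this article, where the denominator is every row in the data frame.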

DevOps and Python

As a DevOps engineer writing books for other Python DevOps engineers, I am naturally curious about how many there are.

I can search within the DevType column to find out:

In [12]:  devops = pythonistas[pythonistas.DevType.str.contains("DevOps specialist", na=False)]

To see how big that group is relative to the whole survey, I'll divide by the total number of respondents:

In  [13]:  f"{devops.shape[0] / data.shape[0]:.2}"
Out [13]: '0.052'

Around 5% of respondents use Python and have DevOps-related work responsibilities.

In  [14]:  devops.Respondent.count()
Out [14]: 4647

That's my target market. Not bad!
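The two-step filter (first Python users, then DevOps specialists among them) can also be written as one combined boolean mask on the full data frame. Here is a minimal sketch on a made-up frame; the rows and the extra role names are illustrative assumptions.

```python
import pandas

# Made-up rows; both columns hold semicolon-separated answers
toy = pandas.DataFrame(
    {
        "LanguageWorkedWith": ["Python;Bash", "Go", "Python", None],
        "DevType": [
            "DevOps specialist;Site reliability engineer",
            "DevOps specialist",
            "Data scientist",
            "DevOps specialist",
        ],
    }
)

uses_python = toy.LanguageWorkedWith.str.contains("Python", na=False)
is_devops = toy.DevType.str.contains("DevOps specialist", na=False)

# & combines the two masks element by element (logical AND)
both = toy[uses_python & is_devops]
share = len(both) / len(toy)

print(len(both), share)
```

Keeping the masks as separate named variables makes it easy to reuse them for other combinations, such as uses_python & ~is_devops.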

Python developers' experience

Most of my talks at conferences (e.g., Boring Object Orientation) are targeted at intermediate-level Python engineers. I'll treat respondents with one to five years of coding experience as intermediate, and I can map that constraint to the YearsCode column, which stores its values as strings:

In [15]:  intermediate = pythonistas[pythonistas.YearsCode.isin(set(map(str, range(1, 6))))]

Then I can take a percentage of all survey responses:

In  [16]:  f"{intermediate.shape[0] / data.shape[0]:.2}"
Out [16]: '0.11'

Even better, 11%. That means:

In  [17]:  intermediate.Respondent.count()
Out [17]: 10085

That is a lot of people.
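The isin filter above compares strings. Another way to express the same range, sketched here on made-up values, is to convert the column to numbers first. I'm assuming the column mixes numeric strings with text answers such as "Less than 1 year"; errors="coerce" turns anything non-numeric into a missing value, which then falls outside the range.

```python
import pandas

# Made-up answers: numeric strings, text answers, and one missing value
years = pandas.Series(
    ["3", "10", "Less than 1 year", "5", None, "More than 50 years"],
    name="YearsCode",
)

# String comparison, as in the article: matches "1" through "5"
as_strings = years.isin(set(map(str, range(1, 6))))

# Numeric comparison: non-numeric answers become NaN and never match
numeric = pandas.to_numeric(years, errors="coerce")
in_range = numeric.between(1, 5)   # inclusive on both ends

print(as_strings.tolist())
print(in_range.tolist())
```

Both masks select the same rows here. The numeric version is easier to adapt to other ranges, at the cost of silently dropping the text answers.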

Wrapping up

A lot of excellent research can be done with the Stack Overflow survey data and a little bit of Python. Pandas allows anyone to query such datasets easily and efficiently. There are no explicit Python loops anywhere in this analysis. What's incredible is that I can use these high-level libraries to explore data in simple ways while the low-level work behind each query runs in optimized, compiled code inside the libraries, and I get to reap the benefits!

Did you find anything exciting or interesting in the Stack Overflow dataset? Share in the comments!