Big data is one of many domains where open source shines. From open source alternatives to Google Analytics to new features in MySQL, 2020 brought several ways for open source enthusiasts to learn big data skills.
Get up to speed on how open source data science languages, libraries, and tools help us understand our world better by reviewing the top 10 data science articles published on Opensource.com last year.
The 7 most popular ways to plot data in Python
Once upon a time, Matplotlib was the lone way to make plots in Python. In recent years, Python's status as data science's de facto language changed that. We have a plethora of ways to plot data using Python today.
In this article, Shaun Taylor-Morgan walks through seven ways to plot data in Python. Don't worry if you're a Matplotlib user: It's covered, along with Seaborn, Plotly, and Bokeh. You'll find code and charts for each plotting library, plus some newcomers to the Python plotting field: Altair, Pygal, and pandas.
Transparent, open source alternative to Google Analytics
Many websites use Google Analytics to track their activity metrics. Its status as the de facto tool leaves some wondering whether open source options exist. In this overview of Plausible Analytics, Marko Saric proves they do.
If you want to compare Google Analytics against open source options, you will find Marko's article helpful. It's especially great if you're a website admin trying to comply with new data collection regulations, such as GDPR.
If you want to learn more about Plausible, you'll find links to Plausible's code and roadmap on GitHub in Marko's article.
5 MySQL features you need to know
After MySQL 8.0 came out in April 2018, its release cycle for new features changed to four times per year. Despite the more frequent releases, many users don't know about new MySQL features that could save them hours.
In this March 2020 article, Dave Stokes shares five features that were new to MySQL. They include dual passwords, new shells, and better SQL support. But keep in mind that these updates are now close to a year old: There's a lot more to discover in MySQL since then!
Using C and C++ for data science
Did you know that C and C++ are both strong options for data science projects? They're especially good choices to run data science programs on the command line.
In this article, Cristiano L. Fontana writes a program in C99 and C++11 that works with Anscombe's quartet dataset. The step-by-step instructions include reading data from a CSV file, interpolating data, and plotting results to an image file.
Using Python to visualize COVID-19 projections
The COVID-19 pandemic pushed an influx of data to the forefront. In this article, Anurag Gupta shows how to use Python to project COVID-19 cases and deaths across India.
Anurag walks through downloading and parsing data, selecting and plotting data for India, and creating an animated horizontal bar graph. If you're interested in the complete script, you'll find a link at the end of his article.
How I use Python to map the global spread of COVID-19
If you want to track the spread of COVID-19 globally, you can use Python, pandas, and Plotly to do it. In this article, Anurag Gupta explains how you can use them to clean and visualize raw data.
With screenshots to guide you, Anurag shows how to load data into a pandas DataFrame, clean and modify the DataFrame, and visualize the spread in Plotly. The complete code yields a gorgeous graph, and the article ends with a link to download and run it.
3 ways to use PostgreSQL commands
In this follow-up to his article on getting started with PostgreSQL, Greg Pittman shares how he uses PostgreSQL commands to keep his grocery shopping list updated.
Whether you want to do per-item entry or bring order to complex tables, Greg explains how to create the commands you need. He also shows how to output your lists once you're ready to print them.
No matter how long your shopping list is, PostgreSQL commands, especially the WHERE clause, can bring ease to your life beyond programming.
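To give a feel for what filtering a shopping list with WHERE looks like, here is a minimal sketch. It is not taken from Greg's article: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a database server, and the groceries table, its columns, and the items_to_buy helper are all hypothetical names chosen for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL here; the table and column
# names are hypothetical and not taken from the original article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE groceries (item TEXT, category TEXT, needed INTEGER)")
conn.executemany(
    "INSERT INTO groceries (item, category, needed) VALUES (?, ?, ?)",
    [
        ("apples", "produce", 1),
        ("milk", "dairy", 0),
        ("spinach", "produce", 1),
        ("butter", "dairy", 1),
    ],
)


def items_to_buy(connection, category):
    """Return the names of needed items in one category, sorted by name."""
    rows = connection.execute(
        "SELECT item FROM groceries "
        "WHERE needed = 1 AND category = ? "
        "ORDER BY item",
        (category,),
    )
    return [row[0] for row in rows]


produce = items_to_buy(conn, "produce")
print(produce)
```

The SELECT statement itself is standard SQL, so the same WHERE filter should carry over to PostgreSQL. The ? placeholder style is specific to sqlite3 and would differ in a PostgreSQL client library.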
Using Python and GNU Octave to plot data
Python is data science's language du jour, but how can you use it for specific tasks? In this article, Cristiano Fontana shares how to write a program in Python and GNU Octave.
Cristiano walks through each step to read data from a CSV file, interpolate the data with a straight line, and plot the result to an image file. From printing output and reading data to plotting the outcome, his step-by-step guidelines explain the whole process in both Python and GNU Octave.
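The first two of those steps, reading CSV data and fitting a straight line, can be sketched with nothing but Python's standard library. This is an illustrative sketch rather than the code from Cristiano's article: the sample data (the first set of Anscombe's quartet, embedded as text), the ordinary least squares fit, and the function names read_points and fit_line are all assumptions made here. Plotting to an image file is left out because it needs a third-party library.

```python
import csv
import io

# Assumption: sample data is the first set of Anscombe's quartet, embedded
# as CSV text so the sketch is self-contained (no file on disk required).
CSV_TEXT = """x,y
10,8.04
8,6.95
13,7.58
9,8.81
11,8.33
14,9.96
6,7.24
4,4.26
12,10.84
7,4.82
5,5.68
"""


def read_points(text):
    """Parse CSV text with 'x' and 'y' columns into a list of float pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["x"]), float(row["y"])) for row in reader]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(read_points(CSV_TEXT))
print(f"y = {slope:.3f} * x + {intercept:.3f}")
```

For this data set the fit comes out at a slope of roughly 0.5 and an intercept of roughly 3.0. To read from a real file, you could pass the text of the file to read_points in place of CSV_TEXT.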
It keeps your data in one place, avoids code repetition, and is fully customizable. If you want to try out the code in this article, Szymon links to it in CodeSandbox at the end.
How to process real-time data with Apache tools
We process so much data today that storing it for later analysis may soon be impossible. Teams that handle failure prediction and other context-sensitive data need to get this information in real time, before it hits a database. Luckily, you can do this with Apache tools.
In this article, Simon Crosby explains how Apache Spark—a unified analytics engine—can process large datasets in real time at scale. For instance, "Spark Streaming breaks data into mini-batches that are each independently analyzed by a Spark model or some other system," he writes.
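The mini-batch idea in that quote can be illustrated in plain Python. The sketch below does not use the Spark API at all: mini_batches, analyze, and the sample readings are hypothetical names and data invented here. Spark Streaming groups records by time interval, while this simplified version groups them by count.

```python
from itertools import islice
from statistics import mean


def mini_batches(stream, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def analyze(batch):
    """Stand-in for a per-batch model: summarize one batch on its own."""
    return {"count": len(batch), "mean": mean(batch), "max": max(batch)}


# Hypothetical sensor readings arriving as a stream.
readings = [3, 5, 4, 10, 12, 11, 7]

# Each batch is analyzed independently of the others.
results = [analyze(batch) for batch in mini_batches(readings, 3)]
for result in results:
    print(result)
```

Because mini_batches accepts any iterable, it also works with a generator that never ends, which is closer to a real stream than the fixed list used here.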
If Spark's not your thing, Simon presents other open source options. Flink, Beam, and Samza, along with Apache-licensed SwimOS and Hazelcast, are just a few of your choices.
What do you want to know?
What would you like to know about big data and data science? Please share your suggestions for article topics in the comments. And if you have something interesting to share about data science, please consider writing an article for Opensource.com.