Machine learning with Python: Essential hacks and tricks

Image by:

WOCinTech Chat. Modified by Opensource.com. CC BY-SA 4.0

It's never been easier to get started with machine learning. In addition to structured massive open online courses (MOOCs), there are a huge number of incredible, free resources available around the web. Here are a few that have helped me.

Start with some cool videos on YouTube. Read a couple of good books or articles, such as The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. And I guarantee you'll fall in love with this cool, interactive page about machine learning.

Learn to clearly differentiate between the buzzwords—for example, machine learning, artificial intelligence, deep learning, data science, computer vision, and robotics. Read or listen to talks by experts on each of them. Watch this amazing video by Brandon Rohrer, an influential data scientist. Or this video about the clear differences between various roles associated with data science.

Clearly set a goal for what you want to learn. Then go and take that Coursera course. Or take the one from the University of Washington, which is pretty good too.

Follow some good blogs: KDnuggets, Mark Meloon's blog about data science careers, Brandon Rohrer's blog, Open AI's research blog.

If you are enthusiastic about taking online courses, check out this article for guidance on choosing the right MOOC.

Most of all, develop a feel for it. Join some good social forums, but resist the temptation to latch onto sensationalized headlines and news. Do your own reading to understand what it is and what it is not, where it might go, and what possibilities it can open up. Then sit back and think about how you can apply machine learning or imbue data science principles into your daily work. Build a simple regression model to predict the cost of your next lunch or download your electricity usage data from your energy provider and do a simple time-series plot in Excel to discover some pattern of usage. And after you are thoroughly enamored with machine learning, you can watch this video.

Is Python a good language for machine learning/AI?

Familiarity and moderate expertise in at least one high-level programming language is useful for beginners in machine learning. Unless you are a Ph.D. researcher working on a purely theoretical proof of some complex algorithm, you are expected to mostly use the existing machine learning algorithms and apply them in solving novel problems. This requires you to put on a programming hat.

There's a lot of talk about the best language for data science. While the debate rages, grab a coffee and read this insightful FreeCodeCamp article to learn about data science languages. Or, check out this post on KDnuggets to dive directly into the Python vs. R debate.

For now, it's widely believed that Python helps developers be more productive from development to deployment and maintenance. Python's syntax is simpler and at a higher level when compared to Java, C, and C++. It has a vibrant community, open source culture, hundreds of high-quality libraries focused on machine learning, and a huge support base from big names in the industry (e.g., Google, Dropbox, Airbnb, etc.).

Fundamental Python libraries

Assuming you go with the widespread opinion that Python is the best language for machine learning, there are a few core Python packages and libraries you need to master.

NumPy

Short for Numerical Python, NumPy is the fundamental package required for high-performance scientific computing and data analysis in the Python ecosystem. It's the foundation on which nearly all of the higher-level tools, such as Pandas and scikit-learn, are built. TensorFlow uses NumPy arrays as the fundamental building blocks underpinning Tensor objects and graphflow for deep learning tasks. Many NumPy operations are implemented in C, making them super fast. For data science and modern machine learning tasks, this is an invaluable advantage.

Pandas

Pandas is the most popular library in the scientific Python ecosystem for doing general-purpose data analysis. Pandas is built upon a NumPy array, thereby preserving fast execution speed and offering many data engineering features, including:

Reading/writing many different data formats
Selecting subsets of data
Calculating across rows and down columns
Finding and filling missing data
Applying operations to independent groups within the data
Reshaping data into different forms
Combing multiple datasets together
Advanced time-series functionality
Visualization through Matplotlib and Seaborn

Matplotlib and Seaborn

Data visualization and storytelling with data are essential skills for every data scientist because it's crtitical to be able to communicate insights from analyses to any audience effectively. This is an equally critical part of your machine learning pipeline, as you often have to perform an exploratory analysis of a dataset before deciding to apply a particular machine learning algorithm.

Matplotlib is the most widely used 2D Python visualization library. It's equipped with a dazzling array of commands and interfaces for producing publication-quality graphics from your data. This amazingly detailed and rich article will help you get started with Matplotlib.

Seaborn is another great visualization library focused on statistical plotting. It provides an API (with flexible choices for plot style and color defaults) on top of Matplotlib, defines simple high-level functions for common statistical plot types, and integrates with functionality provided by Pandas. You can start with this great tutorial on Seaborn for beginners.

Scikit-learn

Scikit-learn is the most important general machine learning Python package to master. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. It provides a range of supervised and unsupervised learning algorithms via a consistent interface. The library has a level of robustness and support required for use in production systems. This means it has a deep focus on concerns such as ease of use, code quality, collaboration, documentation, and performance. Look at this gentle introduction to machine learning vocabulary used in the Scikit-learn universe or this article demonstrating a simple machine learning pipeline method using Scikit-learn.