Getting started with open source machine learning


Despite all the flashy headlines from Musk and Hawking on the impending doom to be visited on us mere mortals by killer robots from the skies, machine learning and artificial intelligence are here to stay. More importantly, machine learning (ML) is quickly becoming a critical skill for developers who want to enhance their applications and careers, better understand their data, and help users be more effective.

What is machine learning? It is the use of both historical and current data to make predictions, organize content, and learn patterns in data without being explicitly programmed to do so. This is typically done with statistical techniques that look for significant events in the data, such as co-occurrences and anomalies, and then factor their likelihood into a model. That model is queried later to provide a prediction for some new piece of data.
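To make that concrete, here is a minimal sketch in plain Python of a word-counting classifier in the Naive Bayes style. It counts how often words co-occur with a label in historical data, turns those counts into likelihoods, and queries the result for new text. The messages, the "spam" and "ham" labels, and the function names are all made up for illustration; none of this comes from the projects covered in this article.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count how often each word co-occurs with each label."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of examples
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-likelihood for the text."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    scores = {}
    for label, n_examples in label_counts.items():
        # Prior: how common the label was in the historical data.
        score = math.log(n_examples / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up historical data: (message, label) pairs.
history = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

model = train(history)
print(predict(model, "free prize money"))     # prints "spam"
print(predict(model, "monday team meeting"))  # prints "ham"
```

Nothing in the code says "free means spam." The rule comes out of the counts, which is what "without being explicitly programmed" means in practice.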

Common machine learning tasks include classification (applying labels to items), clustering (grouping items automatically), and topic detection. It is also commonly used in natural language processing. Machine learning is increasingly being used in a wide variety of use cases, including content recommendation, fraud detection, image analysis, and ecommerce. It is useful across many industries, and most popular programming languages have at least one open source library implementing common ML techniques.
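Clustering is the easiest of these tasks to see in a few lines. The sketch below is a bare-bones k-means in plain Python, assuming 2-D points and fixed starting centroids so the run is repeatable. The data and the `kmeans` function are illustrative only; real libraries add smarter initialization and many other refinements.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Group 2-D points around the nearest of the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                xs, ys = zip(*cluster)
                new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centroids.append(old)  # keep a centroid that lost all points
        if new_centroids == centroids:
            break  # nothing moved, so the grouping is stable
        centroids = new_centroids
    return centroids, clusters


# Made-up data: two obvious groups, one near (1, 1.5) and one near (8.5, 8.5).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
start = [points[0], points[3]]  # fixed starting centroids for repeatability

centroids, clusters = kmeans(points, start)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are supplied anywhere. The groups emerge from distances alone, which is the difference between clustering and classification.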

Reflecting the broader push in software towards open source, there are now many vibrant machine learning projects available to experiment with as well as a plethora of books, articles, tutorials, and videos to get you up to speed. Let's look at a few projects leading the way in open source machine learning and a few primers on related ML terminology and techniques.


Beyond the project home pages and documentation, there are several excellent sources available to teach the core concepts behind machine learning. While there are hundreds (even thousands) of books and tutorials on ML, I've tried to focus on those targeted towards programmers rather than those that are more rigorous or focused heavily on the math behind the scenes. While that material is important in the long run, it often keeps engineers in the getting-started phase from trying out real systems with real data.

  1. Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran is one of the best introductions to leveraging machine learning ideas for building web applications. Using examples in Python, Segaran lays out the concepts behind many common approaches for leveraging prior history for future benefit.
  2. Data Science from Scratch by Joel Grus. Another Python-based intro, Data Science walks you through core principles like linear algebra, statistics, and probability (but not too much!) before getting into the cornerstones of machine learning: regression, neural networks, and Naive Bayes.
  3. Andrew Ng's Coursera/Stanford University online class in machine learning. In many ways, Ng, first with his lectures on iTunes and now via Coursera, is the leading educator in machine learning. Be forewarned: this course requires commitment, but it is well worth the time for a solid understanding of the topic.
  4. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking by Foster Provost and Tom Fawcett. To quote the preface: "This is not a book about algorithms, nor is it a replacement for a book about algorithms. We deliberately avoided an algorithm-centered approach. We believe there is a relatively small set of fundamental concepts or principles that underlie techniques for extracting useful knowledge from data. These concepts serve as the foundation for many well-known algorithms of data mining."
  5. An Introduction to Machine Learning with Web Data by Hilary Mason. This video series by Mason and O'Reilly Media is an easy-to-understand, relatively short set of videos that introduces you to key topics in machine learning like clustering and classification.


While there are many great open source machine learning projects out there, the following projects combine strong technical capabilities with good documentation and accessible communities for asking questions and troubleshooting.


Weka

Weka, from the University of Waikato in New Zealand, has long set the standard for open source machine learning with a rich set of tools, lots of algorithms to try out, and user interfaces for exploring data and results. It also has an excellent accompanying book that explains a lot of the ML concepts while showing examples using Weka. While it isn't necessarily up on the latest craze of deep learning and the like, it is a solid project for getting started and understanding the concepts.


Apache Mahout

Near and dear to my own heart as a co-founder of the project, Apache Mahout has retooled itself in the past year to focus on Apache Spark as well as on overhauling the way one builds ML models while shipping implementations of commonly used ML algorithms. For those still using Hadoop MapReduce, Mahout continues to maintain implementations of key algorithms for classification, clustering, and recommendations using the MapReduce paradigm.

Spark's MLlib

Built from day one for Apache Spark, MLlib is focused on delivering commonly used machine learning algorithms for clustering and classification in a scalable manner. By leveraging Spark, MLlib is able to take advantage of large-scale cluster optimizations for processing big data, which can be especially important in machine learning, since many of the algorithms used are iterative in nature and data hungry.
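To see why "iterative" matters at scale, consider the plain-Python sketch below. It is not MLlib code and uses no Spark APIs. It fits a line with batch gradient descent on made-up data, and every pass of the loop reads the entire dataset again. The function name, learning rate, and number of passes are assumptions chosen for this toy example.

```python
def fit_line(xs, ys, learning_rate=0.05, passes=1000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(passes):
        # Each pass scans every data point to compute the gradient.
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        # Nudge the parameters a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

With five points, a thousand passes are instant. With billions of rows, each pass is a full scan, so where the data lives between passes starts to dominate the running time.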


scikit-learn

Building on other solid Python libraries like NumPy and SciPy, scikit-learn brings many of the algorithms and tools covered in the above Java/Scala libraries to the Python stack. Add in a nice set of tutorials, and you have a library poised to have you up and learning in no time.
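As a taste of what that looks like, here is a short sketch that trains a classifier on the iris dataset bundled with scikit-learn and checks it against held-out data. It assumes scikit-learn is installed. The choice of a k-nearest-neighbors model, the 25% test split, and the fixed random seed are mine, picked only to keep the example small and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labeled dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the data to check the model on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train on the historical examples, then score on the held-out ones.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Swapping in a different algorithm usually means changing only the import and the line that constructs the model, which makes this a convenient setup for trial and error.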


DL4J

Capitalizing on the latest buzzword within the buzzword-laden field of ML, Deep Learning for Java brings to open source a strong set of algorithms designed to do single-machine and distributed deep learning on Hadoop and Spark. It has a range of utilities for working with data and also has GPU (graphics processing unit) support.

What is deep learning? Increasingly used at places like Google, Facebook, and Amazon, deep learning is a new, large-scale approach to neural networks designed to significantly reduce the amount of human intervention needed to train and maintain models while also providing significantly better results. DL4J, as it is called, also has a book in the works (available for preorder) by Adam Gibson and Josh Patterson.

Bonus projects

As with any overview article, there simply isn't enough room to cover all the great projects in a space, so be sure to also check out H2O, Vowpal Wabbit, and PredictionIO, as well as the MLOSS archive of open source machine learning libraries.

Next steps

The real key to getting started in machine learning is to download some sample data and the code from one of the projects above. Be prepared for lots of trial and error as you explore the different approaches. You will quickly find that, despite all the hype about artificial intelligence, building these applications still requires a good dose of human intelligence to get good results.


This article is part of the Apache Quill column coordinated by Rikki Endsley. Share your success stories and open source updates within projects at Apache Software Foundation by submitting your story to

Grant is the CTO and co-founder of Lucidworks, co-author of “Taming Text” from Manning Publications, co-founder of Apache Mahout and a long-standing committer on the Apache Lucene and Solr open source projects. Grant’s experience includes engineering a variety of search, question answering and natural language processing applications for a variety of domains and languages. He earned his B.S.


This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.