Text: it's everywhere. It fills up our social feeds, clutters our inboxes, and commands our attention like nothing else. It is oh so familiar, and yet, to a programmer, it is oh so strange. We learn the basics of spoken and written language at a very young age and the more formal side of it in high school and college, yet most of us never get beyond very simple processing rules when it comes to how we handle text in our applications. And yet, by most accounts, unstructured content, which is almost always text or at least has a text component, makes up the vast majority of the data we encounter. Don't you think it's time you upgraded your skills to better handle text?
Thankfully, open source is chock-full of high-quality libraries that solve common problems in text processing, such as sentiment analysis, topic identification, automatic labeling of content, and more. More importantly, open source also provides many building-block libraries that make it easy for you to innovate without having to reinvent the wheel. If all of this is giving you flashbacks to your high school grammar classes, not to worry—we've included some useful resources at the end to brush up your knowledge as well as explain some of the key concepts around natural language processing (NLP). To begin your journey, check out these projects:
- Stanford CoreNLP Suite A GPL-licensed framework of tools for processing English, Chinese, and Spanish. It includes tools for tokenization (splitting text into words), part-of-speech tagging, grammar parsing (identifying things like noun and verb phrases), named entity recognition, and more. Once you've got the basics, be sure to check out the other projects from the same group at Stanford.
- Natural Language Toolkit If your language of choice is Python, then look no further than NLTK for many of your NLP needs. Similar to the Stanford library, it includes capabilities for tokenizing, parsing, and identifying named entities as well as many more features.
- Apache Lucene and Solr While not technically targeted at solving NLP problems, Lucene and Solr contain a wealth of tools for working with text, ranging from advanced string manipulation utilities to powerful and flexible tokenization libraries to blazing-fast libraries for working with finite state automata. On top of it all, you get a search engine for free!
- Apache OpenNLP Using a different underlying approach than Stanford's library, the OpenNLP project is an Apache-licensed suite of tools for tasks like tokenization, part-of-speech tagging, parsing, and named entity recognition. While its approach is no longer necessarily state of the art, it remains a solid choice that is easy to get up and running.
- GATE and Apache UIMA As your processing capabilities evolve, you may find yourself building complex NLP workflows that need to integrate several different processing steps. In these cases, you may want to work with a framework like GATE or UIMA that standardizes and abstracts much of the repetitive work that goes into building a complex NLP application.
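To make the core ideas behind these libraries concrete, here is a minimal, purely illustrative sketch in plain Python of two of the tasks mentioned above: tokenization and named entity recognition. The regex and the capitalization rule are toy stand-ins of my own; the real libraries use far more sophisticated (often statistical) models.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens: a toy stand-in
    for the tokenizers in CoreNLP, NLTK, OpenNLP, or Lucene."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def naive_ner(tokens):
    """Flag runs of capitalized alphabetic tokens as candidate named
    entities. Real NER uses trained sequence models, not this rule."""
    entities, current = [], []
    for tok in tokens:
        if tok[0].isupper() and tok.isalpha():
            current.append(tok)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tokens = tokenize("Apache OpenNLP was donated to the Apache Software Foundation.")
print(tokens)       # words plus the trailing "." as separate tokens
print(naive_ner(tokens))
```

Even this crude rule finds "Apache OpenNLP" and "Apache Software Foundation" in the sample sentence—and its failure modes (sentence-initial words, lowercase entity names) are exactly why the statistical approaches in the libraries above exist.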
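And since "finite state automata" may sound abstract, here is a tiny deterministic finite automaton (DFA) sketched as a transition table. This is purely for illustration—Lucene's automaton support is vastly more efficient and handles fuzzy and wildcard matching—but the underlying idea of stepping through states character by character is the same.

```python
def make_dfa(word):
    """Build a DFA that accepts exactly one word: a transition table
    mapping (state, char) -> next state, plus the accepting state."""
    table = {(i, ch): i + 1 for i, ch in enumerate(word)}
    return table, len(word)

def accepts(dfa, text):
    """Run the DFA over the input; reject on any missing transition."""
    table, final = dfa
    state = 0
    for ch in text:
        if (state, ch) not in table:
            return False
        state = table[(state, ch)]
    return state == final

dfa = make_dfa("solr")
print(accepts(dfa, "solr"))   # True
print(accepts(dfa, "solar"))  # False
```

Because every lookup is a single table hit per character, matching runs in time proportional to the input length—the property that makes automata so attractive for fast text matching at search-engine scale.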
If all of this talk of parsing, tokenization, and named entities has left you wondering how to get started, be sure to check out the following books:
- Written by Drew Farris, Tom Morton, and yours truly, Taming Text is aimed at programmers getting started in NLP and search. Each chapter explains the concepts behind functionality like search, named entity recognition, clustering, and classification, and shows working examples using well-known open source projects.
- Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper is the definitive guide to NLTK, walking users through tasks like classification, information extraction, and more.
- If academic rigor is what you are looking for, Christopher Manning and Hinrich Schütze's Foundations of Statistical Natural Language Processing is a great place to start. It not only explains the concepts behind many of the techniques of NLP, but also provides the math to back them up.
Once you've graduated to more advanced NLP tasks, you may also wish to check out projects like Apache cTAKES (aimed at medical NLP), Apache Mahout, and MALLET from UMass Amherst. If you are looking to try out new approaches using big data analysis and complex machine learning, be sure to check out the Deeplearning4j project.
With a little practice and creativity, combined with the power of open source and the projects above, your next application just might be at the forefront of truly making language processing as natural as handling all those zeros and ones!
This article is part of the Apache Quill column coordinated by Rikki Endsley. Share your success stories and open source updates within projects at Apache Software Foundation by submitting your story to Opensource.com.