Building a better matching solution with Mensa

neon sign with head outline and open source why spelled out

Image by:

Opensource.com

I've been designing and developing commercial software for more than 30 years, and I'm pleased to announce that for the first time, some of my software has been released as a new open source project. For open source at Dell, it's a Java project called Mensa.

Mensa is a Java class library. As such, the primary Mensa user is a programmer working in Java. However, I'm currently involved with a team that is using Mensa from C# (using IKVM.NET), but that's a subject for another time. Feel free to contact me if you're specifically interested in that.

Mensa in a nutshell

Mensa provides a robust and efficient solution to a class of pattern matching problems. Specifically, Mensa makes it easy to find any or all occurrences of any keywords within a source text. For example, finding all references to named characters (the keywords) in a novel (the source text).

Mensa is efficient in that it will quickly process source text in a single pass, regardless of whether it's searching for one keyword or a million keywords.

Mensa is generic in the sense that it is not limited to matching textual data. Instead, a source text is, by definition, an arbitrarily long sequence of symbols. The actual data type of these symbols can be anything—characters, bytes, numbers, hieroglyphs, musical notes, nucleotides, etc.

Project history

For a number of years, beginning within Quest Software and continuing within Dell software (after a company acquisition), the core Mensa contributors have been working on platform technologies for discovering, mapping, and integrating otherwise disconnected data across various enterprise information sources. One key ingredient in such solutions is the ability to accurately and efficiently find stuff.

Most recently, we have focused on delivering automatic digital asset classification technologies, such as those used by Dell One Identity Manager Data Governance Edition classification module.

We had been using licensed, third-party software for dictionary-based keyword searching, but for a variety of reasons we knew that component would eventually need to be replaced. About a year ago, we began looking for open source alternatives. However, no available open source solution had all the elements we were looking for: generics, flexibility, fuzzy matching, large dictionary efficiency, etc.

So, in early 2014, I set out to create a new 'Java Aho-Corasick Library' that would satisfy all of these requirements. Later, we changed the name to 'Mensa'.

Standard pattern matching in Java

The standard Java library includes support for finding specific patterns within character strings. There are two basic mechanisms to choose from:

String search finds an instance of a literal string value within another string. For example, you could search for the word 'Arbor' in the text 'Welcome to Ann Arbor'. The result would indicate that the word was found at position 15 (Java starts counting at zero).
Regular expression matching finds an instance of a string pattern within another string. For example, you could search for the pattern 'A...r' in 'Welcome to Ann Arbor.' The '.' in the pattern means 'any character', so this would also find a match at position 15.

In many situations, these built-in Java capabilities work just fine—but not always. Consider the following problem scenarios:

The standard library functions operate on character strings in memory. They are not well-suited to searching through very large texts that don't easily fit completely in memory.
The built-in regular expressions can recognize word boundaries but require more processing time than simple string literal matching, which cannot. For example, without regard to word boundaries, "boy" would match in "30-day boycott," which is often not what you'd want.
The standard library functions do not scale well to finding very large numbers of keywords. Your choices would be either:
- Perform a separate literal (or regular expression) search operation for each keyword
- Create a regular expression containing alternative clauses for each keyword.

These approaches suffer in terms of performance as the number of keywords increases. The standard library functions are limited to finding character keywords in character texts. There is no way to use built-in functions, for example, to find all sequences of "317, 206, 827, 1106" in an array of 10 million integers.

Each of these scenarios, respectively, is easily handled using Mensa:

- Mensa operates on an abstract text source. Implementations are provided for both in-memory and streaming text sources. Plus, it is easy to create new text source implementations for custom sources. For example, you could create a text source that reads from a data base, a version control system, a REST API, etc.
- Mensa has an option to enable or disable word-boundary checking. When enabled, keywords are recognized only when delimited by word boundary symbols. What constitutes a word boundary symbol is determined by an abstract symbol classifier. An implementation is provided for character symbols, but you can also create custom symbol classifiers. For example, a gene search application may define certain nucleotides as gene boundaries.
- Once a Mensa machine is constructed, its performance depends only on the length of the text being searched, not on the number of keywords. A given machine instance can be used any number of times, even by concurrent threads, to search multiple text sources.
- Mensa is implemented using Java generics. Thus, it can be used to match any type of symbol as defined by the Java template type S. As such, it is possible to create a machine to match bytes, characters, integers, gene sequences, bit sequences, etc.

Mensa use-case example

Suppose your company has an internal company web portal containing many thousands of web pages. You've been asked to write a program to determine how many times each employee is mentioned anywhere within that portal. You have access to an HR database that contains the full names and, in some cases, the nicknames of every employee in the company, of which there are roughly 25,000.

At a very high level, a Mensa-based solution to this problem looks like this:

- 1. Create a Mensa keywords collection.
  2. Read the full names and nicknames of the employees from the HR database, and add each to the keywords collection.
  3. Create a Mensa matching machine.
  4. Initialize it with the keyword collection.
  5. For each page in the web portal, create a Mensa text source based on the page, then run the matching machine against that source. The result is a list of matching keywords (and thus employee names or nicknames) for that page.

Mensa provides the ability to associate arbitrary data, called user data, with each keyword. This additional information is included with any match result. For this application, it would be useful to use this feature to store the employee's identity with each keyword.

This is especially important for nicknames—there could easily be more than one "Andy" in the company, for example. That way, each time the matching machine reports a match, you'll know not only the text of the match, but also the specific employee(s) being mentioned.

Learn more

If this has whetted your pattern matching appetite, hop over to the Mensa project site where you can browse the Mensa Wiki and take a look at the Mensa source code examples. Then, download the latest release and build something cool!