Easy data validation in Perl with Regexp::Common

Take some of the trickiness out of building regular expressions in Perl with the Regexp::Common module.


Building regular expressions in Perl can be a little bit tricky, particularly for the newcomer. It's a powerful technique, but even experienced Perl developers can sometimes find themselves checking the documentation to make sure they've got it right.

Another frequent problem is that we end up writing the same handful of expressions over and over; it seems like we're forever reinventing the wheel! But, for this problem at least, there is a useful answer.

I asked a fellow developer a while back for a list of the most useful modules, and this one really stood out, as I've had the problem that it aims to solve. Damian Conway's module Regexp::Common sets up a framework for having repeatable, useful regular expressions. Helpfully, it comes with a raft of routines already defined, and it provides tools for rolling up your own patterns for anything your application might need. Let's take a quick look.

Usage

Once you've loaded the module with a use Regexp::Common; statement, you can substitute the included patterns right where you would put your own expression, like so:

if ( $input =~ /$RE{num}{int}/ ){
    print 'yes, it is an integer!';
}
elsif ( $input =~ /$RE{quoted}/ ){
    print 'it is a quoted string!';
}
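One caveat for data validation: the patterns in %RE are not anchored, so the test above succeeds whenever the input merely contains an integer somewhere. A minimal sketch of whole-string validation follows; the helper name is_integer is made up for illustration and is not part of Regexp::Common:

```perl
use strict;
use warnings;
use Regexp::Common;

# Hypothetical helper: true only if the ENTIRE string is an integer.
# \A and \z pin the (unanchored) library pattern to the whole input.
sub is_integer {
    my ($input) = @_;
    return $input =~ /\A$RE{num}{int}\z/ ? 1 : 0;
}

for my $candidate ('42', '-7', '4 apples') {
    my $verdict = is_integer($candidate) ? 'integer' : 'not an integer';
    print "'$candidate': $verdict\n";
}
```

Without the anchors, '4 apples' would pass, because the pattern happily matches the leading 4.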

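If you prefer, you can use the subroutine-based interface instead. The RE_ subroutines are not exported by default, so load the module with use Regexp::Common 'RE_ALL'; to import them. The same logic from above would look like this: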

if ( $input =~ RE_num_int() ){
    print 'yes, it is an integer!';
}
elsif ( $input =~ RE_quoted() ){
    print 'it is a quoted string!';
}
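As a rough, self-contained sketch of the subroutine interface, the script below imports everything with 'RE_ALL' and stores each returned pattern in a variable so it can be anchored. The classify helper is a hypothetical name used only for this example:

```perl
use strict;
use warnings;
use Regexp::Common 'RE_ALL';    # exports RE_num_int(), RE_quoted(), and friends

# Hypothetical helper: classify a string using the subroutine interface.
sub classify {
    my ($input) = @_;

    # Fetch each pattern once, then interpolate it between anchors.
    my $int    = RE_num_int();
    my $quoted = RE_quoted();

    return 'integer'       if $input =~ /\A$int\z/;
    return 'quoted string' if $input =~ /\A$quoted\z/;
    return 'something else';
}

print classify($_), "\n" for '42', '"hello"', 'hello';
```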

Some of the built-in expressions have parameter settings to let you configure their behaviors, like searching for delimiters, formats of strings, and many other things. To use them, just include them in the call:

# Check for balanced parentheses
if ( $input =~ /$RE{balanced}{-parens=>'()'}/ )  {...}
# or using the subroutine interface:
if ( $input =~ RE_balanced(-parens=>'()') ) {...}
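To make the flag syntax concrete, here is a small sketch that checks whether a whole string is a single balanced group. It assumes the documented behavior of the balanced pattern (a string that starts with an opening parenthesis, contains properly nested groups, and ends with the matching close); is_balanced_group is a hypothetical helper name:

```perl
use strict;
use warnings;
use Regexp::Common;

# Hypothetical helper: true if the whole string is one balanced
# parenthesized group, such as "(a(b)c)".
sub is_balanced_group {
    my ($input) = @_;

    # Flags are passed as an extra subscript on the %RE entry.
    my $balanced = $RE{balanced}{-parens => '()'};

    return $input =~ /\A$balanced\z/ ? 1 : 0;
}

for my $candidate ('(a(b)c)', '(a(b)c') {
    my $verdict = is_balanced_group($candidate) ? 'balanced' : 'not balanced';
    print "$candidate: $verdict\n";
}
```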

One of the really nice patterns I spotted was the call to remove leading and/or trailing whitespace. In 15 or so years of writing Perl, I've seen a whole lot of messy ways to do this, but this, to me, is beautifully clean and elegant:

$input =~ s/$RE{ws}{crop}//g;
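If you would rather not modify the original variable, the same substitution can be wrapped in a small function. This is only a sketch, and trim is a hypothetical name rather than something the module provides:

```perl
use strict;
use warnings;
use Regexp::Common;

# Hypothetical helper: return a copy of the text with leading and
# trailing whitespace removed; interior whitespace is left alone.
sub trim {
    my ($text) = @_;
    $text =~ s/$RE{ws}{crop}//g;    # /g so both ends are cropped
    return $text;
}

print '[', trim("   hello world \t"), "]\n";    # prints [hello world]
```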

Numerous patterns have already been developed for Regexp::Common, including many sorts of URLs, common string formatting issues, credit card numbers, numbers, whitespace, zip codes, U.S. social security numbers, palindromes, and even profanity! I looked at that last one's source code, and I'm stumped; Damian's regular expression-fu is much stronger than mine, and this isn't just a simple list-matching tool. You can see the full list of included modules on the Regexp::Common release page on MetaCPAN.
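As one illustration of the bundled patterns, here is a hedged sketch that uses the URI pattern set to check whether a string looks like an HTTP URL. The helper name looks_like_http_url is invented for this example; as I understand the module's documentation, the pattern matches only the http scheme unless you pass a -scheme flag, so check the docs before relying on it for https:

```perl
use strict;
use warnings;
use Regexp::Common;

# Hypothetical helper: true if the whole string is an HTTP URL
# according to the bundled $RE{URI}{HTTP} pattern.
sub looks_like_http_url {
    my ($input) = @_;
    return $input =~ /\A$RE{URI}{HTTP}\z/ ? 1 : 0;
}

for my $candidate ('http://www.example.com/index.html', 'not a url') {
    my $verdict = looks_like_http_url($candidate) ? 'URL' : 'not a URL';
    print "$candidate: $verdict\n";
}
```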

Creating your own

If you'd like to create your own entries in the %RE hash, import the pattern subroutine in your use statement. Here's an example adapted from the documentation:

use Regexp::Common 'pattern';

pattern name   => ['name', 'mine'],
        create => '(?i:Ruthie)',
        #the 'i' makes it case-insensitive!
        ;

my $input = 'Ruthie, I really need you to finish this article!';
if ($input =~ /$RE{name}{mine}/) {
    print "You got mentioned!\n";
}
$input = 'I can even, ruthie, include it mid-sentence.';
if ($input =~ /$RE{name}{mine}/) {
    print "You got mentioned en passant!\n";
}

If your application uses regular expressions for data validation, be sure to give Regexp::Common a look, and see if you can save yourself some time and suffering. By adding new patterns as needed to Regexp::Common's array of tools, you can have consistent validation throughout a large application. If you write something useful, why not submit it to the maintainers to add? You can find contact information in the Regexp::Common documentation.

Ruth Holloway has been a system administrator and software developer for a long, long time, getting her professional start on a VAX 11/780, way back when. She spent a lot of her career (so far) serving the technology needs of libraries, and has been a contributor since 2008 to the Koha open source library automation suite. Ruth is currently a Perl developer and project lead at Clearbuilt.

2 Comments

Nice to know, Ruth.
I'm trying to figure out how your last examples would know en passant references as being different from the other...the regex looks the same.

In the final code block, the regex is the same for both usages, and so would not be able to discriminate between mid-sentence and start-of-sentence references (or, for that matter, case discrepancies). If you wanted to do that, you could easily create regex patterns for that behavior, and then use Regexp::Common patterns to refer to the two different behaviors. Regexp::Common just serves as a somewhat-more-readable shortcut to regexes you intend to use more than once.

In reply to Greg Pittman

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.