3 open source code libraries to handle MARC-formatted records

Developers can use these libraries for Java, C#, and Perl.

Welcome back to Nooks & Crannies! After a month off for my wedding, I've been digging around for some interesting bits for upcoming columns. This month, I'll take a look at some open source code libraries that developers can use to handle MARC-formatted records.

A little background for the MARC novice

MARC stands for MAchine-Readable Cataloging. It's a format first developed in the 1960s for the U.S. Library of Congress to facilitate the exchange of bibliographic records among libraries. By the mid-1970s, it was an international standard, used around the world.

There are several variants of the MARC format. MARC21 was a merger in the 1990s between USMARC and CANMARC, the US and Canadian variants then in use, and other countries have their own formats. In much of Europe, UNIMARC is the variant most often seen. All of these variants share the same record structure: tags that contain the information, and a directory that tells which tags are in the record and where they are located.

Each tag, in each format, means something specific. For instance, in the MARC21 bibliographic format, the 245 tag holds information about the title of the work. Additional information, including the publisher, author, size of the physical book, publication date, and subjects, is contained in other tags.
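Inside a data field such as 245, the content is split further: two indicator characters come first, then one or more subfields, each introduced by a delimiter byte (hex 1F) and a one-character code. Here is a minimal sketch of that split in plain Java. The class and method names are my own inventions for illustration, not part of any MARC library, and the sample title is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of MARC4J or any other library.
class SubfieldSketch {

    // Every subfield starts with this delimiter byte, followed by a one-character code.
    static final char SUBFIELD_DELIMITER = (char) 0x1F;

    // Splits the content of a data field (two indicators, then subfields) into
    // a code -> value map. If a code repeats, only the first occurrence is kept.
    static Map<Character, String> subfields(String fieldContent) {
        Map<Character, String> result = new LinkedHashMap<>();
        // Skip the two indicator characters, then cut at every delimiter.
        String[] parts = fieldContent.substring(2).split(String.valueOf(SUBFIELD_DELIMITER));
        for (String part : parts) {
            if (part.isEmpty()) {
                continue; // the empty piece in front of the first delimiter
            }
            result.putIfAbsent(part.charAt(0), part.substring(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String d = String.valueOf(SUBFIELD_DELIMITER);
        // A made-up 245 field: indicators "10", title in subfield a,
        // statement of responsibility in subfield c.
        String field245 = "10" + d + "aMoby Dick /" + d + "cHerman Melville.";
        Map<Character, String> parsed = subfields(field245);
        System.out.println("title: " + parsed.get('a'));
        System.out.println("responsibility: " + parsed.get('c'));
    }
}
```

Real records repeat subfield codes fairly often, so a library will normally hand you a list of subfields rather than a map; the map here just keeps the sketch short.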

The format of the record, if you were to just print it out, is kind of hard to read. It was originally designed for serial interchange via 9-track tape, and that medium was still in use in the early days of my career, in the 1990s. Each record begins with a 24-byte leader. The first five bytes of the leader are digits and tell you how long the record is in bytes, including those five bytes. The clever modern nerd will instantly perceive the limitation of this structure: a record cannot be longer than 99,999 bytes. Following the leader is the directory of tags, telling what tags to look for, and at which byte each tag's data starts. After that comes the tag data, and the next byte after that is the first byte of the next record. The leader/directory/tag structure is generically defined in ISO-2709; MARC21 and UNIMARC are the formats that define the meanings of the tags.
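To make the leader and directory a little more concrete, here is a rough sketch in plain Java that builds a tiny two-field record in memory and then reads the record length, the directory, and the field data back out. It assumes the MARC21 layout, in which each directory entry is 12 bytes (a 3-character tag, a 4-digit field length, and a 5-digit starting position) and the data begins at the base address stored in leader bytes 12 through 16. The class, its method names, and the sample record are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of any MARC library.
class Iso2709Sketch {

    static final int LEADER_LENGTH = 24;        // the leader is always 24 bytes
    static final int ENTRY_LENGTH = 12;         // 3-char tag + 4-digit length + 5-digit start
    static final char FIELD_TERMINATOR = 0x1E;  // ends the directory and every field
    static final char RECORD_TERMINATOR = 0x1D; // ends the record

    // One directory entry: the tag, its data length, and its offset from the base address.
    static final class DirectoryEntry {
        final String tag;
        final int length;
        final int start;
        DirectoryEntry(String tag, int length, int start) {
            this.tag = tag;
            this.length = length;
            this.start = start;
        }
    }

    static int readNumber(byte[] rec, int offset, int digits) {
        return Integer.parseInt(new String(rec, offset, digits, StandardCharsets.US_ASCII));
    }

    // Leader bytes 0-4: total record length, counting those five bytes.
    static int recordLength(byte[] rec) {
        return readNumber(rec, 0, 5);
    }

    // Leader bytes 12-16: the offset at which the field data begins.
    static int baseAddress(byte[] rec) {
        return readNumber(rec, 12, 5);
    }

    // Walks the directory, one 12-byte entry at a time, until its terminator.
    static List<DirectoryEntry> directory(byte[] rec) {
        List<DirectoryEntry> entries = new ArrayList<>();
        int pos = LEADER_LENGTH;
        while (rec[pos] != FIELD_TERMINATOR) {
            String tag = new String(rec, pos, 3, StandardCharsets.US_ASCII);
            entries.add(new DirectoryEntry(tag, readNumber(rec, pos + 3, 4), readNumber(rec, pos + 7, 5)));
            pos += ENTRY_LENGTH;
        }
        return entries;
    }

    // Returns a field's data without its trailing field terminator.
    static String fieldData(byte[] rec, DirectoryEntry entry) {
        int from = baseAddress(rec) + entry.start;
        return new String(rec, from, entry.length - 1, StandardCharsets.UTF_8);
    }

    // Builds a made-up record with a 001 control field and a 245 data field.
    static byte[] sampleRecord() {
        String f001 = "12345" + FIELD_TERMINATOR;
        String f245 = "10" + (char) 0x1F + "aJane Eyre" + FIELD_TERMINATOR;
        String directory = String.format("001%04d%05d", f001.length(), 0)
                + String.format("245%04d%05d", f245.length(), f001.length())
                + FIELD_TERMINATOR;
        int base = LEADER_LENGTH + directory.length();
        int total = base + f001.length() + f245.length() + 1; // +1 for the record terminator
        String leader = String.format("%05dnam a22%05d a 4500", total, base);
        return (leader + directory + f001 + f245 + RECORD_TERMINATOR)
                .getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] rec = sampleRecord();
        System.out.println("record length from leader: " + recordLength(rec));
        for (DirectoryEntry entry : directory(rec)) {
            // Replace the unprintable subfield delimiter with '$' for display.
            System.out.println(entry.tag + " -> " + fieldData(rec, entry).replace((char) 0x1F, '$'));
        }
    }
}
```

The sketch does no error handling at all, and it ignores character encodings other than plain ASCII, which is exactly the kind of nitty-gritty the libraries below take care of for you.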

Yes, it's a poorly designed format by modern standards. Yes, it needs updating, in the worst way, but that's the subject of another article altogether. In this article, I'll show you three code libraries that you can use to manipulate MARC records without having to know all the nitty-gritty of the arcane tag directory.

Java: MARC4J

MARC4J allows the creation of an iterator to read an input stream such as a file, and do things with the MARC21 or UNIMARC records that it finds in the stream. There are record-writing tools, too, of course, and iterators for examining the records in detail. Here's a quick example that reads in a file of records and, if the title of the work in field 245, subfield a, starts with the letter J, writes the record to another file:

import org.marc4j.MarcReader;
import org.marc4j.MarcStreamReader;
import org.marc4j.MarcStreamWriter;
import org.marc4j.MarcWriter;
import org.marc4j.marc.Record;
import org.marc4j.marc.DataField;
import org.marc4j.marc.Subfield;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class JMarcExample {

  public static void main(String[] args) throws Exception {

    InputStream  in  = new FileInputStream("inputfile.mrc");
    OutputStream out = new FileOutputStream("outputfile.mrc");
    MarcReader reader = new MarcStreamReader(in);
    MarcWriter writer = new MarcStreamWriter(out);
    while (reader.hasNext()) {
      Record record = reader.next();
      // A record with no 245 field returns null here, so skip it.
      DataField datafield = (DataField) record.getVariableField("245");
      if (datafield == null) {
        continue;
      }

      // Fetch the first subfield a, if there is one, and test the title.
      Subfield subfield = datafield.getSubfield('a');
      if (subfield != null && subfield.getData().startsWith("J")) {
        writer.write(record);
      }
    }
    // Closing the writer flushes the output file.
    writer.close();
    in.close();
  }
}

MARC4J also includes handlers for Unicode and for the MARCXML variant, in which MARC records are rendered in XML. The tag structure in MARCXML is easier for human eyes to read, but it's much wordier, as you can imagine. MARC4J is agnostic about what the 245 tag might actually mean, so, in that sense, it should be able to read and write any ISO-2709-formatted record.
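To give a feel for how much wordier MARCXML is, here is a small sketch in plain Java that renders one title field both ways: as a MARCXML datafield element and as the bytes it would occupy in a binary record. This is hand-built string assembly for illustration only, not how MARC4J writes XML; the class and method names are hypothetical, and a complete MARCXML document would also wrap the field in record and collection elements with the MARC21 slim namespace, which I've left out.

```java
// Hypothetical illustration class; it hand-builds XML and is not MARC4J's MARCXML writer.
class MarcXmlSketch {

    // Escapes the characters that are not allowed to appear raw in XML text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Renders a data field with a single subfield as a MARCXML <datafield> element.
    static String dataFieldXml(String tag, char ind1, char ind2, char code, String value) {
        return "<datafield tag=\"" + tag + "\" ind1=\"" + ind1 + "\" ind2=\"" + ind2 + "\">\n"
             + "  <subfield code=\"" + code + "\">" + escape(value) + "</subfield>\n"
             + "</datafield>";
    }

    // The same field as stored in a binary record: two indicators, the subfield
    // delimiter (hex 1F), the subfield code, the value, and the field terminator (hex 1E).
    static String dataFieldBinary(char ind1, char ind2, char code, String value) {
        return "" + ind1 + ind2 + (char) 0x1F + code + value + (char) 0x1E;
    }

    public static void main(String[] args) {
        String xml = dataFieldXml("245", '1', '0', 'a', "Jane Eyre");
        String binary = dataFieldBinary('1', '0', 'a', "Jane Eyre");
        System.out.println(xml);
        System.out.println("XML characters: " + xml.length());
        // The binary form also needs a 12-byte directory entry to record the tag.
        System.out.println("binary bytes, including directory entry: " + (binary.length() + 12));
    }
}
```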

MARC4J is licensed under LGPL V2.1 and is available on GitHub.

C#: CSharp_MARC

CSharp_MARC has a rich set of tools for importing and exporting MARC21 and MARCXML records, including record validation and search-and-replace tools that allow for the batch editing of records. It also has some reporting tools built right in, to report on copyright year or classifications. It's very lightweight, capable of handling up to 28,000 records per minute.

Here's a sample program that reads a file of MARC21 records and prints out the author name from each record's 100 tag, subfield a, in the order the records appear:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using MARC;
using System.IO;

namespace CSharp_Show_Authors
{
  class Program
  {
    static void Main(string[] args)
    {
      string rawMarc = File.ReadAllText("inputfile.mrc");
      FileMARC marcRecords = new FileMARC(rawMarc);
      foreach (Record record in marcRecords)
      {
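        Field authorField = record["100"];
        // A record may have no 100 field at all, so guard against a missing field.
        if (authorField != null && authorField.IsDataField())
        {
          DataField authorDataField = (DataField)authorField;
          Subfield authorName = authorDataField['a'];
          // Likewise, the field may lack a subfield a.
          if (authorName != null)
          {
            Console.WriteLine(authorName.Data);
          }
        }
        else if (authorField != null && authorField.IsControlField())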
        {
          //unreachable
          Console.WriteLine("Something awful has happened. The author field should never be a control field!");
        }
      }
    }
  }
}

As with MARC4J, the CSharp_MARC reader and writer tools don't really care what each tag actually means in the bibliographic record, so they should be usable for UNIMARC or other MARC variants. However, the built-in validation tools appear to be constrained to MARC21 and MARCXML. CSharp_MARC is licensed under GPL V3.0 and is available on GitHub.

Perl 5: MARC::Record

You didn't really think I was going to let this article go by without some Perl, did you? MARC::Record, like the other tools here, has mechanisms for handling any ISO-2709-formatted record reading or writing needs you might have. It's got a built-in pretty-printer, it can insert fields into a record or delete them from it, and it will properly update the tag directory on output. It's not as feature-rich as the C# library but can handle most basic record-manipulation needs. I've used this library for years in my own work. (Disclaimer: MARC::Record is maintained by my great friend and colleague, Galen Charlton.)

Here is an example script that reads a file of MARC21 records and writes out a pipe-delimited file of the author (100 subfield a) and title (245 subfield a):

use strict;
use warnings;
use IO::File;
use MARC::File::USMARC;
use MARC::Record;
use MARC::Batch;
use MARC::Charset;

my $in_fh = IO::File->new("inputfile.mrc", "r")
  or die "Cannot open inputfile.mrc: $!";
my $batch = MARC::Batch->new('USMARC', $in_fh);
$batch->warnings_off();
$batch->strict_off();
# Treat the input as MARC-8, and carry on past characters that won't convert.
MARC::Charset->ignore_errors(1);
MARC::Charset->assume_encoding('marc8');
open my $out_fh, ">:utf8", "outputfile.psv"
  or die "Cannot open outputfile.psv: $!";

while (my $this_record = $batch->next()) {
  # Not every record has a 100 or a 245 field, so look before dereferencing.
  my $author_field = $this_record->field('100');
  my $title_field  = $this_record->field('245');
  my $author = $author_field ? $author_field->subfield('a') // '' : '';
  my $title  = $title_field  ? $title_field->subfield('a')  // '' : '';
  print $out_fh "$author|$title\n";
}
close $in_fh;
close $out_fh;

MARC::Record is available on CPAN under the Perl license. To date, no one has written a MARC handler for Perl 6; I can hear a half-dozen of my colleagues yelling "well volunteered!" at me about now...

I did some open source searching

I did a little bit of digging around on GitHub, and quickly found libraries in Python, JavaScript, Ruby, Node.js, and Scala. I didn't test any of them out, so they might be incomplete or partial, but for any language you want to code in, it's likely you can find a module that will work for you.