Open source image recognition with Luminoth

Image by: Agustin Azzinnari, CC BY

Computer vision is a way to use artificial intelligence to automate image recognition, that is, to use computers to identify what's in a photograph, video, or another image type. The latest version of Luminoth (v. 0.1), an open source computer vision toolkit built in Python and using TensorFlow and Sonnet, offers several improvements over its predecessor:

An implementation of the Single Shot MultiBox Detector (SSD) model, a much faster (although less accurate) object detector than the already included Faster R-CNN. SSD enables object detection in real time on most modern GPUs, which makes it possible to process video streams, for example.
Some tweaks to the Faster R-CNN model and a new base configuration that allow it to reach results comparable to existing implementations when training on the COCO and Pascal VOC visual object detection datasets.
Checkpoints for both the SSD and Faster R-CNN models trained on the Pascal and COCO datasets, respectively, with state-of-the-art results. This makes object detection in images extremely straightforward, as these checkpoints will be downloaded automatically by the library, even when just using the command-line interface (CLI).
General usability improvements, such as a cleaner CLI for most commands, support for videos on prediction, and a redesign of the included web frontend to make it easier to play around with the models.

Let's explore each of these features by incrementally building our own computer vision image detector.

Installing and testing Luminoth

First, install Luminoth. Inside your virtual environment, run:

$ pip install luminoth

If you have a GPU available and want to use it, first run pip install tensorflow-gpu, then run the installation command above.

Luminoth's new checkpoint functionality provides pre-trained models for both Faster R-CNN and SSD out of the box. This means you can download and use a fully trained object-detection model with just a couple of commands. Let's start by refreshing the checkpoint repository using Luminoth's CLI tool, lumi:

$ lumi checkpoint refresh
Retrieving remote index... done.
2 new remote checkpoints added.
$ lumi checkpoint list
================================================================================
|           id |                  name |       alias | source |         status |
================================================================================
| 48ed2350f5b2 |   Faster R-CNN w/COCO |    accurate | remote | NOT_DOWNLOADED |
| e3256ffb7e29 |      SSD w/Pascal VOC |        fast | remote | NOT_DOWNLOADED |
================================================================================

The output shows all the available pre-trained checkpoints. Each checkpoint is identified by its id field (in this example, 48ed2350f5b2 and e3256ffb7e29) and, optionally, by an alias (e.g., accurate and fast). You can check other information with the command lumi checkpoint detail <checkpoint_id_or_alias>. We're going to try out the Faster R-CNN checkpoint, so we'll download it (by using the alias instead of the ID) and then use the lumi predict command:

$ lumi checkpoint download accurate
Downloading checkpoint...  [####################################]  100%
Importing checkpoint... done.
Checkpoint imported successfully.
$ lumi predict image.jpg
Found 1 files to predict.
Neither checkpoint not config specified, assuming `accurate`.
Predicting image.jpg... done.
{
  "file": "image.jpg",
  "objects": [
    {"bbox": [294, 231, 468, 536], "label": "person", "prob": 0.9997},
    {"bbox": [494, 289, 578, 439], "label": "person", "prob": 0.9971},
    {"bbox": [727, 303, 800, 465], "label": "person", "prob": 0.997},
    {"bbox": [555, 315, 652, 560], "label": "person", "prob": 0.9965},
    {"bbox": [569, 425, 636, 600], "label": "bicycle", "prob": 0.9934},
    {"bbox": [326, 410, 426, 582], "label": "bicycle", "prob": 0.9933},
    {"bbox": [744, 380, 784, 482], "label": "bicycle", "prob": 0.9334},
    {"bbox": [506, 360, 565, 480], "label": "bicycle", "prob": 0.8724},
    {"bbox": [848, 319, 858, 342], "label": "person", "prob": 0.8142},
    {"bbox": [534, 298, 633, 473], "label": "person", "prob": 0.4089}
  ]
}

The lumi predict command defaults to the checkpoint with the alias accurate, but you can choose another one with the option --checkpoint=<alias_or_id>. The prediction takes about 30 seconds on a modern CPU. Here is the result:

People and bikes detected with the Faster R-CNN model.

You can also write the JSON output to a file (through the --output or -f option) and make Luminoth store the image with the bounding boxes drawn (through the --save-media-to or the -d option).
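
Once the JSON is saved, it can be post-processed with a few lines of Python. The following is a minimal sketch that assumes the output has the same shape as the example above (a "file" key and a list of "objects" with "label" and "prob"). The helper names confident_objects and count_labels are hypothetical and are not part of Luminoth; the sample is an abridged copy of the output shown earlier.

```python
import json
from collections import Counter

# Abridged sample in the same shape as the `lumi predict` output shown above.
SAMPLE = """
{
  "file": "image.jpg",
  "objects": [
    {"bbox": [294, 231, 468, 536], "label": "person", "prob": 0.9997},
    {"bbox": [569, 425, 636, 600], "label": "bicycle", "prob": 0.9934},
    {"bbox": [506, 360, 565, 480], "label": "bicycle", "prob": 0.8724},
    {"bbox": [534, 298, 633, 473], "label": "person", "prob": 0.4089}
  ]
}
"""


def confident_objects(prediction, min_prob=0.9):
    """Return only the detections whose probability is at least min_prob."""
    return [obj for obj in prediction["objects"] if obj["prob"] >= min_prob]


def count_labels(objects):
    """Count how many detections there are per label."""
    return Counter(obj["label"] for obj in objects)


prediction = json.loads(SAMPLE)
kept = confident_objects(prediction, min_prob=0.9)
print(prediction["file"], dict(count_labels(kept)))
```

To use it on a real run, replace json.loads(SAMPLE) with json.load on the file you passed to --output. The 0.9 threshold is arbitrary; a lower value keeps more detections at the cost of more false positives.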

Now in real time

Unless you're reading this several years in the future (hello from the past!), you probably noticed Faster R-CNN took quite a while to detect the objects in the image. That is because this model favors prediction accuracy over computational efficiency, so it's not feasible to use it for things like real-time processing of videos (especially if you don't have modern hardware). Even on a pretty fast GPU, Faster R-CNN won't do more than two to five images per second.

Enter the Single Shot MultiBox Detector. This model trades some accuracy for speed (and the accuracy gap widens the more classes you want to detect): it reaches around 60 images per second on the same hardware used above, making it suitable for running over video streams or videos in general.

Let's try it out. Run lumi predict again, but this time with the fast checkpoint. Also, this time we won't download it beforehand; the CLI will notice the checkpoint is missing locally and look for it in the remote repository.

$ lumi predict video.mp4 --checkpoint=fast --save-media-to=.
Found 1 files to predict.
Predicting video.mp4  [####################################]  100%     fps: 45.9

Single Shot MultiBox Detector model applied to a dog playing fetch.

It's much faster! The command will generate a video by running SSD on a frame-by-frame basis, so there are no fancy temporal-prediction models (at least for now). In practice, this means you'll probably see some jittering in the boxes, as well as some predictions appearing and disappearing out of nowhere, but it's nothing some post-processing can't fix.
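
As an illustration of the kind of post-processing meant here, the sketch below smooths box coordinates across frames with an exponential moving average. It is not something Luminoth does for you, and it rests on assumptions: boxes are taken to be [x_min, y_min, x_max, y_max], each box is matched to the previous frame's box with the highest overlap (IoU), and the function names and the alpha and min_iou parameters are made up for this example.

```python
def iou(a, b):
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def smooth_boxes(frames, alpha=0.5, min_iou=0.3):
    """Exponentially smooth boxes over a list of per-frame box lists.

    Each box is blended with the best-overlapping smoothed box from the
    previous frame; boxes without a good match pass through unchanged.
    """
    smoothed, previous = [], []
    for boxes in frames:
        current = []
        for box in boxes:
            best = max(previous, key=lambda p: iou(p, box), default=None)
            if best is not None and iou(best, box) >= min_iou:
                # Blend the new box with the previous smoothed position.
                box = [alpha * new + (1 - alpha) * old
                       for new, old in zip(box, best)]
            current.append(list(box))
        smoothed.append(current)
        previous = current
    return smoothed


frames = [[[0, 0, 10, 10]], [[2, 0, 12, 10]]]
print(smooth_boxes(frames))
```

This only reduces jitter. Handling predictions that flicker in and out would need extra logic, such as keeping a box alive for a few frames after it was last seen, and a more careful matcher would stop two boxes from pairing with the same predecessor.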

Train your own model

Say you want to detect cars outside your window, and you aren't interested in the 80 classes present in COCO. Training your model to detect fewer classes may improve detection quality, so let's do that. Note, however, that training on a CPU may take quite a while, so be sure to use a GPU or a cloud service such as Google's ML Engine (which Luminoth integrates with).

Luminoth contains tools to prepare and build a custom dataset from standard formats, such as the ones used by COCO and Pascal. You can also build your own dataset transformer to support your own format, but that's beyond the scope of this article. For now, we'll use the lumi dataset CLI tool to build a dataset containing only cars, taken from both COCO and Pascal (2007 and 2012).

Start by downloading the Pascal 2007, Pascal 2012, and COCO datasets and storing them under a datasets/ directory in your working directory, laid out to match the --data-dir paths used in the commands that follow (datasets/pascal/VOCdevkit/VOC2007/, datasets/pascal/VOCdevkit/VOC2012/, and datasets/coco/). Then run the following commands to convert each dataset and merge the results into one .tfrecords file per split, ready to be consumed by Luminoth:

$ lumi dataset transform \
        --type pascal \
        --data-dir datasets/pascal/VOCdevkit/VOC2007/ \
        --output-dir datasets/pascal/tf/2007/ \
        --split train --split val --split test \
        --only-classes=car
$ lumi dataset transform \
        --type pascal \
        --data-dir datasets/pascal/VOCdevkit/VOC2012/ \
        --output-dir datasets/pascal/tf/2012/ \
        --split train --split val \
        --only-classes=car
$ lumi dataset transform \
        --type coco \
        --data-dir datasets/coco/ \
        --output-dir datasets/coco/tf/ \
        --split train --split val \
        --only-classes=car
$ lumi dataset merge \
        datasets/pascal/tf/2007/classes-car/train.tfrecords \
        datasets/pascal/tf/2012/classes-car/train.tfrecords \
        datasets/coco/tf/classes-car/train.tfrecords \
        datasets/tf/train.tfrecords
$ lumi dataset merge \
        datasets/pascal/tf/2007/classes-car/val.tfrecords \
        datasets/pascal/tf/2012/classes-car/val.tfrecords \
        datasets/coco/tf/classes-car/val.tfrecords \
        datasets/tf/val.tfrecords
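
The three transform commands above differ only in dataset type, paths, and splits, so they can also be generated from a small table. The sketch below only builds and prints the argument lists, using the same flags that appear in the commands above; the transform_command helper and the SOURCES table are hypothetical names for this example. Passing each list to subprocess.run would execute it, assuming lumi is on your PATH.

```python
def transform_command(dataset_type, data_dir, output_dir, splits,
                      only_class="car"):
    """Build the argument list for one `lumi dataset transform` call."""
    cmd = [
        "lumi", "dataset", "transform",
        "--type", dataset_type,
        "--data-dir", data_dir,
        "--output-dir", output_dir,
    ]
    for split in splits:
        cmd += ["--split", split]
    cmd.append("--only-classes=" + only_class)
    return cmd


# (type, input directory, output directory, splits), as in the commands above.
SOURCES = [
    ("pascal", "datasets/pascal/VOCdevkit/VOC2007/",
     "datasets/pascal/tf/2007/", ["train", "val", "test"]),
    ("pascal", "datasets/pascal/VOCdevkit/VOC2012/",
     "datasets/pascal/tf/2012/", ["train", "val"]),
    ("coco", "datasets/coco/", "datasets/coco/tf/", ["train", "val"]),
]

commands = [transform_command(*source) for source in SOURCES]
for cmd in commands:
    print(" ".join(cmd))
```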

Now we're ready to start training. To train a model using Luminoth, you must create a configuration file specifying some required information (such as a run name, the dataset location, and the model to use, as well as a battery of model-dependent hyperparameters). Since Luminoth provides base configuration files, something like this will be enough:

train:
  run_name: ssd-cars
  # Directory in which model checkpoints & summaries (for Tensorboard) will be saved.
  job_dir: jobs/

  # Specify the learning rate schedule to use. These defaults should be good enough.
  learning_rate:
    decay_method: piecewise_constant
    boundaries: [1000000, 1200000]
    values: [0.0003, 0.0001, 0.00001]

dataset:
  type: object_detection
  # Directory from which to read the dataset.
  dir: datasets/tf/

model:
  type: ssd
  network:
    # Total number of classes to predict. One, in this case.
    num_classes: 1

Store it in your working directory (where datasets/ is located) as config.yml. As you can see, we're going to train an SSD model. Run the following:

$ lumi train -c config.yml
INFO:tensorflow:Starting training for SSD
INFO:tensorflow:Constructing op to load 32 variables from pretrained checkpoint
INFO:tensorflow:ImageVisHook was created with mode = "debug"
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into jobs/ssd-cars/model.ckpt.
INFO:tensorflow:step: 1, file: b'000004.jpg', train_loss: 20.626895904541016, in 0.07s
INFO:tensorflow:step: 2, file: b'000082.jpg', train_loss: 12.471542358398438, in 0.07s
INFO:tensorflow:step: 3, file: b'000074.jpg', train_loss: 7.3356428146362305, in 0.06s
INFO:tensorflow:step: 4, file: b'000137.jpg', train_loss: 8.618950843811035, in 0.07s
(ad infinitum)
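
To make the learning_rate block in the configuration above concrete, here is a small sketch of how a piecewise-constant schedule maps a training step to a learning rate. The boundaries and values are the ones from the config. The function itself is an illustration, not Luminoth's code, and the treatment of a step that falls exactly on a boundary (here it keeps the earlier value) is an assumption.

```python
from bisect import bisect_left

# Values copied from the learning_rate section of config.yml above.
BOUNDARIES = [1000000, 1200000]
VALUES = [0.0003, 0.0001, 0.00001]


def piecewise_constant(step, boundaries=BOUNDARIES, values=VALUES):
    """Return the learning rate in effect at a given training step.

    values[0] applies up to boundaries[0], values[1] up to boundaries[1],
    and the last value applies for every step after the last boundary.
    """
    return values[bisect_left(boundaries, step)]


for step in (1, 1100000, 1500000):
    print(step, piecewise_constant(step))
```

In other words, the model trains at 0.0003 for the first million steps, drops to 0.0001 for the next 200,000, and finishes at 0.00001.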

Many hours later, the model should have some reasonable results (you can stop it when it goes beyond 1 million or so steps). You can test it right away using the built-in web interface by running the following command:

$ lumi server web -c config.yml
Neither checkpoint not config specified, assuming 'accurate'.
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Luminoth's frontend with cars detected

Since Luminoth is built upon TensorFlow, you can also follow the training progress with TensorBoard by pointing it at the job_dir specified in the config.

Learn more

In this overview, we've used Luminoth to detect objects in images and videos using pre-trained models, and we even trained our own with a couple of commands. We limited ourselves to the CLI tool and didn't even get into the Python API, from which you can use the trained models as part of a larger system.

If you'd like to learn more, check out the documentation, which contains even more examples of using Luminoth.