Using Python to find corrupted images


Jason van Gumster. CC BY-SA 4.0

Catch up on this series:

Part 1: Automating repetitive tasks for digital artists with Python

Part 2: Python file-management tricks for digital artists


If you're working with images on a computer, you're bound to eventually run into corrupted files that ruin your day. I run into this with animation renders (remember, the best practice here is to render to a sequence of image files and not a single video file). However, animation and visual effects are not the only places where you see image corruption. You can just as easily run into this in other fields. Perhaps you're a photographer who has shot a bunch of brackets for HDRI (High Dynamic Range Imaging) tone mapping, and something glitches when transferring files from your camera.

The problem isn't so much the effort needed to repair or replace a corrupted image, which is usually just a matter of re-rendering the image or re-copying the good image to your computer. Rather, the trick is to find those bad images as early in the process as possible. The longer you don't know, the greater the hassle you'll face when you do encounter a corrupt image.

So, what do you do? Well, you could go through and open each file, one at a time, in your image editor or viewer of choice, and let that program tell you there's a problem. However, photographic images are large, and it can be annoying and time-consuming to go through a whole set just to find one or two baddies. And although animation renders are typically smaller files, you often have a lot more of them to go through. In my case, I regularly produce renders that have over 44,000 frames. (No, that's not a typo: forty-four thousand frames.)

The solution? You guessed it. Write a script.

As with previous articles in this series, you'll do your scripting in Python. Step one: get a listing of your files. Fortunately, if you've gone through the last article in this series, you know that's a matter of using the os module. Assume that all of the image files you want to inspect are in a single directory on your hard drive. Furthermore, assume that you're going to run this script from within that directory. Using Python, you can get a list of those files with the following code:

import os
    
for filename in os.listdir('./'):
  print(filename)

If you'd like, you can narrow down that list of images (or at least more clearly specify it; for instance, you don't want to include this script as one of those files) by looking just for files that end with the PNG extension:

import os
    
for filename in os.listdir('./'):
  if filename.endswith('.png'):
    print(filename)
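
One caveat worth knowing: endswith('.png') is case-sensitive, so a file named FRAME_0001.PNG would be skipped. Here is a minimal sketch of a more forgiving filter. The helper names png_names and list_pngs, and the sample filenames, are hypothetical and only there for illustration.

```python
import os


def png_names(names):
    """Return the names ending in .png (in any letter case), sorted."""
    return sorted(name for name in names if name.lower().endswith('.png'))


def list_pngs(directory='./'):
    """List the PNG files found directly inside the given directory."""
    return png_names(os.listdir(directory))


# Hypothetical filenames, just to show the filtering
sample = ['frame_0002.png', 'FRAME_0001.PNG', 'find_bad.py', 'notes.txt']
print(png_names(sample))

# The same filter applied to the current working directory
print(list_pngs())
```

Sorting is optional, but it keeps frame sequences in a predictable order when you read the output.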

You now have a list of PNG image files in your current working directory. Now what? Well, now you need to figure out which, if any, of those images are corrupt. In the previous articles of this series, we exclusively used modules that ship with Python by default. Unfortunately, discovering if an image is corrupt without any image processing capability is difficult, and neither Python 2 nor Python 3 ships with any way to handle that out of the box. You'll need to get yourself an image processing module to read these files. Happily, the Python development community has made that easier for you.

In fact, you have an entire library of packages available to you to install. You just need to know how to get them. Let me introduce you to pip, the recommended tool for installing Python packages. It's installed by default on most platforms when you install Python.

Note: I'm using Python 3, but if you're using Python 2, nearly everything I've written in this series is transferable between both variations of the language. Also, many Linux distributions prefer that you use their own package management system over using pip to install Python packages. Feel free to stick to that if you prefer. The suggestion to use pip here is mostly in the interest of being consistent across all of the platforms you can use Python on.

The specific package that I'm going to recommend that you install is called Pillow. It's a "friendly fork" of the original PIL (Python Imaging Library) that works in current releases of both Python 3 and Python 2. All you need to do to install Pillow is fire up a terminal window and type pip install Pillow. The Python package tool should handle the rest for you from there.
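
If you're not sure whether the install worked, a quick check from Python itself can tell you. This is only a minimal sketch: the function name pillow_available is hypothetical, and it reports nothing more than whether the interpreter you're running can find the PIL package.

```python
import importlib.util


def pillow_available():
    """Return True if the PIL package (provided by Pillow) can be found."""
    return importlib.util.find_spec('PIL') is not None


if pillow_available():
    import PIL
    # Recent Pillow releases expose a version string; fall back if it's missing
    print('Pillow found, version', getattr(PIL, '__version__', 'unknown'))
else:
    print('Pillow not found; try: pip install Pillow')
```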

Once you have Pillow installed, you need a way of using it in your script. Because it's installed, you can treat it just like any module that comes with Python. You use import; in this case, you could use import PIL. However, to look for corrupt images, you don't really need to bring the entirety of the Pillow library into your script's namespace. In Python, you can import just a single subcomponent of a module. Python still loads the containing module behind the scenes, so this doesn't reduce memory use, but it is good practice because it makes it clear right from the start what your script is going to use. Plus, when you import subcomponents, you end up needing to type less once you get into the meat of your script, which is always a nice bonus.

To import a subcomponent of a module, you precede your import with a from directive. In the case of Pillow, your script really only needs the Image module. So, your import line would look like from PIL import Image. In fact, you can do the same thing with the os module. If you look back at the previous code, you might notice that you're only using the listdir function in the os module. So instead of import os, you could use from os import listdir. This means that when you get into your script, you no longer have to type os.listdir. Instead, you only need to type listdir, because that's the only name you've imported.
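
To see that both spellings reach the same function, here is a tiny sketch that uses only the os module. Nothing in it is specific to Pillow.

```python
import os
from os import listdir

# Both names refer to the very same function object
print(listdir is os.listdir)

# So either spelling can be used to list the current directory
for filename in listdir('./'):
    print(filename)
```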

Pulling all that together, your script should now look something like this:

from os import listdir
from PIL import Image
    
for filename in listdir('./'):
  if filename.endswith('.png'):
    print(filename)

You've got Pillow's Image module loaded, but your script still isn't doing anything with it yet. It's now time to get to the functional section of your script. What you're going to do is the scripted equivalent of opening each image file and checking to see if it's readable. If there's an error, then you've found a bad file. To do that, you're going to use a try/except block. In short, your script is going to try to run a function that opens a file. If that function raises an error, otherwise known as an exception, then you know that image has a problem. In particular, if the exception is of type IOError or SyntaxError, then you know you've got yourself a bad image.

The syntax for doing a try/except is pretty straightforward. I've described it in code comments below:

try: # These next functions may produce an exception
  pass # <some function>
except (IOError, SyntaxError) as e: # These are the exceptions we're looking for
  pass # <do something... like print an intelligent error message>
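
Here is a small runnable sketch of the same pattern, using only built-in functionality. It tries to open a file that presumably doesn't exist; the filename and the helper name can_open are hypothetical. In Python 3, IOError is simply another name for OSError, so catching either one works.

```python
def can_open(path):
    """Return True if the file can be opened for reading, False otherwise."""
    try:
        # open() raises an exception if the file is missing or unreadable
        with open(path, 'rb') as handle:
            handle.read(1)
    except IOError as e:
        print('Could not open', path, '-', e)
        return False
    return True


print(can_open('no_such_file_12345.png'))
```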

In the case of looking for corrupt image files, you'll want to test two functions: Image.open() and verify(). If you wrap those in a try/except block, your corrupt image-finding script should look like this:

from os import listdir
from PIL import Image
    
for filename in listdir('./'):
  if filename.endswith('.png'):
    try:
      img = Image.open('./'+filename) # open the image file
      img.verify() # verify that it is, in fact, an image
    except (IOError, SyntaxError) as e:
      print('Bad file:', filename) # print out the names of corrupt files

And there you go. Save this script in your directory of images. When you run it from the command line, you should get a list of all the corrupt image files in there. If nothing prints out, then you can assume all of those image files are good, valid images.
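
One possible variation on the script, sketched as a helper function. It folds in two suggestions from the comments below: a with block closes each file as soon as it has been checked, and struct.error is caught as well, since a reader reports that it can also occur. The function name is_bad_image is hypothetical, and treating struct.error as a sign of corruption is an assumption based on that report.

```python
import struct
from os import listdir

from PIL import Image


def is_bad_image(path):
    """Return True if Pillow cannot open and verify the file at path."""
    try:
        # The with block closes the file as soon as the check is done
        with Image.open(path) as img:
            img.verify()
    except (IOError, SyntaxError, struct.error):
        return True
    return False


for filename in listdir('./'):
    if filename.endswith('.png') and is_bad_image('./' + filename):
        print('Bad file:', filename)
```

Pulling the check into a function keeps the loop short and makes the test reusable if you later want to do something other than print the filename.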

Of course, being able to use this script on any arbitrary directory would be nice. And having the script prompt you to instruct it to go ahead and delete those corrupt files for you would be even nicer. Good news! You can make the script do exactly that. We'll cover that in the next articles in this series.

In the meantime, have fun rooting out corruption in your image folders.

Jason van Gumster mostly makes stuff up. He writes, animates, and occasionally teaches, all using open source tools. He's run a small, independent animation studio, wrote Blender For Dummies and GIMP Bible, and continues to blurt out his experiences during a [sometimes] weekly podcast, the Open Source Creative Podcast. Adventures (and lies) at @monsterjavaguns.

7 Comments

I never thought to do this. At all. Much less with Python.

Great article, great tip. Thanks!

In my experience a struct.error can also occur, in addition to the ones you mentioned above.

Great tip! I've not come across that myself, but I can definitely see that happening. Well worth adding to the script.

In reply to Ashwin Vishnu (not verified)

I've used PIL quite a bit related to Scribus. I was wondering whether there might be some strain on resources by serially opening 100, 500 or 1,000 image files. Are the images loaded into memory?

That's a good question, and to be honest I'm not sure that the images are loaded into memory. That said, like I wrote in the article, I regularly use a variation of this script on directories with over 44,000 images in them and I've not encountered any abnormal memory usage. Granted, the machine I run this on has a pretty beefy RAM spec, so I'll definitely need to pay closer attention the next time I run it.

All in all, it probably wouldn't hurt (and would likely be more proper) to add an img.close() at the end of that for loop. That should resolve most of the issue there, I'd think.

In reply to Greg P

When I read the title, I was expecting some sort of checking against checksums (e.g., MD5).

Out of curiosity, how can the verify() function really check integrity with nothing else to compare against?

Thank you.

If we were *just* talking about images transferred from a device like a camera, then checksums would probably work as a solution. However, in the case of animation frames that are being rendered, it would be difficult to use checksums to do this sort of thing because of exactly the problem you've brought up. We're dealing with original data; there's no "known good" version of the file to compare against. In that case, our question of data integrity changes from "is this the same file as another one we already know about?" to "can this file be read as an image at all?" That's what verify() is for.
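
To illustrate how a file can be checked against itself: the PNG format stores a CRC checksum alongside every chunk of data, so a reader can recompute each checksum and compare it with the stored one. The sketch below does only that, using Python's standard library. It is not Pillow's actual implementation of verify(), and all of the function names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def png_chunks_look_valid(data):
    """Return True if every chunk's stored CRC matches its contents."""
    if not data.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        header = data[pos:pos + 8]
        if len(header) < 8:
            return False  # truncated chunk header
        length, chunk_type = struct.unpack('>I4s', header)
        body_end = pos + 8 + length
        crc_bytes = data[body_end:body_end + 4]
        if len(crc_bytes) < 4:
            return False  # truncated chunk data or CRC
        (stored_crc,) = struct.unpack('>I', crc_bytes)
        # The CRC covers the chunk type and chunk data, not the length field
        actual_crc = zlib.crc32(data[pos + 4:body_end]) & 0xFFFFFFFF
        if stored_crc != actual_crc:
            return False
        if chunk_type == b'IEND':
            return True  # reached the end marker with every CRC intact
        pos = body_end + 4
    return False  # ran out of data before the IEND chunk


def make_chunk(chunk_type, payload):
    """Build one PNG chunk: length, type, data, CRC."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return (struct.pack('>I', len(payload)) + chunk_type + payload
            + struct.pack('>I', crc))


def make_tiny_png():
    """Build a 1x1 grayscale PNG entirely in memory, for demonstration."""
    ihdr = struct.pack('>IIBBBBB', 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b'\x00\x00')  # one filter byte plus one pixel
    return (PNG_SIGNATURE + make_chunk(b'IHDR', ihdr)
            + make_chunk(b'IDAT', idat) + make_chunk(b'IEND', b''))


good = make_tiny_png()
damaged = bytearray(good)
damaged[16] ^= 0xFF  # flip bits in the first byte of the IHDR data

print('intact file passes:', png_chunks_look_valid(good))
print('damaged file passes:', png_chunks_look_valid(bytes(damaged)))
```

A check like this only shows that the file is internally consistent. It says nothing about whether the picture is the one you meant to render.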

In reply to Monster (not verified)

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.