3 Python web scrapers and crawlers

Check out these great Python tools for crawling and scraping the web, and parsing out the data you need.

In a perfect world, all of the data you need would be cleanly presented in an open and well-documented format that you could easily download and use for whatever purpose you need.

In the real world, data is messy, rarely packaged how you need it, and often out-of-date.

Often, the information you need is trapped inside a website. While some websites make an effort to present data in a clean, structured format, many do not. Crawling, scraping, processing, and cleaning data are necessary steps for a whole host of tasks, from mapping a website's structure to collecting data that's in a web-only format, or perhaps locked away in a proprietary database.

Sooner or later, you're going to find a need to do some crawling and scraping to get the data you need, and almost certainly you're going to need to do a little coding to get it done right. How you do this is up to you, but I've found the Python community to be a great provider of tools, frameworks, and documentation for grabbing data off of websites.

Before we jump in, just a quick request: think before you do, and be nice. In the context of scraping, this can mean a lot of things. Don't crawl websites just to duplicate them and present someone else's work as your own (without permission, of course). Be aware of copyrights and licensing, and how each might apply to whatever you have scraped. Respect robots.txt files. And don't hit a website so frequently that the actual human visitors have trouble accessing the content.

With that caution stated, here are some great Python tools for crawling and scraping the web, and parsing out the data you need.

Pyspider

Let's kick things off with pyspider, a web crawler with a web-based user interface that makes it easy to keep track of multiple crawls. It's an extensible option, with multiple backend databases and message queues supported, and several handy features baked in, from prioritization to retrying failed pages to crawling pages by age. Pyspider supports both Python 2 and 3, and for faster crawling, you can use it in a distributed format with multiple crawlers going at once.

Pyspider's basic usage is well documented, including sample code snippets, and you can check out an online demo to get a sense of the user interface. Licensed under the Apache 2 license, pyspider is still being actively developed on GitHub.
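
To give you a feel for the code, here is a minimal sketch of a pyspider handler, modeled on the project's quickstart; the seed URL is just a placeholder:

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        crawl_config = {}

        @every(minutes=24 * 60)  # re-run the seed crawl once a day
        def on_start(self):
            self.crawl('http://example.com/', callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)  # treat fetched pages as fresh for 10 days
        def index_page(self, response):
            # Queue every outbound link for a detail crawl.
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page)

        def detail_page(self, response):
            return {"url": response.url, "title": response.doc('title').text()}

Running the pyspider command then serves the web UI (on localhost:5000 by default), where you can start, monitor, and debug this handler.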

MechanicalSoup

MechanicalSoup is a crawling library built around the hugely popular and incredibly versatile HTML parsing library Beautiful Soup. If your crawling needs are fairly simple, but require you to check a few boxes or enter some text and you don't want to build your own crawler for this task, it's a good option to consider.

MechanicalSoup is licensed under an MIT license. For more on how to use it, check out the example source file example.py on the project's GitHub page. Unfortunately, the project does not have robust documentation at this time.
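
That said, the basics are simple. Here is a rough sketch, assuming a release that provides the StatefulBrowser interface; the target page and field names are those of httpbin.org's demo form and will differ on your site:

    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()
    browser.open("http://httpbin.org/forms/post")

    # Select the form and fill in its fields by name.
    browser.select_form('form[action="/post"]')
    browser["custname"] = "Jason"   # a text input
    browser["size"] = "medium"      # a radio button

    # Submit and inspect the response (httpbin echoes the posted data).
    response = browser.submit_selected()
    print(response.text)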

Scrapy

Scrapy is a scraping framework, supported by an active community, with which you can build your own scraping tool. In addition to scraping and parsing tools, it can easily export the data it collects in a number of formats like JSON or CSV and store the data on a backend of your choosing. It also has a number of built-in extensions for tasks like cookie handling, user-agent spoofing, restricting crawl depth, and others, as well as an API for easily building your own additions.
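
A minimal spider looks something like the following sketch, in the style of Scrapy's tutorial; it crawls quotes.toscrape.com, a practice site run for exactly this purpose, and the CSS selectors are specific to that page:

    import scrapy

    class QuoteSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # Each quote on the page lives in a div.quote block.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").extract_first(),
                    "author": quote.css("small.author::text").extract_first(),
                }
            # Follow the pagination link until there are no more pages.
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Saved as quotes_spider.py, it can be run with scrapy runspider quotes_spider.py -o quotes.json to produce the kind of JSON export mentioned above.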

For an introduction to Scrapy, check out the online documentation or one of its many community resources, including an IRC channel, a subreddit, and a healthy following on its Stack Overflow tag. Scrapy's code base can be found on GitHub under a 3-clause BSD license.

If you're not all that comfortable with coding, Portia provides a visual interface that makes it easier. A hosted version is available at scrapinghub.com.

Others

  • Cola describes itself as a “high-level distributed crawling framework” that might meet your needs if you're looking for a Python 2 approach, but note that it has not been updated in over two years.

  • Demiurge, which supports both Python 2 and Python 3, is another potential candidate to look at, although development on this project is relatively quiet as well.

  • Feedparser might be a helpful project to check out if the data you are trying to parse resides primarily in RSS or Atom feeds; a short sketch follows this list.

  • Lassie makes it easy to retrieve basic content like a description, title, keywords, or a list of images from a webpage.

  • RoboBrowser is another simple library for Python 2 or 3 with basic functionality, including button-clicking and form-filling. Though it hasn't been updated in a while, it's still a reasonable choice.
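
The Feedparser sketch promised above comes down to a single parse call; the feed URL here is just an example, and any RSS or Atom feed should work:

    import feedparser

    # Parse a feed (RSS or Atom; feedparser works out which).
    feed = feedparser.parse("https://opensource.com/feed")

    print(feed.feed.title)
    for entry in feed.entries[:5]:
        print(entry.title, "->", entry.link)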


This is far from a comprehensive list, and of course, if you're a master coder you may choose to take your own approach rather than use one of these frameworks. Or perhaps you've found a great alternative built for a different language. For sites that are trickier to crawl without using an actual web browser, Python coders would probably appreciate checking out the Python bindings for Selenium.
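
As a minimal sketch, assuming Firefox and its WebDriver (geckodriver) are installed, driving a real browser looks like this; the URL is just an example:

    from selenium import webdriver

    driver = webdriver.Firefox()            # opens a real browser window
    driver.get("https://opensource.com/")   # JavaScript runs as it would for a human visitor
    html = driver.page_source               # the rendered DOM, ready for your parser of choice
    print(driver.title)
    driver.quit()

If you've got a favorite tool for crawling and scraping, let us know in the comments below.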

Jason was an Opensource.com staff member and Red Hatter from 2013 to 2022. This profile contains his work-related articles from that time. Other contributions can be found on his personal account.

4 Comments

Good

A few years ago I started with Beautiful Soup.

For one recent project, started two years ago and still in daily use, I used Selenium.
With Selenium, it is easier to debug because you can watch in a browser what is happening and how your spider is crawling.
Once debugging was done, I ran Selenium in headless mode (with PhantomJS), which reduced scraping time from 2h to 1h.

Thanks for the summary Jason! By the way, the documentation of MechanicalSoup has improved significantly in the past few months. There's now an extensive Read the Docs site: https://mechanicalsoup.readthedocs.io/en/latest/

What's the best spider that will index into Elasticsearch 5+?

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.