Automate checking for flaws in Python with Thoth

Project Thoth pulls together many open source tools to automate program builds with security checks as part of the resolution process.
6 readers like this.
Tips and gears turning

Most cyberattacks take advantage of publicly known vulnerabilities. Many programmers can automate builds using Continuous Integration/Continuous Deployment (CI/CD) or DevOps techniques. But how can we automate the checks for security flaws that turn up hourly in different free and open source libraries? Many methods now exist to ferret out buggy versions of libraries when building an application.

This article will focus on Python because it boasts some sophisticated tools for checking the security of dependencies. In particular, the article explores Project Thoth because it pulls together many of these tools to automate Python program builds with security checks as part of the resolution process. One of the authors, Fridolín, is a key contributor to Thoth.

Inputs to automated security efforts

This section lists efforts to provide the public with information about vulnerabilities. It focuses on tools related to the article's subject: Reports of vulnerabilities in open source Python libraries.

Common Vulnerabilities and Exposures (CVE) program

Any discussion of software security has to start with the comprehensive CVE database, which pulls together flaws discovered by thousands of scattered researchers. The other projects in this article depend heavily on this database. It's maintained by the U.S. National Institute of Standards and Technology (NIST), and additions to it are curated by MITRE, a non-profit corporation specializing in open source software and supported by the U.S. government. The CVE database feeds numerous related projects, such as the CVE Details statistics site.

A person or automated tool can find exact packages and versions associated with security vulnerabilities in a structured format, along with less structured text explaining the vulnerability, as seen below.

CVE sample

(Fridolín Pokorný and Andy Oram, CC BY-SA 4.0)

Security efforts by the Python Packaging Authority

The Python Packaging Authority (PyPA) is the major organization creating best practices for open source packages in the Python language. Volunteers from many companies support PyPA. Security-related initiatives by PyPA are significant advances in making Python robust.

PyPA's Advisory Database curates known vulnerabilities in Python packages in a machine-readable form. Yet another project, pip-audit, supported by PyPA, audits application requirements and reports any known vulnerabilities in the packages used. Output from pip-audit can be in both human-readable and structured formats such as JSON. Thus, automated tools can consult the Advisory Database or pip-audit to warn developers about the risks in their dependencies.

A video by Dustin Ingram, a maintainer of PyPI, explains how these projects work.

Open Source Insights

An initiative called Open Source Insights tries to help open source developers by providing information in structured formats about dependencies in popular language ecosystems. Such information includes security advisories, license information, libraries' dependencies, etc.

To exercise Open Source Insights a bit, we looked up the popular TensorFlow data science library and discovered that (at the time of this writing) it has a security advisory on PyPI (see below). Clicking on the MORE DETAILS button shows links that can help research the advisory (second image).

Sample security advisory for a dependency

(Fridolín Pokorný and Andy Oram, CC BY-SA 4.0)

Advisory details

(Fridolín Pokorný and Andy Oram, CC BY-SA 4.0)

Interestingly, the version of TensorFlow provided by the Node.js package manager (npm) had no security advisories at that time. The programming languages used in this case may be the reason for the difference. However, the apparent inconsistency reminds us that provenance can make a big difference, and we'll show how an automated process for resolving dependencies can adapt to such issues.

Open Source Insights obtains dependency information on Python packages by installing them into a clean environment. Python packages are installed by the pip resolver—the most popular installation tool for Python libraries—from PyPI, the most popular index listing open source Python libraries. Vulnerability information for each package is retrieved from the Open Source Vulnerability database (OSV). OSV acts as a triage service, grouping vulnerabilities across multiple language ecosystems.

Open Source Insights would be a really valuable resource if it had an API; we expect that the developers will add one at some point. Even though the information is currently available only as web pages, the structured format allows automated tools to scrape the pages and look for critical information such as security advisories.

Security Scorecards by the Open Source Security Foundation

Software quality—which is intimately tied to security—calls for basic practices such as conducting regression tests before checking changes into a repository, attaching cryptographic signatures to releases, and running static analysis. Some of these practices can be detected automatically, allowing security experts to rate the security of projects on a large scale.

An effort called Security Scorecards, launched in 2020 and backed by the Open Source Security Foundation (OpenSSF), currently lists a couple of dozen such automated checks. Most of these checks depend on GitHub services and can be run only on projects stored in GitHub. The project is still very useful, given the dominance of GitHub for open source projects, and represents a model for more general rating systems.

Project Thoth

Project Thoth is a cloud-based tool that helps Python programmers build robust applications, a task that includes security checking along with many other considerations. Red Hat started Thoth, and it runs in the Red Hat OpenShift cloud service, but its code is entirely open source. The project has built up a community among Python developers. Developers can copy the project's innovations in other programming languages.

A tool that helps programmers find libraries and build applications is called a resolver. The popular pip resolver generally picks the most recent version of each library, but is sophisticated enough to consider the dependencies of dependencies in a hierarchy called a dependency graph. pip can even backtrack and choose a different version of a library to handle version range specifications found by traversing the dependency graph.

When it comes to choosing the best version of a dependency, Thoth can do much more than pip. Here is an overview of Thoth with a particular eye to how it helps with security.

Thoth overview

Thoth considers many elements of a program's environment when installing dependencies: the CPU and operating system on which the program will run, metadata about the application's container such as the ones extracted by Skopeo, and even information about the GPU that a machine learning application will use. Thoth can take into account several other variables, but you can probably guess from the preceding list that Thoth was developed first to support machine learning in containers. The developer provides Thoth with information about the application's environment in a configuration file.

What advantages does the environment information give? It lets Thoth exclude versions of libraries with known vulnerabilities in the specified environment. A developer who notices that a build fails or has problems during a run can store information about what versions of dependencies to use or avoid in a specification called a prescription, consulted by Thoth for future users.

Thoth can even run tests on programs and their environments. Currently, it uses Clair to run static testing over the content of container images and stores information about the vulnerabilities found. In the future, Thoth's developers plan to run actual applications with various combinations of library versions, using a project from the Python Code Quality Authority (PyCQA) named Bandit. Thoth will run Bandit on each package source code separately and combine results during the resolution process.

The different versions of the various libraries can cause a combinatorial explosion (too many possible combinations to test them all). Thoth, therefore, models dependency resolution as a Markov Decision Process (MDP) to decide on the most productive subset to run.

Sometimes security is not the primary concern. For instance, perhaps you plan to run a program in a private network isolated from the Internet. In that case, you can tell Thoth to prioritize some other benefit, such as performance or stability, over security.

Thoth stores its dependency choices in a lock file. Lock files "lock in" particular versions of particular dependencies. Without the lock files, subtle security vulnerabilities and other bugs can creep into the production application. In the worst case, without locking, users can be confronted with so-called "dependency confusion attacks".

For instance, a resolver might choose to get a library from an index with a buggy version because the index from which the resolver usually gets the dependency is temporarily unavailable.

Another risk is that an attacker might bump up a library's version number in an index, causing a resolver to pick that version because it is the most recent one. The desired version exists in a different index but is overlooked in favor of the one that seems more up-to-date.


Thoth is a complicated and growing collection of open source tools. The basic principles behind its dependency resolutions can be an inspiration for other projects. Those principles are:

  1. A resolver should routinely check for vulnerabilities by scraping websites such as the CVE database, running static checks, and through any other sources of information. The results must be stored in a database.
  2. The resolver has to look through the dependencies of dependencies and backtrack when it finds that some bug or security flaw calls for changing a decision that the resolver made earlier.
  3. The resolver's findings and information passed back by the developers using the resolver should be stored and used in future decisions.

In short, with the wealth of information about security vulnerabilities available these days, we can automate dependency resolution and produce safer applications.

Profile pic
Fridolin works as a Senior Software Engineer in Red Hat's Office of the CTO, emerging technologies group. His background is in the security domain and his interests range from security, machine-learning to C/C++, Python, and distributed systems. Besides that, he loves nature, road cycling, and good books.
Andy Oram
Andy is a writer and editor in the computer field. His editorial projects at O'Reilly Media ranged from a legal guide covering intellectual property to a graphic novel about teenage hackers. Andy also writes often on health IT, on policy issues related to the Internet, and on trends affecting technical innovation and its effects on society.

1 Comment

Thank you, great guide!

Creative Commons LicenseThis work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.