The source code is the license

For open source software, license information is embedded in the source code. To reduce complexity, you can generate different views.

You can find the license information for open source software by looking at the source code. Different views, or reports, of that license information can be generated to address differing needs.

While providing license information directly in the source code is not a requirement for open source software, the practical benefits of doing so became apparent early. As open source licenses facilitate movement of software, license information that travels with the code simplifies administration by making the statements of the permissions readily available to those who have the code, even if they receive the code indirectly.

What are the license terms?

The value of embedding the license information in the source tree is underappreciated. Let us pause and reflect for a moment on how useful this common practice has been [insert a moment of silence here].

What are the license terms? For much open source software, there is a simple answer: A single license text contains all the license information for the entire body of software. But the power of open source is that it facilitates other developers building upon that starting point, and that process can complicate license information.

Open source software can be extended, repurposed, and combined with other software. Collaboration by a diverse group is challenging for mechanical devices, but complex software can readily benefit from the work of many. Open source licenses provide the permissions that make this development dynamic possible. Software with a complex history may also have complex license information.

Consider the following example: Someone writes a new program, including in each source file a copyright notice with a statement that the software is licensed under the Apache License, version 2.0, and including in the root of the source tree a copy of the text of the Apache License. Later, a file is added with a different copyright notice and a copy of a BSD 2-clause license. Then a new subdirectory is added, in which files have Apache license statements, but with copyright notices that identify a different copyright owner. Then a copy of an MIT License is added to a new subdirectory that includes files with copyright notices that are the same as in the MIT License file, but without any other license indication.

This example shows that license information embedded in a source tree can be complex and detailed. There may be license texts in the root and/or in various subdirectories. Some source files may have license notices; others may not. There may be copyright notices identifying various copyright holders. Separating the legal-looking bits from the code may not be possible without loss of information. Thus, the source code is the license.

Seen in the context of the source tree, interpretation of the license information in the example above is fairly straightforward. However, it would be challenging to capture that license information in a simple and unambiguous standalone statement. A license statement that captures all the license information present in the source code would be shorter than the source code, but it would be awkward. Who would want such a highly detailed standalone statement? Most users would likely prefer a summary that, while incomplete, captures elements that match their own particular interests and sensitivities.

Summarizing license information: views

Responding to "What are the license terms?" with a copy of the full source tree may not be seen as helpful, because the answer is bulky and the license information is spread thinly through it. Most people want a summary. But there is a challenge: When the license information is complex, people want different summaries because they have differing ideas of what is important.

For some, answering "yes" to the following questions might be adequate: Is the software 1) licensed under one or more open source licenses, and 2) assembled and licensed such that its distribution and use are consistent with all those licenses? Others may want a list of all the licenses, or they may want to see which software component corresponds to which license. Still others might want a component-by-component list that identifies any copyleft licenses (perhaps to do their own deep dive into copyleft compliance). And some might have an interest in seeing all the copyright notices and associated lists of software components.

A single summary will likely not address the interests of all. Simply making the summary more detailed may reduce its utility to some while remaining inadequate for others. Thus, there is a need for different "views" of the license information that is expressed in the source code. Think of the term view here as similar to how it is used in reference to databases. Alternatively, you might think of views as "reports."
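The database analogy can be sketched in a few lines of Python. This is only an illustration: the records are hypothetical and loosely modeled on the example source tree described earlier, and the file paths, copyright holder names, field names, and the set of licenses treated as copyleft are all assumptions made for the sketch, not the output of any real tool.

```python
from collections import defaultdict
from typing import NamedTuple


class FileRecord(NamedTuple):
    path: str      # file path relative to the root of the source tree
    license: str   # license identified in or near the file
    holder: str    # copyright holder named in the notice


# Hypothetical records, loosely following the example source tree.
RECORDS = [
    FileRecord("main.c", "Apache-2.0", "Original Author"),
    FileRecord("util.c", "Apache-2.0", "Original Author"),
    FileRecord("compat.c", "BSD-2-Clause", "Second Author"),
    FileRecord("plugins/a.c", "Apache-2.0", "Third Author"),
    FileRecord("vendor/b.c", "MIT", "Fourth Author"),
]

# Assumption: each reader decides which licenses to flag as copyleft.
COPYLEFT = {"GPL-2.0-only", "GPL-3.0-only", "LGPL-2.1-only", "MPL-2.0"}


def license_list(records):
    """View: the distinct licenses, sorted."""
    return sorted({r.license for r in records})


def files_by_license(records):
    """View: which files correspond to which license."""
    view = defaultdict(list)
    for r in records:
        view[r.license].append(r.path)
    return dict(view)


def copyleft_files(records, copyleft=COPYLEFT):
    """View: only the files under a license the reader flags as copyleft."""
    return [r.path for r in records if r.license in copyleft]


def files_by_holder(records):
    """View: copyright holders and the files that name them."""
    view = defaultdict(list)
    for r in records:
        view[r.holder].append(r.path)
    return dict(view)
```

Each function answers one of the questions above from the same underlying records, and none of them needs to anticipate what the others report. That is the sense in which a view differs from a single, ever more detailed summary.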

There is an advantage in thinking of (a) the source code as the license, and (b) there being multiple different views that may be extracted from it.

You might try to create a "do-everything" summary, from which other shorter summaries could be created. But an intermediate representation of license information has at least three shortcomings:

Timing: The maintainer of that master summary may not update on your schedule.
Versions: The master summary may be based on a different version of the software than what you use.
Quality: Your view inherits the error and judgment characteristics of the master.

Thus, there is value in generating your preferred view on demand, directly from the version of the source tree that you use.

Tools can generate views. On-demand view generation depends on tooling, and how well that tooling works depends on how clearly the license information is represented in the source. We do not need machine-specific coding of license information, but we should take advantage of the many sources of experience representing information in ways that are both human-readable and machine-extractable.
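As a rough illustration of on-demand extraction, the Python sketch below walks a source tree and collects one kind of notation that is both human-readable and machine-extractable: the SPDX short-form tag (`SPDX-License-Identifier: ...`) that some projects place near the top of each file. Whether a given project uses such tags is an assumption; the function names, the 30-line search window, and the handling of comment closers are choices made for the sketch. Files without a tag are reported as untagged, not guessed at, so a real tool would also need to read license texts and free-form notices.

```python
import os
import re
from collections import Counter

# Matches lines such as "// SPDX-License-Identifier: Apache-2.0".
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")
# Comment closers that may trail the tag on the same line.
COMMENT_CLOSERS = ("*/", "-->")


def clean(expression):
    """Strip whitespace and any trailing comment closer from a tag value."""
    expression = expression.strip()
    for closer in COMMENT_CLOSERS:
        if expression.endswith(closer):
            expression = expression[: -len(closer)].strip()
    return expression


def read_tag(path, max_lines):
    """Return the first SPDX tag value in the first max_lines lines, or None."""
    try:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for _ in range(max_lines):
                line = handle.readline()
                if not line:
                    break
                match = SPDX_TAG.search(line)
                if match:
                    return clean(match.group(1))
    except OSError:
        pass  # unreadable files are reported as untagged
    return None


def scan_tree(root, max_lines=30):
    """Map each file under root (by relative path) to its tag value or None."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            found[os.path.relpath(path, root)] = read_tag(path, max_lines)
    return found


def summary_view(found):
    """A coarse view: number of files per tag value (None counts untagged files)."""
    return Counter(found.values())
```

Because `scan_tree` runs against whatever tree it is pointed at, the resulting view reflects the exact version in use, with no intermediate summary to go stale.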

In his article "An economically efficient model for open source software license compliance," Jeff Kaufman makes a related point: Because the source code contains the license information, distributing source code can be an efficient way of meeting certain license requirements.

Embedding all the license information in the source tree is a best practice. If you discover that license information is not represented in the source tree, consider improving the project by submitting a bug report recommending that it be added.

The source code is the license. From that complete record, views of license information can be generated. Tools can extract license information into various reports to meet particular needs or sensitivities.

We have work to do to obtain the full benefits of this vision. What is your sense of the state of tools and license information representation?