REUSE Machine Readable License Information

Some weeks ago I wrote about SPDX identifiers and how they can be used to annotate source code files with machine readable license information. In this post, now I want to compile the things I learned after looking more deeply into this topic and how it might be applied to KDE.

SPDX identifiers are an important step in order to allow tools an automatic reading and checking of license information. However, as most norms, the SPDX specification is quite general, for many people cumbersome to read and allows many options on how to use the identifiers; while me as a developer, I just want to have a small howto that explains how I have to state my license inormation. Another point is that in my experience any source code annotation with machine readable information is pointless unless you have a tool that automatically checks the correctness. Otherwise, there is a big burden on code reviews that would have to check tiny syntactical requirements from a specification. — If you look deeply into the used license headers in KDE (I did this), there is a shocking number of different license statements that often state exactly the same. This might be due to different formatting or typos but also due to actual faults when trying to pick the correct license for a use case, which somehow got mixed up.

REUSE.software

While doing research on best practices for applying machine readable license information, I was pointed to the REUSE.software initiative, which was started by the Free Software Foundation Europe (FSFE) to provide a set of recommendations to make licensing easier. What they provide is (in my opinion) a really good policy for how to uniformly state SPDX based license headers in source files, how to put license texts into the repository and a way to automatically check the syntactical correctness of the license statements with a small conformance testing tool.

I really like the simplicity of their approach, where I mean simplicity in the amount of documentation you have to read to understand how to do it correctly.

Meanwhile in KF5…

As already said, I want to see machine readable license information in KDE Frameworks in order to increase their quality and to make them easier to use outside of KDE. The first step to be done before introducing any system for machine readable identifiers is to understand what we have inside our repositories right now.

Disclaimer: I know that there are many license parsing tools out there in the wild and I know that several of them are even well established.

Yet, I looked into what we have inside our KF5 repositories and what has to be detected: Most of our licenses are GPL*, LGPL*, BSD-*-Clause, MIT or a GPL/LGPL variant with the “any later version accepted by the membership of KDE e.V. (or its successor approved by the membership of KDE e.V.), which shall act as a proxy […] of the license.” addition. After a lot of reasoning, I came to the result that for the specific use case of detecting the license headers inside KDE project (even focused only on frameworks right now) it makes most sense to have a small tool only working for this use case. The biggest argument for me was that we need a way to deal with the many historic license statements from up to 20 years ago.

Thus, I started a small tool in a scratch repository, named it licensedigger and started the adventure to parse all license headers of KDE Frameworks. From all source files right now I am done with about 90%. Yet, I neglected to look into KHTML, KJS, KDE4LibsSupport and the other porting aid frameworks until now. Specifically for the Tier 1 and Tier 2 frameworks I am mostly done and even beasts like KIO can be detected correctly right now. Thus, I am still working on it to increase the number of headers to be detected.

The approach I took is the following:

  • For every combination of licenses there is one regular expression (which only has the task to remove whitespace and any “*” characters).
  • For every license check there is a unit test consisting of a plaintext license header and a original source code file that guarantees that the header is found.
  • Licenses are internally handled with SPDX markers.
  • For a new license or a license header statement for an already contained license, the license name must be stated multiple times to ensure that copy-past errors with licenses are minimized.
  • It is OK if the tool only detects ~95% of the license headers, marks unknown headers clearly and requires that the remaining 2-3 files per repository have to be identified by hand.

At the moment, the tool can be run to provide a list of license for any folder structure, e.g. pointing it to “/opt/kde/src/frameworks/attica” or even on “/opt/kde/src/frameworks” will produce a long list of file names with their licenses. A next (yet simple) step will be to introduce a substitution mode that replaces the found headers with SPDX markers and further to add the license files in a way that is compatible with REUSE.

Please note that there was no discussion yet on the KDE mailing list if this (i.e. the REUSE way) is the right way to go. But I will send respective mails soon. This post is mostly meant to provide a little bit of background before starting a discussion, such that I can keep mails shorter.

The Road Towards KF6 & SPDX License Identifiers

Last weekend, I had the opportunity to join the planning sprint for KDE Frameworks 6 in Berlin. KF6 will be the next major version release of the KDE Frameworks (a set of add-on libraries to make your life much easier when developing libraries on top of Qt), which will be based on Qt6. There are several blogs out in the wild about the goals for this release. Mainly, we aim for the following:

  • Getting a better separation between logic and platform UI + backend, which will help much on non-Linux systems as Android, MacOS, and Windows.
  • Cleaning up dependencies and making it easier to use the existing Tier 3 frameworks. Note that the Framework libraries are organized in Tiers, which define a layer based dependency tree. Tier 1 libraries may only depend on Qt; Tier 2 libraries may depend on Qt and Tier 1 libraries; and Tier 3 libraries may depend on Qt, Tier 1 and Tier 2 libraries — you see the problem with Tier 3 😉

For details about the framework splittings and cleanups I want to point to the excellent blog posts by David, Christoph 1 / 2 / 3, Kevin, Kai Uwe, Volker, and Nico. However, in this post I want to focus on one of my pet projects in the KF6 cleanup:

Software Package Data Exchange (SPDX)

With KF6, I want to see SPDX license identifiers being introduced into KDE frameworks in order to ease the framework re-use in other projects. This follows the same approach e.g. the Linux Kernel took over the last years.

The problem that the SPDX markers address is the following: When publishing source code under an open source license, each source code file shall explicitly state the license it is released with. The usual way this is done is that a developer copies a license header text from the KDE licensing policies wiki, from another source file, or from somewhere else from the internet and puts it at the top of their newly created source code file. Thus the result is that today we have many slightly different license headers all over our frameworks source files (even if they only differ in formatting). Yet, these small differences make it very hard to introduce automatic checks for the source code licenses in terms of static analysis. This problem becomes even more urgent when one wants to check that a library, which consists of several source files with different licenses, does only contain compatible licenses.

The SPDX headers solve this problem by introducing a standardized language that annotates every source code file with license information in the SPDX syntax. This syntax is rich enough to express all of our existing license information and it can also cover more complicated cases like e.g. dual-licensed source files. For example, an “LGPL 2.1 or any later version” license header of a source file looks as:

// SPDX-License-Identifier: LGPL-2.1-or-later

The full list of all existing SPDX markers are available in the SPDX license registry.

The first step now is to define how to handle the GPL and LGPL license headers with specific KDE mentioning, as their is no direct equivalent in the SPDX registry. This is a question we are about to discuss with OSI. After deciding that we have to discuss in the KDE community if SPDX is the way to go (gladly, there was no objection yet to my mail to the community list) and to adapt our KDE licensing policy. And the final big step then will be to get the tooling ready for checking all existing licenses headers and to replace them (after review) with SPDX markers.

PS: Many thanks to MBition for the great planning location for the KF6 sprint in the MBition offices and to the KDE e.V. for the travel support!