Some weeks ago I wrote about SPDX identifiers and how they can be used to annotate source code files with machine readable license information. In this post, now I want to compile the things I learned after looking more deeply into this topic and how it might be applied to KDE.
SPDX identifiers are an important step in order to allow tools an automatic reading and checking of license information. However, as most norms, the SPDX specification is quite general, for many people cumbersome to read and allows many options on how to use the identifiers; while me as a developer, I just want to have a small howto that explains how I have to state my license inormation. Another point is that in my experience any source code annotation with machine readable information is pointless unless you have a tool that automatically checks the correctness. Otherwise, there is a big burden on code reviews that would have to check tiny syntactical requirements from a specification. — If you look deeply into the used license headers in KDE (I did this), there is a shocking number of different license statements that often state exactly the same. This might be due to different formatting or typos but also due to actual faults when trying to pick the correct license for a use case, which somehow got mixed up.
While doing research on best practices for applying machine readable license information, I was pointed to the REUSE.software initiative, which was started by the Free Software Foundation Europe (FSFE) to provide a set of recommendations to make licensing easier. What they provide is (in my opinion) a really good policy for how to uniformly state SPDX based license headers in source files, how to put license texts into the repository and a way to automatically check the syntactical correctness of the license statements with a small conformance testing tool.
I really like the simplicity of their approach, where I mean simplicity in the amount of documentation you have to read to understand how to do it correctly.
Meanwhile in KF5…
As already said, I want to see machine readable license information in KDE Frameworks in order to increase their quality and to make them easier to use outside of KDE. The first step to be done before introducing any system for machine readable identifiers is to understand what we have inside our repositories right now.
Disclaimer: I know that there are many license parsing tools out there in the wild and I know that several of them are even well established.
Yet, I looked into what we have inside our KF5 repositories and what has to be detected: Most of our licenses are GPL*, LGPL*, BSD-*-Clause, MIT or a GPL/LGPL variant with the “any later version accepted by the membership of KDE e.V. (or its successor approved by the membership of KDE e.V.), which shall act as a proxy […] of the license.” addition. After a lot of reasoning, I came to the result that for the specific use case of detecting the license headers inside KDE project (even focused only on frameworks right now) it makes most sense to have a small tool only working for this use case. The biggest argument for me was that we need a way to deal with the many historic license statements from up to 20 years ago.
Thus, I started a small tool in a scratch repository, named it licensedigger and started the adventure to parse all license headers of KDE Frameworks. From all source files right now I am done with about 90%. Yet, I neglected to look into KHTML, KJS, KDE4LibsSupport and the other porting aid frameworks until now. Specifically for the Tier 1 and Tier 2 frameworks I am mostly done and even beasts like KIO can be detected correctly right now. Thus, I am still working on it to increase the number of headers to be detected.
The approach I took is the following:
- For every combination of licenses there is one regular expression (which only has the task to remove whitespace and any “*” characters).
- For every license check there is a unit test consisting of a plaintext license header and a original source code file that guarantees that the header is found.
- Licenses are internally handled with SPDX markers.
- For a new license or a license header statement for an already contained license, the license name must be stated multiple times to ensure that copy-past errors with licenses are minimized.
- It is OK if the tool only detects ~95% of the license headers, marks unknown headers clearly and requires that the remaining 2-3 files per repository have to be identified by hand.
At the moment, the tool can be run to provide a list of license for any folder structure, e.g. pointing it to “/opt/kde/src/frameworks/attica” or even on “/opt/kde/src/frameworks” will produce a long list of file names with their licenses. A next (yet simple) step will be to introduce a substitution mode that replaces the found headers with SPDX markers and further to add the license files in a way that is compatible with REUSE.
Please note that there was no discussion yet on the KDE mailing list if this (i.e. the REUSE way) is the right way to go. But I will send respective mails soon. This post is mostly meant to provide a little bit of background before starting a discussion, such that I can keep mails shorter.