Spellchecking Comments in the Linux Kernel

Linus Torvalds recently remarked:

Oh, and as a sign that 2.6.x really _is_ approaching, people have started sending me spelling fixes. Kernel coders are apparently all atrocious spellers, and for some reason the spelling police always comes out of the woodwork when stable releases get closer.

Therefore, to help the spelling police, a few others and I have put together the following tools, tips, and data files for anyone who wants to fix spelling errors in the Linux kernel source.

A Cautionary Note

Fixing spelling errors in the main source tree has one known unpleasant side effect: it becomes harder to apply old patches. See Dave Jones's comment.

One possible way this could be addressed is for the spelling police to offer a patch update service. Here's how it might work:

Get a copy of the main tree from before the majority of spelling fixes (say, 2.5.63), and call it kernel-orig.
Copy that to kernel-spellfix, and apply the spellfix patches.
For each patch you wish to update:
1. Make a working copy of kernel-orig, called kernel.
2. Apply the patch.
3. Apply the spellfix patches.
4. Regenerate the patch with diff -Naur kernel-spellfix kernel.
5. Verify that the regenerated patch applies to kernel-spellfix.
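
The procedure above could be scripted roughly as follows. This is only a sketch under several assumptions: the function name update_patch_commands is made up for illustration, all patch paths are absolute and free of whitespace, every patch is in -p1 format, and --dry-run is the GNU patch option. Rather than running anything itself, the function prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tool): print the commands that
# update one old patch so that it applies on top of the spellfixed tree.
# Assumes kernel-orig and kernel-spellfix exist in the current directory, and
# that all patch paths are absolute and contain no whitespace.
#
# usage: update_patch_commands OLD_PATCH NEW_PATCH SPELLFIX_PATCH...
update_patch_commands() {
    old_patch=$1
    new_patch=$2
    shift 2

    # Make a fresh working copy of kernel-orig, called kernel.
    echo "rm -rf kernel"
    echo "cp -a kernel-orig kernel"

    # Apply the old patch, then the spellfix patches in the same order
    # in which they were applied to kernel-spellfix.
    echo "patch -d kernel -p1 -i $old_patch"
    for fix in "$@"; do
        echo "patch -d kernel -p1 -i $fix"
    done

    # Regenerate the patch. diff exits with status 1 when the trees differ,
    # which is the expected case here, so only status 2 or more is an error.
    echo "diff -Naur kernel-spellfix kernel > $new_patch || test \$? -eq 1"

    # Check that the new patch applies to kernel-spellfix without changing it
    # (--dry-run is a GNU patch option).
    echo "patch -d kernel-spellfix -p1 --dry-run -i $new_patch"
}
```

For example, `update_patch_commands /tmp/old.patch /tmp/new.patch /tmp/spellfixes/*.patch | sh -e` would run the steps and stop at the first failure. A spellfix patch that touches the same lines as the old patch may fail to apply at that point and has to be resolved by hand.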

Doing this would require having on hand all the spellfix patches. That means we probably want to maintain the spellfix patches in a central location for future reference. Any volunteers? I'd be happy to link to them.

The patches that Dave Jones was concerned about were the "2.4 kernel commit archives". We might want to proactively try updating these. Once we have an archive of patches that need updating, we could probably automate the update process.

Only fix the howlers

The goal of a good spelling policeman should simply be to fix the typos so glaringly obvious that nobody could object. Do not fix puns. "Bork" and "borken" are perfectly good punny words, as is "dain bramage". If it looks like it might be funny to someone, leave it alone.

Both American and British English spellings are fine by this definition, as is jargon. In fact, if you can find a word in any dictionary at all, it's probably ok. (For instance, check dictionary.cambridge.org first, and onelook.com as a backup.)

Fixing spelling mistakes requires careful manual review

Correcting spelling mistakes is not as easy as it sounds; it cannot really be automated, as many reasonable-sounding corrections could actually change the meaning of the comments.

All changes must be carefully manually reviewed at some point, possibly by more than one person. Jared Smith wrote:

I have tried to automatically spell-check long, complex texts for years, with numerous algorithms; all of them fail for one reason or another, and I find that the only proper way to do it is the tedious work by hand.
Even a single lost pun because of overenthusiastic spellchecking is not worth the cleanup. I would prefer to see typos than lose a single intentional 'misspelling'. It would be best if you posted all changes somewhere so that they could be verified manually.

Avoid breaking the build!

One person submitted a cant -> can't "fix" that caused compilation errors, incurring the hot flaming wrath of a number of developers. Don't do that. (See Linus's response.)

To avoid submitting a spellfix that breaks real code, consider following these simple rules:

Only fix comments delimited by /* ... */ or //. (The spell-fix.pl tool listed below is careful to do that.)
Build after you fix. Don't submit anything that doesn't compile.
Don't fix any comments in code you can't build.
To be really safe, don't fix any punctuation at all.
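
One way to check the first rule mechanically is to strip the comments from the old and new versions of a file and confirm that what remains is identical. The sketch below does that with awk. The function names strip_c_comments and same_code are invented for this example, and the stripper is deliberately naive: it does not understand string literals, so a "//" or "/*" inside a string is treated as the start of a comment. It complements a test build and does not replace one.

```shell
#!/bin/sh
# Hypothetical helpers, written for illustration only.

# strip_c_comments FILE: print FILE with /* ... */ and // comments removed.
# Naive: comment markers inside string literals are not recognized as strings.
strip_c_comments() {
    awk '
    {
        line = $0
        out = ""
        while (length(line) > 0) {
            if (in_block) {
                # Inside a block comment: skip up to the closing marker.
                e = index(line, "*/")
                if (e == 0) { line = ""; break }
                line = substr(line, e + 2)
                in_block = 0
            } else {
                b = index(line, "/*")
                s = index(line, "//")
                if (b == 0 && s == 0) {
                    # No comment on the rest of this line.
                    out = out line
                    line = ""
                } else if (s > 0 && (b == 0 || s < b)) {
                    # A // comment comes first: drop the rest of the line.
                    out = out substr(line, 1, s - 1)
                    line = ""
                } else {
                    # A block comment comes first: keep the code before it.
                    out = out substr(line, 1, b - 1)
                    line = substr(line, b + 2)
                    in_block = 1
                }
            }
        }
        print out
    }' "$1"
}

# same_code OLD NEW: succeed only if the two files are identical once
# comments have been stripped.
same_code() {
    [ "$(strip_c_comments "$1")" = "$(strip_c_comments "$2")" ]
}
```

Running `same_code old/foo.c new/foo.c || echo "foo.c: code changed"` over every file touched by a spellfix would flag a change like the cant -> can't one before it is submitted.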

Finding misspellings

There are several ways to identify spelling problems.

Pick an Error and Hunt It Down

In the case of the "loose -> lose" changes, several people picked a known common spelling mistake, and went looking for examples of this mistake in the sources. This is a good and careful way to attack the problem.
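
For a hunt like this, a plain find and grep is usually enough. The sketch below wraps them in a small function (hunt_word is a made-up name) that lists every line containing a given word, as a whole word, in the .c and .h files under a directory. It assumes a grep that supports -w and -H, as GNU and BSD grep do.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# hunt_word WORD DIR: print file:line:text for every line in the *.c and *.h
# files under DIR that contains WORD as a whole word.
hunt_word() {
    find "$2" -name '*.[ch]' -exec grep -nHw -- "$1" {} +
}
```

For example, `hunt_word loose linux` lists the candidates for a "loose -> lose" fix. The hits include code as well as comments, and legitimate uses of the word, so each one still has to be read before it is changed.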

Batch Mode Spellcheckers

lspell.pl is a Perl script that uses a standard Linux spellchecker to list suspicious words in all the files you pass it on the command line. It treats its first argument as a file containing stopwords, one per line. For each file containing suspicious words that are not listed in the stopword file, it prints the name of the file, a colon, and the number of suspicious words; then a blank line; then the suspicious words, all on one line; then another blank line.

To generate a stopword file containing all the nonwords from the noncomment part of the kernel source, run

find linux -name '*.[ch]' | xargs perl lspell2.pl /dev/null | grep -v ':' | sort -f | uniq > stop1.txt

To generate a stopword file containing all nonwords from the comment part of the kernel source, spellcheck the entire kernel tree using the stopword file generated above:

find linux -name '*.[ch]' | xargs perl lspell.pl stop1.txt | grep -v ':' | sort -f | uniq > stop2.txt

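then edit 'stop2.txt' and remove the lines that are obviously spelling errors, leaving only legitimate jargon and identifiers. Whatever stays in a stopword file is ignored by lspell.pl, so a misspelling left in it would never be reported.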

Finally, generate a master stopword file by combining stop1.txt and stop2.txt:

sort -u stop[12].txt > stop.txt

Here's where to get the scripts and data files mentioned above:

lspell.pl - the spellchecker
lspell2.pl - a stopword file generator
stop.txt - a complete stopword file for the 2.5.63bk5 Linux kernel, generated as described above

Example output of lspell.pl:

linux-2.5.63-bk5.old/include/asm-s390x/atomic.h: 1

enviroment

linux-2.5.63-bk5.old/include/asm-s390x/rwsem.h: 1

consequtive

linux-2.5.63-bk5.old/include/asm-s390x/dasd.h: 3

featueres Perfomance requests's

linux-2.5.63-bk5.old/include/asm-s390x/pgtable.h: 3

lenght regiontable specifiation

You can use the output of this program and a little elbow grease to create a corrections file for the next program:
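
As a rough illustration of that elbow grease, the sketch below reads lspell.pl output in the format shown above and lists the suspicious words that turn up in at least a given number of files, most frequent first. The function name common_suspects is invented here, and the parsing assumes that every header line ends in a colon, a space, and a number, as in the example output.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# common_suspects [MIN_FILES]: read lspell.pl output on standard input and
# print "count word" for each suspicious word found in at least MIN_FILES
# files (default 3), sorted by descending count.
common_suspects() {
    awk -v min="${1:-3}" '
        /: [0-9]+$/ { file = $0; next }   # header line: "path: count"
        NF == 0     { next }              # blank separator line
        {
            # Word line: count each word at most once per file.
            for (i = 1; i <= NF; i++)
                if (!seen[file, $i]++)
                    files[$i]++
        }
        END {
            for (w in files)
                if (files[w] >= min)
                    print files[w], w
        }
    ' | sort -k1,1nr -k2
}
```

Each word it prints still needs a human to decide whether it is really an error and what the correct spelling is.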

Batch Mode Spelling Fixers

spell-fix.pl is a Perl script by Matthias Schniedermeyer which, given a specific list of known spelling errors and their correct spelling, corrects any instance of those errors in an entire directory hierarchy. Here's where to get it (though Matthias may have posted a newer copy to the linux-kernel list):

spell-fix.pl - the spelling corrector
spell-fix.txt - a proposed list of corrections for the 2.6.0-pre kernel based on errors occurring in 3 or more kernel source files, plus updates from the spelling police squad; last updated 1 Sept 2003 by Steven Cole.
joinfix.pl - a small script which reads a correction file in one-per-line "correct=bad" format and outputs it in the "correct=bad1,bad2,bad3..." format required by spell-fix.pl
unjoinfix.pl - a small script which reads a correction file in the "correct=bad1,bad2,bad3..." format required by spell-fix.pl and outputs it in the one-per-line "correct=bad" format needed to merge correction files
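
To make the two correction-file formats concrete, here is a minimal awk sketch of the same conversions. These are not the joinfix.pl and unjoinfix.pl scripts themselves; the function names are invented, and the only format details assumed are the ones described above (one "=" per line, bad spellings separated by commas).

```shell
#!/bin/sh
# Hypothetical stand-ins for illustration; not the actual Perl scripts.

# join_corrections: read one-per-line "correct=bad" on standard input and
# print "correct=bad1,bad2,..." with one line per correct spelling, in order
# of first appearance.
join_corrections() {
    awk -F= '
        NF == 2 {
            if ($1 in bad) {
                bad[$1] = bad[$1] "," $2
            } else {
                bad[$1] = $2
                order[++n] = $1
            }
        }
        END {
            for (i = 1; i <= n; i++)
                print order[i] "=" bad[order[i]]
        }
    '
}

# unjoin_corrections: the reverse, one "correct=bad" pair per output line.
unjoin_corrections() {
    awk -F= '
        NF == 2 {
            n = split($2, b, ",")
            for (i = 1; i <= n; i++)
                print $1 "=" b[i]
        }
    '
}
```

Merging two correction files then amounts to something like `cat a.txt b.txt | unjoin_corrections | sort -u | join_corrections`, where a.txt and b.txt are placeholder names.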

typo.sh is a shell script by Francois Gouget which corrects a built-in list of common spelling mistakes. He wrote it to use with the Wine source tree, but it works ok on the kernel source tree as well. His post to lkml says the script is at fgouget.free.fr/typos. and kernel patches based on it are at fgouget.free.fr/tmp/linux-spelling/.

Stephane LOEUILLET also had a go at an automated typo fixer. Here's his script and his list of typos. See his first and his second posts to lkml on the subject.

Reviewing and Submitting Patches

Once you've corrected the spelling of a bunch of kernel source, the next step is to make a patch, review it carefully, and submit it to the linux-kernel mailing list.

Patches should be against a tree as close to Linus's BK tree as possible. One way to get a good reference tree is to download the latest released 2.5 kernel from www.kernel.org, and then patch it with the "gzipped full patch" from www.kernel.org/pub/linux/kernel/v2.5/testing/cset.

The patch turns out to be a very good place to review the proposed changes, since it shows a couple lines of context. If you don't like a proposed change, you can edit the patch to remove the hunk containing the change.

There is some debate about whether to submit a single patch for each kind of spelling error (e.g. "loose -> lose"), or a single patch for each area of the kernel source. Both approaches are probably good, but in any case, patches should be small and carefully reviewed by hand before submission.

Please have a look at earlier spelling patches accepted into Linus's tree. Linus seems to be applying patches that fix a single kind of spelling error (e.g. [PATCH] Spelling fixes: accommodate). Look at Linus's testing tree changeset page and search for "spelling".

Interactive Spellcheckers

I list this option last, as I haven't tried it.

International Ispell supports simple plug-in filters that let it spell-check just the portions of a document that are of interest, say the comments in a C program. ispell-c-comments.c is a filter that should be compatible with International Ispell.