Before word processors included a spell checker, you had to run a separate spell checker on your document. And in the earliest days of Unix, the system did not have a dedicated “spell check” program; instead, you combined a set of commands to do the job yourself. Let’s take a look at how to check spelling the “old school Unix” way.
Check spelling on the command line
Today we do not think much about the spell checker in our word processor. You may not even “run” a spell checker anymore. It’s easier to watch for the red squiggly line that appears under misspelled words; if a word has a red line under it, fix the spelling.
In the early days of Unix, the system provided a dictionary file (usually /usr/share/dict/words on most Linux systems) that contained a sorted list of words from the dictionary, with each word on a line by itself. To check the spelling of a document, you need to compare all the words in your document with the dictionary. And to do that, you need to convert your document to a format that looks like the dictionary file: a sorted list of words, with each word on its own line.
The dictionary file is all lowercase, so first you need to convert your document to lowercase. You do this with the cat command to display the file and the tr command to translate characters from one set to another. In this case, you can ask tr to convert all uppercase letters A-Z to lowercase letters a-z:
cat document | tr A-Z a-z
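To see this step on its own, here is a minimal sketch that runs a single sample sentence through tr. The sample text and the variable names are invented for illustration; they are not part of the original walkthrough.

```shell
# Sample text, made up for this illustration.
sample='Early Unix Had NO Spell Checker'

# Translate every uppercase letter A-Z to the matching lowercase letter a-z.
lowered=$(printf '%s\n' "$sample" | tr A-Z a-z)

printf '%s\n' "$lowered"
```

The last line should print the sentence with every letter in lowercase and everything else unchanged.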
While the dictionary contains words with internal punctuation such as hyphens and apostrophes, it does not contain sentence punctuation such as periods and question marks. So the next step is to use tr again, this time to delete (-d) the characters we do not want:
cat document | tr A-Z a-z | tr -d ',.:;()?!'
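A quick way to check which characters survive this step is to try it on one line of text. This is only a sketch with an invented sample sentence; note that the apostrophe and the hyphen are not in the delete list, so they stay in place.

```shell
# Sample text, made up for this illustration.
sample="wait, didn't the spell-checker (really) work?"

# Delete the listed punctuation characters; everything else passes through.
cleaned=$(printf '%s\n' "$sample" | tr -d ',.:;()?!')

printf '%s\n' "$cleaned"
```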
The dictionary file has each word on a line by itself, so you need to break up your document so that each word appears on its own line. The tr command can replace spaces with a newline character to do this for us:
cat document | tr A-Z a-z | tr -d ',.:;()?!' | tr ' ' '\n'
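Replacing each space with one newline works well when words are separated by exactly one space. The sketch below, using invented sample text, shows what happens with a double space, and then a possible variation (an assumption of mine, not part of the method described here) that uses tr's -s option to squeeze runs of whitespace, including tabs, into a single newline.

```shell
# One newline per space: a double space leaves an empty line behind.
naive=$(printf 'two  spaces\n' | tr ' ' '\n')

# Variation: turn any whitespace into a newline and squeeze repeats (-s),
# so there are no empty lines. '[\n*]' repeats the newline as often as needed.
squeezed=$(printf 'two  spaces\tand a tab\n' | tr -s '[:space:]' '[\n*]')

printf '%s\n' "$naive"
printf '%s\n' "$squeezed"
```

An empty line does no harm to the later steps, but it can show up as an empty entry in the final list of words.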
Sorting the output is easy with the Unix sort command. Add the uniq command to clean up the output by deleting all duplicate words. For example, you probably use the word “the” several times in any document. Using uniq removes the repeated occurrences of “the” so that you only have one “the” in your output.
cat document | tr A-Z a-z | tr -d ',.:;()?!' | tr ' ' '\n' | sort | uniq
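The uniq command only removes duplicates that sit on adjacent lines, which is why the sort has to come first. Here is a small sketch with a made-up word list; it also shows sort -u, which should give the same result in one step.

```shell
# A short word list with duplicates, made up for this illustration.
unique_words=$(printf 'the\ncat\nthe\napple\ncat\n' | sort | uniq)

# The same list through sort -u, which sorts and removes duplicates at once.
unique_words_u=$(printf 'the\ncat\nthe\napple\ncat\n' | sort -u)

printf '%s\n' "$unique_words"
```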
Now you are ready to compare the list of words from your document with the dictionary file! The standard Unix command comm compares two sorted files line by line and identifies lines that are unique to the first file, unique to the second file, or common to both. To find the misspelled words in your document, you want the words that are in your document but not in the dictionary file. Use the -2 option to suppress the words that are unique to the second file and the -3 option to suppress the words that are common to both files. What is left are the words that are unique to your document and do not appear in the dictionary; these are the misspelled words.
cat document | tr A-Z a-z | tr -d ',.:;()?!' | tr ' ' '\n' | sort | uniq | comm -2 -3 - /usr/share/dict/words
The lone hyphen tells comm to read from “standard input,” which is the output of the previous commands in the pipeline.
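To watch comm do its part without involving the full system dictionary, you can try it against a tiny stand-in. In this sketch the three-word dictionary, the word list, and the temporary file are all invented for illustration.

```shell
# Work in a temporary directory so no real files are touched.
tmpdir=$(mktemp -d)

# A tiny stand-in dictionary, already sorted, one word per line.
printf 'document\nspelling\nword\n' > "$tmpdir/dict"

# Sorted words from an imaginary document, read by comm from standard input.
# -2 hides words only in the dictionary; -3 hides words found in both.
misspelled=$(printf 'document\nspeling\nword\n' | comm -2 -3 - "$tmpdir/dict")

printf '%s\n' "$misspelled"

rm -r "$tmpdir"
```

Only the word that is missing from the stand-in dictionary is printed.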
And this is how you can check the spelling in the “old school Unix” way! Let me demonstrate with a test document. I have intentionally misspelled a few words here:
$ cat document
Early Unix didn't have word procesors like we thikn of them today. Instead, you wrote a plain text document that might have embedded special commands to underline text or create a list of bulet points. But how did you check the spelling of your document?
By running the list of commands, you will find this list of misspelled words:
$ cat document | tr A-Z a-z | tr -d ',.:;()?!' | tr ' ' '\n' | sort | uniq | comm -2 -3 - words
bulet
procesors
thikn
The key to checking spelling in this way is the Unix comm command, which compares two sorted lists of words. The two lists need to be sorted in the same way. Your Linux system’s /usr/share/dict/words file may contain some uppercase letters in proper nouns, titles, or place names. For example, the dictionary file on my Fedora 32 system contains both “Minnesota” (correct capitalization for the US state name) and “minnesota” (all lowercase) on adjacent lines. But the Unix sort command sorts uppercase letters separately from lowercase letters. This will confuse the comm command, which will complain that the input file is not sorted correctly. To better match the “old school Unix” method for checking spelling, you may first need to sort the system dictionary file and save it in a separate file. You can do this:
sort /usr/share/dict/words > words
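Putting the pieces together, the whole method could be wrapped in a small shell function. This is only a sketch under a few assumptions of my own: the function name spellcheck is hypothetical, the dictionary is re-sorted into a temporary file on every run, and LC_ALL=C is set for sort and comm so that both use the same byte-order collation, which is one way to make sure the two lists are sorted the same way. The demo uses an invented six-word dictionary instead of /usr/share/dict/words.

```shell
# Hypothetical helper: spellcheck DOCUMENT DICTIONARY
# Prints the words in DOCUMENT that do not appear in DICTIONARY.
spellcheck() {
    # Sort the dictionary with the same collation rules used for the document.
    sorted_dict=$(mktemp)
    LC_ALL=C sort "$2" > "$sorted_dict"

    # Lowercase, strip punctuation, split into words, sort, de-duplicate,
    # then keep only the words that are unique to the document.
    tr A-Z a-z < "$1" | tr -d ',.:;()?!' | tr ' ' '\n' |
        LC_ALL=C sort | uniq | LC_ALL=C comm -2 -3 - "$sorted_dict"

    rm -f "$sorted_dict"
}

# Demo with throwaway files (contents invented for illustration).
tmpdir=$(mktemp -d)
printf 'The cat sat on teh mat.\n' > "$tmpdir/document"
printf 'The\nthe\ncat\nsat\non\nmat\n' > "$tmpdir/dict"

result=$(spellcheck "$tmpdir/document" "$tmpdir/dict")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

On a real system you would call it as spellcheck document /usr/share/dict/words, assuming that file exists.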