Create a Word List

Use UNIX Commands to Create an Alphabetized Word List from a Plain Text File

Introduction

Vocabulary lists are a useful way to identify words of unknown meaning when reading a document. In particular, students of a foreign language may find a vocabulary list useful when preparing to read a literary work. A comprehensive vocabulary list can be quickly created from a plain text file on a computer by using a series of UNIX shell commands. The commands can even generate a list that is alphabetized and non-redundant. Herein I will demonstrate the use of UNIX shell commands to derive such a vocabulary list of Spanish words from the literary work Don Quijote, written by Miguel de Cervantes Saavedra.

Download Plain Text of Cervantes' Don Quijote

First, the book is downloaded in plain text format from the Project Gutenberg website and saved locally in the working directory as the file CervantesDonQuijote.txt. The wget command downloads files over HTTP, HTTPS, or FTP, and its -O parameter specifies the name under which the downloaded file is saved in the working directory of our computer.

wget https://www.gutenberg.org/files/2000/2000-0.txt -O CervantesDonQuijote.txt
## --2021-07-14 00:07:14--  https://www.gutenberg.org/files/2000/2000-0.txt
## Resolving www.gutenberg.org (www.gutenberg.org)... 2610:28:3090:3000:0:bad:cafe:47, 152.19.134.47
## Connecting to www.gutenberg.org (www.gutenberg.org)|2610:28:3090:3000:0:bad:cafe:47|:443... connected.
## HTTP request sent, awaiting response... 200 OK
## Length: 2226045 (2.1M) [text/plain]
## Saving to: ‘CervantesDonQuijote.txt’
## 
##      0K .......... .......... .......... .......... ..........  2%  445K 5s
##     50K .......... .......... .......... .......... ..........  4%  894K 3s
##    100K .......... .......... .......... .......... ..........  6% 22.1M 2s
##    150K .......... .......... .......... .......... ..........  9%  909K 2s
##    200K .......... .......... .......... .......... .......... 11% 37.8M 2s
##    250K .......... .......... .......... .......... .......... 13% 42.0M 1s
##    300K .......... .......... .......... .......... .......... 16% 57.6M 1s
##    350K .......... .......... .......... .......... .......... 18% 1.45M 1s
##    400K .......... .......... .......... .......... .......... 20% 1.57M 1s
##    450K .......... .......... .......... .......... .......... 23% 75.8M 1s
##    500K .......... .......... .......... .......... .......... 25% 66.7M 1s
##    550K .......... .......... .......... .......... .......... 27% 17.3M 1s
##    600K .......... .......... .......... .......... .......... 29% 79.1M 1s
##    650K .......... .......... .......... .......... .......... 32% 36.4M 1s
##    700K .......... .......... .......... .......... .......... 34% 73.4M 1s
##    750K .......... .......... .......... .......... .......... 36% 2.85M 1s
##    800K .......... .......... .......... .......... .......... 39% 1.63M 1s
##    850K .......... .......... .......... .......... .......... 41% 16.7M 0s
##    900K .......... .......... .......... .......... .......... 43% 80.9M 0s
##    950K .......... .......... .......... .......... .......... 46% 54.6M 0s
##   1000K .......... .......... .......... .......... .......... 48% 93.4M 0s
##   1050K .......... .......... .......... .......... .......... 50% 15.9M 0s
##   1100K .......... .......... .......... .......... .......... 52%  128M 0s
##   1150K .......... .......... .......... .......... .......... 55%  104M 0s
##   1200K .......... .......... .......... .......... .......... 57%  136M 0s
##   1250K .......... .......... .......... .......... .......... 59% 17.7M 0s
##   1300K .......... .......... .......... .......... .......... 62% 28.1M 0s
##   1350K .......... .......... .......... .......... .......... 64%  140M 0s
##   1400K .......... .......... .......... .......... .......... 66% 21.8M 0s
##   1450K .......... .......... .......... .......... .......... 69% 52.5M 0s
##   1500K .......... .......... .......... .......... .......... 71%  113M 0s
##   1550K .......... .......... .......... .......... .......... 73%  110M 0s
##   1600K .......... .......... .......... .......... .......... 75% 5.88M 0s
##   1650K .......... .......... .......... .......... .......... 78% 26.6M 0s
##   1700K .......... .......... .......... .......... .......... 80% 1.68M 0s
##   1750K .......... .......... .......... .......... .......... 82% 2.61M 0s
##   1800K .......... .......... .......... .......... .......... 85% 76.5M 0s
##   1850K .......... .......... .......... .......... .......... 87% 82.2M 0s
##   1900K .......... .......... .......... .......... .......... 89% 80.6M 0s
##   1950K .......... .......... .......... .......... .......... 92% 79.9M 0s
##   2000K .......... .......... .......... .......... .......... 94% 84.5M 0s
##   2050K .......... .......... .......... .......... .......... 96% 98.9M 0s
##   2100K .......... .......... .......... .......... .......... 98% 92.3M 0s
##   2150K .......... .......... ...                             100% 94.9M=0.4s
## 
## 2021-07-14 00:07:15 (4.96 MB/s) - ‘CervantesDonQuijote.txt’ saved [2226045/2226045]

List Characters That Occur in the Text

In processing the text to create the vocabulary list, we will need to delete punctuation, numbers, and any other characters that are not letters occurring in the vocabulary words themselves. Otherwise, when punctuation appears adjacent to a word, we could end up with multiple entries for the same word in our vocabulary list, differing only by the adjacent punctuation, such as in the following:

abajo

abajo,

abajo;

abajo.

Here, four entries are created for the single word "abajo" because of the attached punctuation characters. Hence, punctuation must be eliminated from the text in order to create a non-redundant vocabulary list.
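
This collapse can be sketched with the same class-based tr deletion used later in the pipeline; the four variants above reduce to a single entry:

```shell
# Deleting the punctuation character class strips the trailing marks,
# so the four variants of "abajo" collapse to one unique entry.
printf 'abajo\nabajo,\nabajo;\nabajo.\n' | tr -d '[:punct:]' | sort | uniq
# prints a single line: abajo
```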

The following piped commands list the unique characters in the text after standard punctuation has been deleted. The pipe begins with the sed command, short for "stream editor", which edits streams of text; here we use it simply to read the text file while excluding the Project Gutenberg header and footer by printing only the line range between them (the line numbers are explained in the next section). Next, the tr command, short for "translate" because it can replace one character with another, deletes the characters in the punctuation character class, a pre-defined list of punctuation characters invoked with [:punct:]. The command grep, short for "global regular expression print", searches for pattern matches. Its -o parameter, which the command's help describes as "show only the part of a line matching PATTERN", is combined with the regex wildcard ".", which matches any character except the newline. Used this way, grep effectively breaks the text into one character per line of output. The sort command then arranges those characters alphabetically, and uniq reduces the list to the unique characters appearing in the text (minus, of course, the punctuation deleted by tr -d '[:punct:]').

sed -n '29,37699p' CervantesDonQuijote.txt | tr -d '[:punct:]' | grep -o . | sort | uniq
##  
## ¡
## ¿
## «
## »
## —
## 
## 0
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## a
## A
## á
## Á
## à
## b
## B
## c
## C
## d
## D
## e
## E
## é
## É
## f
## F
## g
## G
## h
## H
## i
## I
## í
## Í
## ï
## j
## J
## l
## L
## m
## M
## n
## N
## ñ
## Ñ
## o
## O
## ó
## Ó
## p
## P
## q
## Q
## r
## R
## s
## S
## t
## T
## u
## U
## ú
## Ú
## ü
## v
## V
## W
## x
## X
## y
## Y
## z
## Z

At the top of the output, we can see 5 characters that are not included in the punctuation character class, and thus were not removed:

¡ ¿ « » —

We'll need to specifically name these characters for deletion in order to make the vocabulary list.

Create An Alphabetical Dictionary of Don Quijote

Below, a piped series of commands creates an initial version of the non-redundant, alphabetized vocabulary list. It chains 10 commands, followed by an output redirection:

  1. As described above, the sed command reads the text file and outputs its contents while excluding the extraneous Project Gutenberg text by reading the file at the line right after the header ends and stopping at the line right before the footer begins.
  2. The tr command replaces the space character " " with newlines indicated by \n, effectively separating each word onto its own line.
  3. The sed command is again invoked to remove the ¡ character, which occurs in the Spanish text and is not removed by deleting the punctuation character class '[:punct:]' later in the pipe. sed is used here rather than tr because tr operates on bytes rather than multibyte characters: in UTF-8, ¡ is the byte pair 0xC2 0xA1 and á is 0xC3 0xA1, so deleting ¡ with tr would also strip the 0xA1 byte out of every á, corrupting those words throughout the list.
  4. The tr command is used to convert uppercase letters to lowercase, eliminating redundant entries caused by capitalization of the first letter of a word at the beginning of sentences. Text matching in UNIX is case-sensitive, so, for example, "Dijo" and "dijo" would otherwise be listed as two separate words.
  5. The tr command deletes all digits via the digit character class '[:digit:]'.
  6. The tr command deletes the punctuation characters «»—¿, which were identified as specific to this text and are missed by the standard punctuation character class '[:punct:]'.
  7. The tr command deletes the standard punctuation character class.
  8. The tr command deletes carriage returns, matched with the escape sequence \r.
  9. The sort command arranges the output alphabetically; at this point the list still contains multiple occurrences of identical words.
  10. The uniq command reduces multiple occurrences of the same word to a single instance.
  11. Finally, the > redirects the standard output to a file named CervantesDonQuijoteSpanishWordList.txt. The > operator will overwrite the file if it already exists and will create it if it does not.
sed -n '29,37699p' CervantesDonQuijote.txt | tr " " "\n" | sed 's/¡//g' | tr '[:upper:]' '[:lower:]' | tr -d '[:digit:]' | tr -d "«»—¿" | tr -d '[:punct:]' | tr -d '\r' | sort | uniq > CervantesDonQuijoteSpanishWordList.txt
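
The byte-oriented behavior of tr noted in step 3 can be demonstrated in isolation. This is a minimal sketch, assuming GNU tr and UTF-8 encoded text; the sample string is arbitrary:

```shell
# GNU tr operates on bytes, not multibyte characters. In UTF-8, ¡ is the
# byte pair 0xC2 0xA1 and á is 0xC3 0xA1, so deleting ¡ with tr also
# strips the 0xA1 byte out of every á. sed matches the whole character.
printf '¡hábil!\n' | sed 's/¡//g'        # prints: hábil!
printf '¡hábil!\n' | tr -d '¡' | wc -c   # 7 bytes remain: á lost its second byte
```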

Examine the First and Last 10 Words in the List

Let's use the head and tail commands to examine the first 10 and last 10 words in the list.

head CervantesDonQuijoteSpanishWordList.txt
## 
## a
## á
## abad
## abadejo
## abades
## abadesa
## abaja
## abajan
## abajarse

tail CervantesDonQuijoteSpanishWordList.txt
## zoroástrica
## zorra
## zorras
## zorruna
## zuecos
## zulema
## zumban
## zurdo
## zurrón
## zuzaban

This small sample of words looks good. However, on deeper inspection of the list in a text editor, I identified Roman numerals, and these will need to be removed.

Identify and Remove Roman Numerals

Roman numerals are written with the characters "ivxlcm" (the character "d", for 500, is not needed here since the numerals in this text run no higher than lxxiv), and they appear in our list because the same letters also occur in regular words. To remove Roman numerals, we'll need to find the lines in the vocabulary list that consist solely of the characters "ivxlcm". For this we'll use the grep command with a regular expression anchored at the beginning (^) and end ($) of the line, so that the entire line must be built from the character class [ivxlcm] repeated with *. The following code produces a list of words consisting of the Roman numeral characters:

# This also grabs real words like "mi", "mil", "vi", vil", "civil".
grep "^[ivxlcm]*$" CervantesDonQuijoteSpanishWordList.txt 
## 
## c
## civil
## i
## ii
## iii
## iv
## ix
## l
## li
## lii
## liii
## liv
## lix
## lv
## lvi
## lvii
## lviii
## lx
## lxi
## lxii
## lxiii
## lxiv
## lxix
## lxv
## lxvi
## lxvii
## lxviii
## lxx
## lxxi
## lxxii
## lxxiii
## lxxiv
## mi
## mil
## v
## vi
## vii
## viii
## vil
## x
## xi
## xii
## xiii
## xiv
## xix
## xl
## xli
## xlii
## xliii
## xliv
## xlix
## xlv
## xlvi
## xlvii
## xlviii
## xv
## xvi
## xvii
## xviii
## xx
## xxi
## xxii
## xxiii
## xxiv
## xxix
## xxv
## xxvi
## xxvii
## xxviii
## xxx
## xxxi
## xxxii
## xxxiii
## xxxiv
## xxxix
## xxxv
## xxxvi
## xxxvii
## xxxviii

Upon review of this list, we can see that the real Spanish words "mi", "mil", "vi", "vil", and "civil" are captured by this regular expression. To examine lines of the literary text where a particular word occurs, the following grep command may be used with the -n parameter, which prints the line number before the content of each matching line.

grep -n " vil " CervantesDonQuijote.txt
## 15665:y de tan vil traje vestido. A lo cual el mozo, asiéndole fuertemente de las
## 16032:vuestro bajo y vil entendimiento que el cielo no os comunique el valor que
## 18984:aniquilarlas y ponerlas debajo de las más viles que de algún vil escudero

The real Spanish words may be excluded from the grep results by using a "negative lookahead", written (?!pattern), which fails the match wherever the pattern would succeed. The -P parameter tells grep to use Perl-compatible regular expressions, which support lookaheads.
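
The mechanics can be sketched on a couple of sample lines (assuming a GNU grep built with PCRE support for -P):

```shell
# (?!^mil$) fails the match whenever the whole line is "mil", while the
# rest of the pattern still requires the line to be built from "ivxlcm".
printf 'mil\nxiv\n' | grep -P '(?!^mil$)^[ivxlcm]*$'
# prints only: xiv
```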

Let's see if the numbers add up; i.e., is the number of unique words in the first version of the list, minus the number of Roman numerals, equal to the number of words remaining once the Roman numerals have been removed?

# This command counts the unique words in the first version of the list.
wc -l CervantesDonQuijoteSpanishWordList.txt

# A "negative lookahead" in the regular expression excludes the real words from the Roman numeral matches.
# This command counts the unique Roman numerals in the list.
grep -P '(?!^civil$|^mi$|^mil$|^vil$|^vi$)^[ivxlcm]*$' CervantesDonQuijoteSpanishWordList.txt | wc -l

# This command counts the unique words remaining once the Roman numerals have been removed.
grep -v -P '(?!^civil$|^mi$|^mil$|^vil$|^vi$)^[ivxlcm]*$' CervantesDonQuijoteSpanishWordList.txt | wc -l
## 22974 CervantesDonQuijoteSpanishWordList.txt
## 75
## 22899

Save the Vocabulary List Without Roman Numerals

Now we'll save the final vocabulary list without the Roman numerals as the file CervantesDonQuijoteSpanishWordListNoRomanNumerals.txt. The grep command is used with the -v parameter, which inverts the match so that only non-matching lines are output. In our case, this outputs all words in the first saved iteration of the vocabulary list except the Roman numerals.

grep -v -P '(?!^civil$|^mi$|^mil$|^vil$|^vi$)^[ivxlcm]*$' CervantesDonQuijoteSpanishWordList.txt > CervantesDonQuijoteSpanishWordListNoRomanNumerals.txt

How Many Unique Words Are in Don Quijote?

Finally, we may now answer the question "How many unique words are in Don Quijote?" by using the wc command, short for "word count", with the -l parameter to count lines in the file. Since each line of the file contains exactly one word, the line count equals the word count.

wc -l CervantesDonQuijoteSpanishWordListNoRomanNumerals.txt
## 22899 CervantesDonQuijoteSpanishWordListNoRomanNumerals.txt

There are 22899 unique words in Don Quijote.

Summary

We have taken the text of the literary work Don Quijote by Miguel de Cervantes Saavedra and used a series of UNIX shell commands to process the text into a non-redundant, alphabetized vocabulary list.

Future Work

There are several possibilities for additional text mining using this literary work as a data source.

One possibility is to further reduce the vocabulary list into "lemmas", which are the dictionary or reference forms of words. For example, the words "catch", "catches", "caught", and "catching" are all forms of "catch", and hence all occurrences of these words may be reduced to the single lemma "catch" that encompasses the general meaning of the four word forms.

Another possibility is to perform frequency analysis, whereby one may produce a list of the words that recur most often throughout the book, and a word cloud may be used to create a visual representation of the highest-frequency words.
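
The counting step itself is a short extension of the pipeline already used: once the words are one per line, uniq -c prefixes each unique word with its count, and sort -rn ranks the counts in descending order. A sketch on a sample phrase (the phrase is just illustrative; the full cleaned text would be fed in the same way):

```shell
# Count occurrences of each word, then sort numerically in reverse so
# the most frequent words come first.
printf 'en un lugar de la mancha de cuyo nombre no quiero acordarme\n' |
  tr ' ' '\n' | sort | uniq -c | sort -rn | head -3
# the first output line shows that "de" occurs twice
```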

The frequency of occurrence of multiple words, or word n-grams, may be analyzed rather than the frequency of single words alone. An n-gram is a contiguous sequence of n words, where n is a number. A 2-gram, for example, is also called a bigram (e.g. "full moon" or "rainy day"), a 3-gram is called a trigram (e.g. "simple but elegant"), and so on. There is a whole suite of concepts referring to the various types of multi-word expressions (MWEs), including collocations, verbal idioms, frozen adverbials, particle verbs, complex nominals, etc. N-gram analysis and other more sophisticated analyses may be employed to extract MWEs, although a larger body of literary works may be needed in order to identify less common MWEs.

Finally, if one already has a list of known vocabulary words, then one may subtract one's personal vocabulary from the vocabulary list derived from the literary work, thereby leaving only the unknown or unfamiliar words for study or memorization. Alternatively, the words in common between two literary works may be identified. As yet another example, the words found in a body of text but not present in a dictionary may identify slang or words otherwise missing from the dictionary. These tasks effectively involve set operations such as union, intersection, and complement.
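
These set operations are exactly what the comm command provides for sorted files. A minimal sketch with two hypothetical word lists (the file names and words here are made up for illustration):

```shell
# comm expects sorted input. Its -1, -2, and -3 flags suppress lines
# unique to the first file, unique to the second file, and common to
# both, respectively.
printf 'casa\nperro\n' > known_words.txt
printf 'casa\ngato\n'  > book_words.txt
comm -12 known_words.txt book_words.txt   # intersection: casa
comm -13 known_words.txt book_words.txt   # only in the book: gato
```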

These ideas may be explored in subsequent blog posts.