Create a Word List

Use UNIX Commands to Create an Alphabetized Word List from a Plain Text File

Introduction

Vocabulary lists are a useful way to identify words of unknown meaning when reading a document. In particular, students of a foreign language may find a vocabulary list useful when preparing to read a literary work. A comprehensive vocabulary list can be quickly created from a plain text file on a computer by using a series of UNIX shell commands. The commands can even generate a list that is alphabetized and non-redundant. Herein I will demonstrate the use of UNIX shell commands to derive such a vocabulary list of Spanish words from the literary work Don Quijote, written by Miguel de Cervantes Saavedra.

Download Plain Text of Cervantes' Don Quijote

First, the book is downloaded in plain text format from the Project Gutenberg website and saved locally in the working directory as the file CervantesDonQuijote.txt. The wget command downloads files over HTTP, HTTPS, or FTP, and its -O parameter specifies the name under which the downloaded file is saved in the working directory of our computer.

wget https://www.gutenberg.org/files/2000/2000-0.txt -O CervantesDonQuijote.txt
## --2021-07-14 00:07:14--  https://www.gutenberg.org/files/2000/2000-0.txt
## Resolving www.gutenberg.org (www.gutenberg.org)... 2610:28:3090:3000:0:bad:cafe:47, 152.19.134.47
## Connecting to www.gutenberg.org (www.gutenberg.org)|2610:28:3090:3000:0:bad:cafe:47|:443... connected.
## HTTP request sent, awaiting response... 200 OK
## Length: 2226045 (2.1M) [text/plain]
## Saving to: ‘CervantesDonQuijote.txt’
## 
##      0K .......... .......... .......... .......... ..........  2%  445K 5s
##     50K .......... .......... .......... .......... ..........  4%  894K 3s
##    100K .......... .......... .......... .......... ..........  6% 22.1M 2s
##    150K .......... .......... .......... .......... ..........  9%  909K 2s
##    200K .......... .......... .......... .......... .......... 11% 37.8M 2s
##    250K .......... .......... .......... .......... .......... 13% 42.0M 1s
##    300K .......... .......... .......... .......... .......... 16% 57.6M 1s
##    350K .......... .......... .......... .......... .......... 18% 1.45M 1s
##    400K .......... .......... .......... .......... .......... 20% 1.57M 1s
##    450K .......... .......... .......... .......... .......... 23% 75.8M 1s
##    500K .......... .......... .......... .......... .......... 25% 66.7M 1s
##    550K .......... .......... .......... .......... .......... 27% 17.3M 1s
##    600K .......... .......... .......... .......... .......... 29% 79.1M 1s
##    650K .......... .......... .......... .......... .......... 32% 36.4M 1s
##    700K .......... .......... .......... .......... .......... 34% 73.4M 1s
##    750K .......... .......... .......... .......... .......... 36% 2.85M 1s
##    800K .......... .......... .......... .......... .......... 39% 1.63M 1s
##    850K .......... .......... .......... .......... .......... 41% 16.7M 0s
##    900K .......... .......... .......... .......... .......... 43% 80.9M 0s
##    950K .......... .......... .......... .......... .......... 46% 54.6M 0s
##   1000K .......... .......... .......... .......... .......... 48% 93.4M 0s
##   1050K .......... .......... .......... .......... .......... 50% 15.9M 0s
##   1100K .......... .......... .......... .......... .......... 52%  128M 0s
##   1150K .......... .......... .......... .......... .......... 55%  104M 0s
##   1200K .......... .......... .......... .......... .......... 57%  136M 0s
##   1250K .......... .......... .......... .......... .......... 59% 17.7M 0s
##   1300K .......... .......... .......... .......... .......... 62% 28.1M 0s
##   1350K .......... .......... .......... .......... .......... 64%  140M 0s
##   1400K .......... .......... .......... .......... .......... 66% 21.8M 0s
##   1450K .......... .......... .......... .......... .......... 69% 52.5M 0s
##   1500K .......... .......... .......... .......... .......... 71%  113M 0s
##   1550K .......... .......... .......... .......... .......... 73%  110M 0s
##   1600K .......... .......... .......... .......... .......... 75% 5.88M 0s
##   1650K .......... .......... .......... .......... .......... 78% 26.6M 0s
##   1700K .......... .......... .......... .......... .......... 80% 1.68M 0s
##   1750K .......... .......... .......... .......... .......... 82% 2.61M 0s
##   1800K .......... .......... .......... .......... .......... 85% 76.5M 0s
##   1850K .......... .......... .......... .......... .......... 87% 82.2M 0s
##   1900K .......... .......... .......... .......... .......... 89% 80.6M 0s
##   1950K .......... .......... .......... .......... .......... 92% 79.9M 0s
##   2000K .......... .......... .......... .......... .......... 94% 84.5M 0s
##   2050K .......... .......... .......... .......... .......... 96% 98.9M 0s
##   2100K .......... .......... .......... .......... .......... 98% 92.3M 0s
##   2150K .......... .......... ...                             100% 94.9M=0.4s
## 
## 2021-07-14 00:07:15 (4.96 MB/s) - ‘CervantesDonQuijote.txt’ saved [2226045/2226045]

List Characters That Occur in the Text

In processing the text to create the vocabulary list, we will need to delete punctuation, numbers, and any other characters that are not letters occurring in the vocabulary words themselves. Otherwise, when punctuation appears adjacent to a word, we could end up with multiple entries for the same word in our vocabulary list, differing only by the adjacent punctuation, such as in the following:

abajo

abajo,

abajo;

abajo.

Here, four entries are created for the single word "abajo" because of the attached punctuation characters. Hence, punctuation must be eliminated from the text in order to create a non-redundant vocabulary list.
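
This collapse can be sketched with the same class-based tr deletion used later in the pipeline; the four variants above reduce to a single entry:

```shell
# Deleting the punctuation character class strips the trailing marks,
# so the four variants of "abajo" collapse to one unique entry.
printf 'abajo\nabajo,\nabajo;\nabajo.\n' | tr -d '[:punct:]' | sort | uniq
# prints a single line: abajo
```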

The following piped commands list the unique characters in the text after standard punctuation has been deleted. The pipe begins with the sed command, short for "stream editor", which edits streams of text; here we use it simply to read the text file while excluding the Project Gutenberg header and footer by printing only the line range between them (the line numbers are explained in the next section). Next, the tr command, short for "translate" because it can replace one character with another, deletes the characters in the punctuation character class, a pre-defined list of punctuation characters invoked with [:punct:]. The command grep, short for "global regular expression print", searches for pattern matches. Its -o parameter, which the command's help describes as "show only the part of a line matching PATTERN", is combined with the regex wildcard ".", which matches any character except the newline. Used this way, grep effectively breaks the text into one character per line of output. The sort command then arranges those characters alphabetically, and uniq reduces the list to the unique characters appearing in the text (minus, of course, the punctuation deleted by tr -d '[:punct:]').

sed -n '29,37699p' CervantesDonQuijote.txt | tr -d '[:punct:]' | grep -o . | sort | uniq
##  
## ¡
## ¿
## «
## »
## —
## 
## 0
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## a
## A
## á
## Á
## à
## b
## B
## c
## C
## d
## D
## e
## E
## é
## É
## f
## F
## g
## G
## h
## H
## i
## I
## í
## Í
## ï
## j
## J
## l
## L
## m
## M
## n
## N
## ñ
## Ñ
## o
## O
## ó
## Ó
## p
## P
## q
## Q
## r
## R
## s
## S
## t
## T
## u
## U
## ú
## Ú
## ü
## v
## V
## W
## x
## X
## y
## Y
## z
## Z

At the top of the output, we can see 5 characters that are not included in the punctuation character class, and thus were not removed:

¡ ¿ « » —

We'll need to specifically name these characters for deletion in order to make the vocabulary list.

Create An Alphabetical Dictionary of Don Quijote

Below, a piped series of commands creates an initial version of the non-redundant, alphabetized vocabulary list. It chains 10 commands, followed by an output redirection:

  1. As described above, the sed command reads the text file and outputs its contents while excluding the extraneous Project Gutenberg text by reading the file at the line right after the header ends and stopping at the line right before the footer begins.
  2. The tr command replaces the space character " " with newlines indicated by \n, effectively separating each word onto its own line.
  3. The sed command is again invoked to remove the ¡ character, which occurs in the Spanish text and is not removed by deleting the punctuation character class '[:punct:]' later in the pipe. sed is used here rather than tr because tr operates on bytes rather than multibyte characters: in UTF-8, ¡ is the byte pair 0xC2 0xA1 and á is 0xC3 0xA1, so deleting ¡ with tr would also strip the 0xA1 byte out of every á, corrupting those words throughout the list.
  4. The tr command is used to convert uppercase letters to lowercase, eliminating redundant entries caused by capitalization of the first letter of a word at the beginning of sentences. Text matching in UNIX is case-sensitive, so, for example, "Dijo" and "dijo" would otherwise be listed as two separate words.
  5. The tr command deletes all digits via the digit character class '[:digit:]'.
  6. The tr command deletes the punctuation characters «»—¿, which were identified as specific to this text and are missed by the standard punctuation character class '[:punct:]'.
  7. The tr command deletes the standard punctuation character class.
  8. The tr command deletes carriage returns, matched with the escape sequence \r.
  9. The sort command arranges the output alphabetically; at this point the list still contains multiple occurrences of identical words.
  10. The uniq command reduces multiple occurrences of the same word to a single instance.
  11. Finally, the > redirects the standard output to a file named CervantesDonQuijoteSpanishWordList.txt. The > operator will overwrite the file if it already exists and will create it if it does not.
sed -n '29,37699p' CervantesDonQuijote.txt | tr " " "\n" | sed 's/¡//g' | tr '[:upper:]' '[:lower:]' | tr -d '[:digit:]' | tr -d "«»—¿" | tr -d '[:punct:]' | tr -d '\r' | sort | uniq > CervantesDonQuijoteSpanishWordList.txt
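
The byte-oriented behavior of tr noted in step 3 can be demonstrated in isolation. This is a minimal sketch, assuming GNU tr and UTF-8 encoded text; the sample string is arbitrary:

```shell
# GNU tr operates on bytes, not multibyte characters. In UTF-8, ¡ is the
# byte pair 0xC2 0xA1 and á is 0xC3 0xA1, so deleting ¡ with tr also
# strips the 0xA1 byte out of every á. sed matches the whole character.
printf '¡hábil!\n' | sed 's/¡//g'        # prints: hábil!
printf '¡hábil!\n' | tr -d '¡' | wc -c   # 7 bytes remain: á lost its second byte
```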

Examine the First and Last 10 Words in the List

Let's use the head and tail commands to examine the first 10 and last 10 words in the list.

head CervantesDonQuijoteSpanishWordList.txt
## 
## a
## á
## abad
## abadejo
## abades
## abadesa
## abaja
## abajan
## abajarse

tail CervantesDonQuijoteSpanishWordList.txt
## zoroástrica
## zorra
## zorras
## zorruna
## zuecos
## zulema
## zumban
## zurdo
## zurrón
## zuzaban

This small sample of words looks good. However, on deeper inspection of the list in a text editor, I identified Roman numerals, and these will need to be removed.

Identify and Remove Roman Numerals

Roman numerals are written with the characters "ivxlcm" (the character "d", for 500, is not needed here since the numerals in this text run no higher than lxxiv), and they appear in our list because the same letters also occur in regular words. To remove Roman numerals, we'll need to find the lines in the vocabulary list that consist solely of the characters "ivxlcm". For this we'll use the grep command with a regular expression anchored at the beginning (^) and end ($) of the line, so that the entire line must be built from the character class [ivxlcm] repeated with *. The following code produces a list of words consisting of the Roman numeral characters:

# This also grabs real words like "mi", "mil", "vi", vil", "civil".
grep "^[ivxlcm]*$" CervantesDonQuijoteSpanishWordList.txt 
## 
## c
## civil
## i
## ii
## iii
## iv
## ix
## l
## li
## lii
## liii
## liv
## lix
## lv
## lvi
## lvii
## lviii
## lx
## lxi
## lxii
## lxiii
## lxiv
## lxix
## lxv
## lxvi
## lxvii
## lxviii
## lxx
## lxxi
## lxxii
## lxxiii
## lxxiv
## mi
## mil
## v
## vi
## vii
## viii
## vil
## x
## xi
## xii
## xiii
## xiv
## xix
## xl
## xli
## xlii
## xliii
## xliv
## xlix
## xlv
## xlvi
## xlvii
## xlviii
## xv
## xvi
## xvii
## xviii
## xx
## xxi
## xxii
## xxiii
## xxiv
## xxix
## xxv
## xxvi
## xxvii
## xxviii
## xxx
## xxxi
## xxxii
## xxxiii
## xxxiv
## xxxix
## xxxv
## xxxvi
## xxxvii
## xxxviii

Upon review of this list, we can see that the real Spanish words "mi", "mil", "vi", "vil", and "civil" are captured by this regular expression. To examine lines of the literary text where a particular word occurs, the following grep command may be used with the -n parameter, which prints the line number before the content of each matching line.

grep -n " vil " CervantesDonQuijote.txt
## 15665:y de tan vil traje vestido. A lo cual el mozo, asiéndole fuertemente de las
## 16032:vuestro bajo y vil entendimiento que el cielo no os comunique el valor que
## 18984:aniquilarlas y ponerlas debajo de las más viles que de algún vil escudero

The real Spanish words may be excluded from the grep results by using a "negative lookahead", written (?!pattern), which fails the match wherever the pattern would succeed. The -P parameter tells grep to use Perl-compatible regular expressions, which support lookaheads.
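
The mechanics can be sketched on a couple of sample lines (assuming a GNU grep built with PCRE support for -P):

```shell
# (?!^mil$) fails the match whenever the whole line is "mil", while the
# rest of the pattern still requires the line to be built from "ivxlcm".
printf 'mil\nxiv\n' | grep -P '(?!^mil$)^[ivxlcm]*$'
# prints only: xiv
```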

Let's see if the numbers add up; i.e., is the number of unique words in the first version of the list, minus the number of Roman numerals, equal to the number of words remaining once the Roman numerals have been removed?

# This command counts the unique words in the first version of the list.
wc -l CervantesDonQuijoteSpanishWordList.txt

# A "negative lookahead" in the regular expression excludes the real words from the Roman numeral matches.
# This command counts the unique Roman numerals in the list.
grep -P '(?!^civil$|^mi$|^mil$|^vil$|^vi$)^[ivxlcm]*$' CervantesDonQuijoteSpanishWordList.txt | wc -l

# This command counts the unique words remaining once the Roman numerals have been removed.
grep -v -P '(?!^civil$|^mi$|^mil$|^vil$|^vi$)^[ivxlcm]*$' CervantesDonQuijoteSpanishWordList.txt | wc -l
## 22974 CervantesDonQuijoteSpanishWordList.txt
## 75
## 22899

Save the Vocabulary List Without Roman Numerals

Now we'll save the final vocabulary list without the Roman numerals as the file CervantesDonQuijoteSpanishWordListNoRomanNumerals.txt. The grep command is used with the -v parameter, which inverts the match so that only non-matching lines are output. In our case, this outputs all words in the first saved iteration of the vocabulary list except the Roman numerals.

grep -v -P '(?!^civil$|^mi$|^mil$|^vil$|^vi$)^[ivxlcm]*$' CervantesDonQuijoteSpanishWordList.txt > CervantesDonQuijoteSpanishWordListNoRomanNumerals.txt

How Many Unique Words Are in Don Quijote?

Finally, we may now answer the question "How many unique words are in Don Quijote?" by using the wc command, short for "word count", with the -l parameter to count lines in the file. Since each line of the file contains exactly one word, the line count equals the word count.

wc -l CervantesDonQuijoteSpanishWordListNoRomanNumerals.txt
## 22899 CervantesDonQuijoteSpanishWordListNoRomanNumerals.txt

There are 22899 unique words in Don Quijote.

Summary

We have taken the text of the literary work Don Quijote by Miguel de Cervantes Saavedra and used a series of UNIX shell commands to process the text into a non-redundant, alphabetized vocabulary list.

Future Work

There are several possibilities for additional text mining using this literary work as a data source.

One possibility is to further reduce the vocabulary list into "lemmas", which are the dictionary or reference forms of words. For example, the words "catch", "catches", "caught", and "catching" are all forms of "catch", and hence all occurrences of these words may be reduced to the single lemma "catch" that encompasses the general meaning of the four word forms.

Another possibility is to perform frequency analysis, whereby one may produce a list of the words that recur most often throughout the book, and a word cloud may be used to create a visual representation of the highest-frequency words.
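
The counting step itself is a short extension of the pipeline already used: once the words are one per line, uniq -c prefixes each unique word with its count, and sort -rn ranks the counts in descending order. A sketch on a sample phrase (the phrase is just illustrative; the full cleaned text would be fed in the same way):

```shell
# Count occurrences of each word, then sort numerically in reverse so
# the most frequent words come first.
printf 'en un lugar de la mancha de cuyo nombre no quiero acordarme\n' |
  tr ' ' '\n' | sort | uniq -c | sort -rn | head -3
# the first output line shows that "de" occurs twice
```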

The frequency of occurrence of multiple words, or word n-grams, may be analyzed rather than the frequency of single words alone. An n-gram is a contiguous sequence of n words, where n is a number. A 2-gram, for example, is also called a bigram (e.g. "full moon" or "rainy day"), a 3-gram is called a trigram (e.g. "simple but elegant"), and so on. There is a whole suite of concepts referring to the various types of multi-word expressions (MWEs), including collocations, verbal idioms, frozen adverbials, particle verbs, complex nominals, etc. N-gram analysis and other more sophisticated analyses may be employed to extract MWEs, although a larger body of literary works may be needed in order to identify less common MWEs.

Finally, if one already has a list of known vocabulary words, then one may subtract one's personal vocabulary from the vocabulary list derived from the literary work, thereby leaving only the unknown or unfamiliar words for study or memorization. Alternatively, the words in common between two literary works may be identified. As yet another example, the words found in a body of text but not present in a dictionary may identify slang or words otherwise missing from the dictionary. These tasks effectively involve set operations such as union, intersection, and complement.
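
These set operations are exactly what the comm command provides for sorted files. A minimal sketch with two hypothetical word lists (the file names and words here are made up for illustration):

```shell
# comm expects sorted input. Its -1, -2, and -3 flags suppress lines
# unique to the first file, unique to the second file, and common to
# both, respectively.
printf 'casa\nperro\n' > known_words.txt
printf 'casa\ngato\n'  > book_words.txt
comm -12 known_words.txt book_words.txt   # intersection: casa
comm -13 known_words.txt book_words.txt   # only in the book: gato
```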

These ideas may be explored in subsequent blog posts.