The Poor Man's Web Dictionary

It is possible to create usable web-based multimedia dictionaries using pure HTML. If the dictionary is small enough, this can be done entirely by hand. If it is larger, it becomes tedious to generate by hand, but the programming necessary to generate it from a database is quite simple.

To demonstrate this, I provide the code for a simple set of programs that generate a web-based lexicon from a database, together with a sample of the result. The programs take as input a dictionary database in a simple format and generate from it a web-based dictionary. This software is not intended to compete with more sophisticated approaches, such as Kirrkirr or HyperLex, but rather to demonstrate how one can generate a usable web-based dictionary using only the most trivial computer programming. It does not provide sophisticated search tools, and it makes no attempt to handle exotic writing systems. However, it is more than a theoretical demonstration. For languages written in ASCII characters, for lexica of only a few thousand words, and where fancy searches are not required, the dictionaries it generates are perfectly usable.

It is generally easiest to download all of the files at once, in which case you will get a compressed tar archive. Download pmwd.tgz. If you have GNU tar, you can decompress and unpack it with the single command:

tar xzf pmwd.tgz

If you have a version of tar that does not know how to decompress such archives, you will have to decompress first using gunzip. Then use tar without the z flag to unpack. On Microsoft Windows systems I am told that WinZip can unpack compressed tar archives.

If for some reason you cannot deal with the compressed tar archive, you can also download the files individually. See the descriptions of the files below. The entry for each file contains a link allowing you to download it.

The files provided include a sample dictionary. To look at it, open the file sdtop.htm in your browser. The browser window will be divided into two parts known as frames. The upper frame, which will occupy most of the browser window, will contain the index to the dictionary, that is, a list of the words in alphabetical order. If the list is too long to fit into the frame at once, you can use the scrollbar to show other parts of it. Each of these words is a link. Clicking on a word will cause the definition to be displayed in the smaller frame at the bottom of the screen. Try clicking on tsachun. Notice that the definition is followed by the words "show picture". Click on them to see the picture. Now try clicking on hoonliz. Notice that the definition is followed by the words "play sound". Click on them to hear the word.
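
As an illustration of how little HTML a two-frame layout like this requires, here is a minimal Python sketch that produces a frame definition page and one index link. It is not the code in makedtop.awk or makeind.awk. The frame names, the 75%/25% split, and the use of a named anchor inside defs.htm are all assumptions made for the example.

```python
import html


def frameset_page(title, index_file="index.htm", initial_file="null.htm"):
    """Return a frame definition page: a large index frame over a small one.

    Assumption: the frame names "index" and "defs" and the 75%/25% split
    are invented for this sketch.
    """
    return (
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        '<frameset rows="75%,25%">\n'
        f'<frame src="{index_file}" name="index">\n'
        f'<frame src="{initial_file}" name="defs">\n'
        "</frameset></html>\n"
    )


def index_entry(headword, defs_file="defs.htm"):
    """Return one index line: a link that loads the definition in the lower frame.

    Assumption: each definition in defs_file carries an anchor named after
    its headword, and headwords are plain ASCII without spaces.
    """
    word = html.escape(headword)
    return f'<a href="{defs_file}#{word}" target="defs">{word}</a><br>'


print(frameset_page("Carrier Dictionary"))
print(index_entry("tsachun"))
```

The target attribute on the link is what sends the definition to the lower frame instead of replacing the index.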

The lexical database is assumed to be in the format used by the Summer Institute of Linguistics Shoebox program, since this is very widely used. Records are separated by blank lines. Each field begins with a backslash followed by the tag that identifies the field. The tag is followed by one or more spaces and then the contents of the field. The headword should be in a field with the tag head. The definition should be in a field with the tag def. Four optional fields are also used. The tag cat specifies the category of the word. The tag sci gives the scientific name of a biological organism. The tag snd gives the name of a sound file containing the headword. The tag pic gives the name of an image file illustrating the headword. Your records may contain additional fields. They will simply be ignored. Here is a sample of what such a file might look like:

\head duchun
\def tree, stick, wood in general

\head tsachun
\cat N
\def cache for storing food in the form of a little cabin on posts
\pic tsachun.jpg

\head hoonliz
\cat N
\def skunk
\sci Mephitis mephitis
\snd hoonliz.wav
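
To make the record format concrete, here is a minimal Python sketch that reads text in this format into a list of dictionaries, one per record. It is an illustration only and not part of the distributed software. The function name parse_records is hypothetical, and the sketch assumes that every field fits on one line; lines that do not begin with a backslash are skipped.

```python
import re


def parse_records(text):
    """Split Shoebox-style text into a list of {tag: contents} dictionaries."""
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
                current = {}
            continue
        # A field is a backslash, a tag, one or more spaces, then the contents.
        match = re.match(r"\\(\S+)\s*(.*)", line)
        if match:
            current[match.group(1)] = match.group(2).strip()
    if current:
        records.append(current)
    return records


SAMPLE = r"""\head duchun
\def tree, stick, wood in general

\head tsachun
\cat N
\def cache for storing food in the form of a little cabin on posts
\pic tsachun.jpg
"""

for record in parse_records(SAMPLE):
    print(record["head"], "->", record["def"])
```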

The software is most easily run on a GNU/Linux system. If you have such a system with msort installed, all you need to do is make a copy of your database file named lexicon.ldb, edit the file language so that it contains the name of your language, and type make. The make program will then follow the instructions in the file makefile and generate the HTML files.

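The HTML files generated are:

  dtop.htm, the frame definition file
  index.htm, the list of words in alphabetical order
  defs.htm, the dictionary definitions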

To use these files, just open dtop.htm in your browser.

If you do not have msort but can get your lexicon into the desired order in some other way, after renaming your lexicon database lexicon.ldb, make a copy of it called lexicon.srt. Then give the command:

touch lexicon.srt

This will make it look like lexicon.srt was created more recently than lexicon.ldb, so the make program will just use lexicon.srt instead of trying to generate it from lexicon.ldb.
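
The timestamp comparison that make relies on here can be sketched in a few lines of Python. This is a simplified illustration of the idea, not a description of how make is implemented, and the function names are hypothetical.

```python
import os


def needs_rebuild(target, source):
    """Return True if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)


def touch(path):
    """Set the modification time of path to now, creating it if needed."""
    with open(path, "a"):
        pass
    os.utime(path, None)
```

After touch("lexicon.srt"), needs_rebuild("lexicon.srt", "lexicon.ldb") returns False as long as lexicon.ldb has not been modified since, which is why make leaves lexicon.srt alone.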

If you do not have access to make, you can just give the necessary commands by hand, assuming that you have awk:


  1. Rename your lexicon file lexicon.ldb.
  2. If you do not have msort but have sorted your lexicon file, make a copy called lexicon.srt and then give the command: touch lexicon.srt.
  3. Edit the file language so that it contains, on a single line, the name of your language.
  4. Give the command: awk -f makedtop.awk > dtop.htm
  5. Give the command: awk -f xform.awk < lexicon.ldb > lexicon.srt
  6. Give the command: awk -f gethead.awk < lexicon.srt > headers.srt
  7. Give the command: awk -f makeind.awk < headers.srt > index.htm
  8. Give the command: awk -f makedefs.awk < lexicon.srt > defs.htm

Depending on the kind of system you are using, you may have to go about executing awk differently. In the above, a filename following a less than sign is input to awk; a filename following a greater than sign is output from awk. Also recall that on some systems the newer version of awk is called nawk or gawk.
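
If your system has awk and Python but no shell that understands the redirection signs, the awk commands listed above could be run by a small script instead. The following is a minimal sketch under that assumption. The helper names are hypothetical, and the step that takes no input file is given an empty input on the assumption that makedtop.awk does not read standard input.

```python
import subprocess

# (awk program, input file or None, output file), in the order given above.
STEPS = [
    ("makedtop.awk", None, "dtop.htm"),
    ("xform.awk", "lexicon.ldb", "lexicon.srt"),
    ("gethead.awk", "lexicon.srt", "headers.srt"),
    ("makeind.awk", "headers.srt", "index.htm"),
    ("makedefs.awk", "lexicon.srt", "defs.htm"),
]


def command_line(script, infile, outfile, awk="awk"):
    """Return the equivalent shell command, for display only."""
    parts = [awk, "-f", script]
    if infile:
        parts += ["<", infile]
    parts += [">", outfile]
    return " ".join(parts)


def run_step(script, infile, outfile, awk="awk"):
    """Run one awk program with its input and output redirected to files."""
    stdin = open(infile, "rb") if infile else subprocess.DEVNULL
    try:
        with open(outfile, "wb") as out:
            subprocess.run([awk, "-f", script], stdin=stdin, stdout=out, check=True)
    finally:
        if infile:
            stdin.close()


def build(awk="awk"):
    """Run every step in order, printing each command first."""
    for step in STEPS:
        print(command_line(*step, awk=awk))
        run_step(*step, awk=awk)
```

Call build() from the directory that contains the files, or build(awk="gawk") if that is what awk is called on your system.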

The bulk of the work is done by small programs written in AWK, a small pattern-matching language designed for processing text files. More information about AWK is available here.

The other piece of software that you need is a sorting program that is capable of sorting the lexicon database file. Many sorting programs cannot do this because they can only handle single lines. The program used here, msort, is my own sophisticated sorting program. The program and the manual can be downloaded from my web page. However, msort is only available for UNIX systems. If you are on a non-UNIX system, you will have to find some other way to get your lexicon file into the order desired.
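
For those without msort, one possible way to sort multi-line records is sketched below in Python. It is only an illustration and does not reproduce msort. The sort order is an assumption: headwords are compared without regard to case, and hyphens that are not word-final are ignored, on the model of the exclusions file described below. Replace sort_key with whatever ordering suits your language.

```python
import re


def headword(record):
    """Return the contents of the head field of one record, or ''."""
    match = re.search(r"^\\head[ \t]+(.*)$", record, re.MULTILINE)
    return match.group(1).strip() if match else ""


def sort_key(record):
    # Assumption: case-insensitive comparison, ignoring any hyphen that is
    # followed by another character (initial and medial hyphens).
    return re.sub(r"-(?=.)", "", headword(record).lower())


def sort_lexicon(text):
    """Sort blank-line-separated records by headword and return the text."""
    text = text.replace("\r\n", "\n")
    records = [r.strip("\n") for r in re.split(r"\n[ \t]*\n", text) if r.strip()]
    return "\n\n".join(sorted(records, key=sort_key)) + "\n"
```

Each record is treated as a single unit, so all of its fields travel together when the records are reordered.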

Files Provided

Sample Files

hoonliz.wav
A sound file for use with the sample lexicon. The speaker is Mary John, Sr. OAC.

pmi51.htm
A file used for showing the image in tsachun.jpg.

sdefs.htm
The sample file containing the dictionary definitions and other content.

sdtop.htm
The sample frame definition file.

sindex.htm
The sample file containing the list of words in alphabetical order.

slexicon.ldb
This is a small lexicon file in the correct format that you can use to try out the software. Just make a copy of it named lexicon.ldb. The words in it are in the Saik'uz (Stony Creek) dialect of Carrier, an Athabaskan language of central interior British Columbia. This is the dialect spoken in the vicinity of Vanderhoof.

tsachun.jpg
An image file for use with the sample lexicon.

Program-Related Files

exclusions
This file is used by the sorting program msort. It tells msort to ignore hyphens in initial and medial position when sorting.

gethead.awk
The AWK program that extracts headwords from the sorted lexicon file and generates headers.srt, the sorted list of headwords from which the index file index.htm is generated.

language
This file should contain a single line with the name of the language for which you are creating a dictionary. This information is used to create appropriate titles for the HTML files. You should edit this file to adapt it to your language.

makedefs.awk
The AWK program that generates the definition file defs.htm.

makedtop.awk
The AWK program that generates the frame definition file dtop.htm.

makefile
The file that tells the make program what to do. It specifies how one file depends on another and what commands need to be executed to create each piece of the system.

makeind.awk
The AWK program that generates the index file index.htm.

null.htm
This is a well-formed HTML file with no actual content. It is used as the initial content of the definition frame, so that until a word is chosen, nothing appears there.

xform.awk
The AWK program that converts the raw input into a form more amenable to further processing. It does two things. First, the Shoebox format uses backslash as a field separator. This is a major pain in the neck in the UNIX world, where backslash has a special meaning to many programs. Rather than having to worry about backslashes over and over again, it is easiest just to translate the backslashes at the outset into a character that is less troublesome. In this case backslashes are translated into percent signs, since that is an unproblematic character to deal with and is not very likely to occur otherwise in dictionary files. Second, since most people use Microsoft Windows systems, there is a good chance that lexicon files will come from such systems. Microsoft Windows (and DOS) use a sequence of two characters to end lines, namely a carriage return and a newline (reflecting the fact that these were separate operations on teletype machines). UNIX systems use just a newline character for this purpose. The carriage returns look, to UNIX programs, like any other character, and so gum up the works. For example, the blank lines that separate records will not be blank because they will each contain a carriage return. So, as a precautionary measure, this program removes carriage return characters. If they have already been removed, this won't hurt anything. (Note to Macintosh users: on the Macintosh, the end of a line is marked by a single character, but it is carriage return, not newline. If you find that you are having problems, it may be because the way in which you transferred files to a UNIX system did not automatically translate carriage returns into newlines. When the carriage returns are stripped out, the effect is to make your entire file into a single huge line, which is not what you want. You can solve this problem by changing xform.awk so that it converts carriage returns into newline characters instead of deleting them. A line that does this is included in the file, commented out. Just remove the crosshatch comment character from the beginning of the line and put one at the beginning of the line that deletes the carriage returns to comment that line out.)
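
The two transformations described for xform.awk are simple enough to show in a few lines. The following Python sketch illustrates the same idea; it is not a copy of xform.awk, and the function names and the mac_line_endings option are hypothetical. The latin-1 encoding is an assumption chosen so that any byte in the file passes through unchanged.

```python
def xform(text, mac_line_endings=False):
    """Remove carriage returns and turn backslashes into percent signs.

    With mac_line_endings=True, carriage returns are converted into
    newlines instead of being deleted.
    """
    if mac_line_endings:
        text = text.replace("\r", "\n")
    else:
        text = text.replace("\r", "")
    return text.replace("\\", "%")


def xform_file(src, dst, mac_line_endings=False):
    """Apply xform to the file src and write the result to dst."""
    # newline="" stops Python from translating line endings on its own,
    # so the carriage returns reach xform untouched.
    with open(src, newline="", encoding="latin-1") as infile:
        text = infile.read()
    with open(dst, "w", newline="", encoding="latin-1") as outfile:
        outfile.write(xform(text, mac_line_endings))


print(xform("\\head duchun\r\n\\def tree, stick, wood in general\r\n"))
```

Running the function on text that has already been converted leaves it unchanged, which matches the observation that the precaution does no harm.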