There are a number of Unix utilities that allow one to do such things as break text files into pieces, combine text files together, extract bits of information from them, rearrange them, and transform their content. Taken together, these Unix tools provide a powerful system for obtaining linguistic information. Here is a brief summary of the relevant tools.
It is often useful to know how much is in a file. This can help to determine whether it contains enough material to be worth the bother, whether it is so large as to require special handling, and whether it is in the expected or desired format. (For example, if a file has very few lines in comparison to the number of words or characters, it may come from a system with different end-of-line conventions.)
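The tool for this is wc, which counts lines, words, and bytes. As a small illustration (the file name is arbitrary):

wc notes.txt
wc -l notes.txt

The first command prints the line, word, and byte counts for notes.txt; the second, because of the -l flag, prints the line count alone.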
Several utilities allow one to cut a file into pieces.
Note that head and tail used in combination allow one to extract any desired contiguous set of lines. For example, the command
head -20 | tail -5

extracts lines 16 through 20.
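If you know sed, the same extraction can be done in a single step; a minimal equivalent of the command above (the file name is a placeholder):

sed -n '16,20p' file

Here -n suppresses sed's default printing of every line, and the p command prints just lines 16 through 20.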
Instead of cutting a file into pieces based purely on the position of the pieces, it is possible to extract material based on its content.
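The standard tool here is grep, which prints just those lines that match a regular expression. For example (the pattern and file name are arbitrary):

grep 'tion$' wordlist

prints every line of wordlist that ends in tion.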
Given two or more files, it is possible to combine them into a single file either "horizontally" or "vertically", or on the basis of the contents of a particular field.
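A sketch of the usual commands, assuming two files a and b: cat combines vertically, paste horizontally, and join on the content of a field (join expects its inputs to be sorted on the join field):

cat a b > both
paste a b > sidebyside
join a b > merged

The first appends the lines of b after those of a; the second places corresponding lines of a and b side by side on a single line; the third pairs up lines of a and b whose first fields match.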
Most frequently we want to rearrange a file on the basis of the content of the pieces, for which we use sort. The standard sort program is very useful, but it is not capable of some of the kinds of sorting that arise in linguistic work. A more powerful sorting program designed specifically for linguistics is msort, which we will look at later when we deal with sorting in more detail. It is occasionally useful, however, to be able to reverse the order of the contents of a file, for which tac is available.
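A few common invocations by way of illustration (file names arbitrary):

sort -f wordlist
sort -k2 pairs
tac log

The first sorts wordlist, with -f making the comparison case-insensitive; the second sorts pairs starting from its second whitespace-separated field; the third prints log with the order of its lines reversed.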
There are a variety of ways of transforming a file in a systematic way. These range from the specialized transformations provided by fold to the very general transformations provided by sed and awk.
To break text into one character per line,

fold -w 1

will do the job. It sets the line length to one character. GNU fold understands Unicode.
awk '{print $3,$1}'

will extract the first and third fields from every input line and print them in reverse order, that is, the third field followed by the first field. GNU awk understands Unicode.
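sed performs the same sort of line-by-line transformation with substitution commands. A minimal example (the substitution is arbitrary):

sed 's/colour/color/g' text

replaces every occurrence of colour in text with color, writing the result to stdout.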
In some cases it is possible to accomplish a task using a single tool, but often it is necessary, or at any rate easier and more efficient, to divide the task up among different tools. One of the strengths of Unix is the fact that it provides unusually good support for using tools together.
Some programs read and write files named on the command line. In this case, if you want one program to read the output of another, you have no choice but to have the second program read from the file created by the first. However, many programs read from the standard input, abbreviated stdin, and write to the standard output, abbreviated stdout. By default, these are associated with the terminal: a program that reads from stdin will read what the user types at the keyboard, and what a program writes on stdout will appear on the terminal. Every process has three i/o streams opened for it automatically when it is created. Two of these are stdin and stdout; the third, which we will not discuss here, is the standard error output, abbreviated stderr. This is a second output stream, provided so that a program's main output can be kept separate from error messages and other commentary.
It is possible to redirect the three default i/o streams. The less than sign < reassociates stdin with a file; the greater than sign > redirects stdout. Thus, a program like wc that reads from stdin and writes on stdout will read its input from file a and write on file b if we use the following command line:

wc < a > b
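Redirection combines naturally with pipes, which connect the stdout of one program to the stdin of the next. A classic sketch of a crude word-frequency count, assuming an input file named text:

tr -sc 'A-Za-z' '\n' < text | sort | uniq -c | sort -rn > freq

tr converts every run of non-letters into a newline, putting one word per line; sort brings identical words together; uniq -c collapses each run of identical lines to a single line prefixed with a count; and the final sort -rn orders the result from most to least frequent, which ends up in the file freq.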
If your work involves several stages, some of which take a significant amount of time, it is desirable not to have to redo more than necessary. If you just run your commands from the command line or assemble them into a shell script, then whenever you change something you either have to rerun the entire process or figure out which parts must be rerun and run those separately by hand. This is tedious and error-prone.
Fortunately, there is an alternative, the make program. make executes the commands necessary to generate specified targets from the files on which they depend. make obtains its instructions from files known as makefiles. If you don't specify what file to use, make looks for a file named makefile in the current directory, then for a file named Makefile. A makefile may also be specified on the command line.
The makefile expresses dependencies among files
and indicates how to generate each file from those it depends on.
Here is a simple makefile:
text.u: text.can
	WeirdFont2Unicode < text.can > text.u

text.can: text
	ReorderWeirdFont < text > text.can
This makefile contains two rules. The first rule says that the target text.u depends on text.can and that the former can be generated from the latter by executing the command WeirdFont2Unicode < text.can > text.u. The second rule says that the target text.can depends on the target text and that text.can can be generated from text by executing the command ReorderWeirdFont < text > text.can.
If we execute make with this makefile in a directory containing the file text but neither text.can nor text.u, make will automatically execute the necessary commands to create text.u. It does this by observing that in order to generate text.u it needs text.can. Since this does not exist, it looks to see if it knows how to create it. In fact, it does, since the second rule tells it how. So make first executes the command ReorderWeirdFont < text > text.can to create text.can, and then executes the command WeirdFont2Unicode < text.can > text.u to create text.u.
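Since make echoes each command before executing it, running make in such a directory would produce output along these lines:

ReorderWeirdFont < text > text.can
WeirdFont2Unicode < text.can > text.u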
Now, suppose that text.can already existed when we ran make. We might think that make would use the existing text.can and just execute the command WeirdFont2Unicode < text.can > text.u to create text.u. In fact, this will happen only if text.can is more up to date than the files it depends on, namely text. If the modification time of text is more recent than that of text.can, make will decide that text.can is out of date and should be replaced. As a result, whenever you modify a file, you need only run make and it will rerun precisely those commands necessary to update the target.
A similar system that offers some advantages over make is makepp, which at present runs only on Unix systems. makepp is backwards compatible with make in the sense that it can use makefiles created for make, but it has additional capabilities. One difference that is particularly useful for linguistic work is that makepp rebuilds when the build rules are changed as well as when files change. When one is writing computer programs, it is typically the case that the relationship between the components is easy to set up and does not change very often. Most of the changes during the development process are in the programs themselves, that is, in the files that must be processed. So for the purposes for which make was originally developed, rebuilding only when files change makes sense. On the other hand, in linguistic data processing, the underlying files are typically data sources that do not change. What changes during development are the commands that process the data sources. Rebuilding automatically when these commands change is therefore a great convenience.
Another way in which separate programs can be made to work together is for one program to execute another. The first program is said to be the parent, the second to be a child of the first. Only some programs can do this, essentially those like awk and python that are programming languages. How to do this depends on the programming language and is not appropriate for discussion here. But you should keep this possibility in mind when using a programming language. To take a simple example, suppose that you need to sort some data. It is possible to write a sorting subroutine in awk, but it is much easier and more efficient to use sort or another specialized sorting program. If you need the sorted output in the midst of an awk program, you can run the sorting program as a child of awk, as in the sketch below.
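In awk, this is done by piping output to a command named by a string, which awk runs as a child process. A minimal sketch (the field manipulation and file name are arbitrary):

awk '{print $2 "\t" $1 | "sort"}' data

Each output line is fed to a single instance of sort, and the sorted result appears once the pipe is closed at the end of the run.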