msort is a program for sorting files in sophisticated ways. It was originally developed for alphabetizing dictionaries of "exotic" languages in formats like those used by Shoebox and Toolbox, for which it has been extensively used, but is useful for many other purposes. msort differs from typical sort utilities in providing greater flexibility in parsing the input into records and identifying key fields and greater control over the sort order. Its main distinctive features are:
msort understands UTF-8 Unicode. Unicode may be used anywhere that text is entered: in the text to be sorted, in sort order and exclusion definitions, as a field or record separator, or as a field tag. Full Unicode case-folding is available.
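As a rough illustration of what full case-folding means for sorting, here is a minimal Python sketch. It uses Python's built-in `str.casefold`, which applies Unicode full case folding; it is not msort's code, and the sample words are arbitrary.

```python
# Illustration only: full case folding maps German sharp s to "ss",
# which simple lowercasing does not.
lowered = "Straße".lower()      # simple lowercasing keeps the sharp s
folded = "Straße".casefold()    # full case folding expands it to "ss"

# Sorting with and without case folding on a mixed-case list.
words = ["cherry", "Banana", "apple"]
plain_order = sorted(words)                     # uppercase letters sort first
folded_order = sorted(words, key=str.casefold)  # case is ignored
```

With case folding, "Banana" sorts between "apple" and "cherry" instead of ahead of both.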
Review by Ben Martin at linux.com
(Japanese translation of the above)
If you are looking for the specialized Hungarian sort program also called msort, try here.
Msort's capabilities are very close to a superset of those of GNU sort and BSD sort. Msort provides greater flexibility in selecting key fields, more comparison types, the ability to use collation rules from different locales on different keys, the ability to handle numbers in non-Western number systems, and a variety of other options lacking in GNU sort and BSD sort. Whereas msort understands Unicode, GNU sort and BSD sort do not. It is a property of the UTF-8 transfer format that a binary sort will sort in Unicode codepoint order, so for some purposes GNU sort will behave in an acceptable manner on Unicode input. However, operations requiring an understanding of the encoding of the input do not work properly in GNU sort and BSD sort with Unicode input. Capabilities of GNU sort and BSD sort lacking in msort are the ability to merge files without sorting them (the --merge option) and the ability to emit only the first of an equal run (the --unique option).
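The codepoint-order property of UTF-8 mentioned above can be checked with a small sketch. This is a Python illustration, not part of msort or of either sort program, and the sample strings are arbitrary. It compares the order obtained by sorting raw UTF-8 bytes with the order obtained by sorting on code points, and shows that UTF-16 does not share the property.

```python
# Illustration only: byte-wise ordering of UTF-8 matches code point ordering.
words = ["zebra", "éclair", "apple", "日本", "\U00010437", "\uffe5", "Ωmega"]

# Order by numeric code point values.
by_codepoint = sorted(words, key=lambda s: [ord(c) for c in s])

# Order by the raw bytes of each encoding, as a binary sort would see them.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

utf8_matches = by_codepoint == by_utf8_bytes
utf16_matches = by_codepoint == by_utf16_bytes
```

Here `utf8_matches` is true, while `utf16_matches` is false because U+10437 is encoded in UTF-16 as a surrogate pair whose bytes sort below those of U+FFE5.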
Generally speaking, msort is the more powerful program, either the only choice or the more convenient choice in cases in which something other than a standard sort of positionally selected fields is required. On the other hand, if GNU sort or BSD sort is capable of doing what you want, it will generally be faster. The exact ratio varies with the details of the sort and the nature of the input, but in my tests, where msort and GNU sort are capable of performing the same sort, GNU sort is typically several times faster than msort. BSD sort seems to be slightly faster than GNU sort.
Item | Component | Notes |
---|---|---|
Language | C | main program |
Language | Tcl/Tk | for GUI only |
Dependencies | TRE regular expression library | required |
Dependencies | ICU - International Components for Unicode | one of ICU or Utf8proc is required |
Dependencies | Utf8proc | one of ICU or Utf8proc is required |
Dependencies | Uninum number conversion library | optional |
Dependencies | GNU MP multiple precision arithmetic library | optional; used by Uninum |
Dependencies | Tcl/Tk version 8.3 or higher | for GUI only |
Dependencies | Iwidgets (Tcl/Tk library) | for GUI only |
License | GNU General Public License, Version 3 | |
Current version | 8.53 | |
Last modified | 2010-01-10 | |
A standard Unix manual page is included in the package, or you can read it here. The full documentation is the reference manual (PDF), a copy of which is included in the package.
The manual contains a number of examples, including how to use msort to sort SIL Standard Dictionary Format files as used by Shoebox and Toolbox.
File | Size (Bytes) | MD5 Sum |
---|---|---|
msort-8.53.tar.bz2 | 440,307 | 01e78967b4e4197f867831f8c8f4c48d |
msort-8.53.tar.gz | 476,722 | a6468fbb8503bb52331994f96eb7b54c |
msort-8.53.zip | 535,715 | 255966cfcf0470de93572e4f714707f8 |
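A downloaded archive can be checked against the MD5 sums in the table. The following is a minimal Python sketch using the standard `hashlib` module; the function names `md5_of_file` and `verify_download` are made up for this example, and it assumes the archive has already been saved locally.

```python
import hashlib
import os

# Checksums copied from the download table above.
EXPECTED_MD5 = {
    "msort-8.53.tar.bz2": "01e78967b4e4197f867831f8c8f4c48d",
    "msort-8.53.tar.gz": "a6468fbb8503bb52331994f96eb7b54c",
    "msort-8.53.zip": "255966cfcf0470de93572e4f714707f8",
}

def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, filename=None):
    """Return True if the file's MD5 matches the table entry for filename.

    filename defaults to the base name of path.
    """
    name = filename or os.path.basename(path)
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify_download("msort-8.53.tar.gz")` returns `True` only if the file in the current directory has the listed checksum.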
If you would like to be notified of new releases, subscribe to msort at Freshmeat.
The underlying command-line program msort should compile and run without difficulty on any POSIX-conformant system on which the requisite libraries are available. In practice, this should mean just about anywhere. It is known to compile and run without modification under GNU/Linux, FreeBSD, Mac OS X, and SunOS. I am not sure whether the current version will compile and run properly under MS Windows, even under Cygwin, because MS Windows uses UTF-16 Unicode internally while msort expects UTF-32.
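The practical difference between UTF-16 and UTF-32 can be seen by counting code units. This Python sketch is only an illustration of the two encodings and says nothing about msort's internals; the helper names `utf16_units` and `utf32_units` are made up for the example.

```python
# Illustration only: the same character counted in UTF-16 and UTF-32 code units.
def utf16_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def utf32_units(text):
    """Number of 32-bit code units needed to encode text as UTF-32."""
    return len(text.encode("utf-32-le")) // 4

bmp_char = "\u00e9"          # inside the Basic Multilingual Plane
astral_char = "\U00010437"   # outside the Basic Multilingual Plane
```

In UTF-32 every character is one unit, so `utf32_units(astral_char)` is 1, whereas `utf16_units(astral_char)` is 2 because the character needs a surrogate pair. Code that assumes one unit per character therefore cannot be handed UTF-16 data unchanged.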
Note also that msort may be configured to compile without the GMP and Uninum libraries, at the cost of forgoing the ability to handle numbers in non-Western number systems. If you cannot or do not want to install these libraries, run configure with the option --disable-uninum. This will also disable linkage with libgmp.
The graphical user interface should run anywhere that Tcl/Tk is available, but a few features may not work on non-Unix systems. In particular, the Abort Sort command depends on the existence of a Unix-style kill program that can be used to send a signal to another process. It is known to run under GNU/Linux, FreeBSD, and SunOS. msg will run properly under Mac OS X if you have installed X11 and use Tk-X11. msg now adapts itself to Tk-Aqua sufficiently well as to be usable, but some details remain to be dealt with.
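The dependence of Abort Sort on a Unix-style kill can be pictured with the following Python sketch for a POSIX system. It is not the code used by msg, which is written in Tcl/Tk. The child process merely stands in for a long-running sort, and the choice of SIGTERM (the default signal of the kill program) is an assumption, since which signal msg sends is not stated here.

```python
import os
import signal
import subprocess
import sys

def start_long_job():
    """Start a child process that stands in for a long-running sort."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )

def abort_job(proc):
    """Send SIGTERM to the child (an assumed signal) and wait for it to exit.

    On POSIX systems the returned code is the negated signal number.
    """
    os.kill(proc.pid, signal.SIGTERM)
    return proc.wait(timeout=10)
```

On a system without Unix-style signals there is no direct equivalent of this call, which is why the feature may not work there.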
The GUI requires both the basic Tcl/Tk distribution and the iwidgets library. If you already have Tcl/Tk and just need to add iwidgets, you can obtain the package from the Sourceforge project site. On the download page you will find source and binary packages for both [incr Tcl/Tk], which is the basic part of this package, and [incr widgets], which is the part that contains the widgets. You will need to install both. (iwidgets is an alternative name for [incr widgets].)
The easiest way to obtain the Tcl/Tk environment you need is to install the ActiveTcl distribution from ActiveState. This distribution provides the Tcl language, the Tk graphics library, and a bunch of extensions, including [incr tcl] and [incr widgets]. Don't be concerned by the fact that ActiveState is a commercial outfit. The Tcl/Tk distribution that they provide is free as in both beer and speech. They make their money selling services and programming tools. The ActiveTcl distribution is currently available for: GNU/Linux, HP-UX, AIX, Solaris, Mac OS X, and MS Windows.
For FreeBSD, Tcl and Tk are available at:
Under obscure conditions date sorts may produce a segmentation fault or valid date fields may be rejected as invalid. I have been unable to reproduce this bug on my own system. It may or may not be significant that the machine on which this bug has been reported is a 64-bit machine.
Known bugs in the GUI are:
If you care about any of these, please feel free to drop me a line.