This package provides conversion in both directions between UTF-8 Unicode and more than thirty 7-bit ASCII equivalents, including RFC 2396 URI format and RFC 2045 Quoted Printable format, the representations used in HTML, SGML, XML, OOXML, the Unicode Standard, Rich Text Format, POSIX portable charmaps, POSIX locale specifications, and Apache log files, and the escapes used for including Unicode in Ada, C, Common Lisp, Java, Pascal, Perl, PostScript, Python, Scheme, and Tcl.
Such ASCII equivalents are useful when including Unicode text in program source, when debugging, and when entering text into web programs that can handle the Unicode character set but are not 8-bit safe. For example, Movable Type, the blog software, truncates posts as soon as it encounters a byte with the high bit set. However, if Unicode is entered in the form of HTML numeric character references, Movable Type will not garble the post.
It also provides ways of converting non-ASCII characters to similar ASCII characters, e.g. by stripping diacritics.
For example, here is the Chinese for regular expression in Unicode:
正規表達式
and here is the HTML hexadecimal numeric character reference output from uni2ascii:
&#x6B63;&#x898F;&#x8868;&#x9054;&#x5F0F;
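The hexadecimal numeric character reference format used in this example can be sketched in a few lines of Python. This only illustrates the output format; it is not how uni2ascii itself is implemented, and the function name `to_html_hex_ncr` is invented for the sketch.

```python
def to_html_hex_ncr(text):
    """Replace each non-ASCII character with an HTML hexadecimal
    numeric character reference; ASCII passes through unchanged."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};"
        for ch in text
    )


print(to_html_hex_ncr("正規表達式"))
# prints &#x6B63;&#x898F;&#x8868;&#x9054;&#x5F0F;
```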
The package consists of two programs: uni2ascii and ascii2uni.
Here is a list of the ASCII representations of Unicode known to me with indications of their usage.
The Unicode escapes handled include:
Microsoft-style HTML character entities and numeric character references without the final semi-colon are converted with a warning message.
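As a rough illustration of this lenient behaviour, the Python sketch below decodes numeric character references whether or not the final semicolon is present, and warns when it is missing. The function name, the regular expression, and the warning text are all assumptions made for the example; they do not reproduce ascii2uni's actual code or messages.

```python
import re
import warnings

# &#xHHHH or &#DDDD, with an optional trailing semicolon.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+))(;?)")


def decode_ncrs(text):
    """Decode numeric character references, tolerating a missing ';'."""
    def repl(m):
        hex_digits, dec_digits, semicolon = m.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code > 0x10FFFF:
            return m.group(0)  # not a valid code point: leave untouched
        if not semicolon:
            warnings.warn(f"reference {m.group(0)!r} lacks a final semicolon")
        return chr(code)
    return _NCR.sub(repl, text)
```

Without the semicolon a reference is ambiguous when it is directly followed by more digits, so this sketch simply takes the longest run of digits.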
The package can also be used to convert from one type of ASCII representation to another by passing through Unicode. For example, the pipeline:
ascii2uni -a U | uni2ascii -a J
will convert from \u-escapes (e.g. \u00e9) to RFC 2396 URI format (e.g. %C3%A9).
ascii2uni -a H | uni2ascii -a D
will convert HTML hexadecimal numeric character references to decimal numeric character references.
ascii2uni -a H | uni2ascii -a H -a Q
will convert HTML hexadecimal numeric character references to HTML character entities where equivalent character entities exist, and
ascii2uni -a M | uni2ascii -a H
will convert SGML hexadecimal numeric character entities to HTML.
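The idea behind these pipelines, decoding one ASCII representation to Unicode and then re-encoding it in another, can be sketched in Python. The four helper names below are hypothetical, and the sketch covers only two of the conversions above; it handles four-digit \u escapes without surrogate pairs and does no validation of out-of-range values.

```python
import re
from urllib.parse import quote


def u_escapes_to_unicode(text):
    r"""Decode \uXXXX escapes (exactly four hex digits) to characters."""
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)


def unicode_to_uri(text):
    """Percent-encode the UTF-8 bytes of each non-ASCII character."""
    return "".join(ch if ord(ch) < 0x80 else quote(ch) for ch in text)


def hex_ncrs_to_unicode(text):
    """Decode HTML hexadecimal numeric character references."""
    return re.sub(r"&#[xX]([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)


def unicode_to_decimal_ncrs(text):
    """Encode each non-ASCII character as a decimal reference."""
    return "".join(ch if ord(ch) < 0x80 else f"&#{ord(ch)};" for ch in text)


# Each conversion is a decode step followed by an encode step.
print(unicode_to_uri(u_escapes_to_unicode(r"caf\u00e9")))         # caf%C3%A9
print(unicode_to_decimal_ncrs(hex_ncrs_to_unicode("caf&#xE9;")))  # caf&#233;
```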
uni2ascii can also replace non-ASCII characters with approximate ASCII equivalents. For example, it can replace stylistic variants (e.g. bold-face) with their plain counterparts, or accented characters with their unaccented equivalents.
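One common way to get a similar effect, shown here only as a sketch, is Unicode compatibility decomposition followed by removal of combining marks. This is an assumption about technique for illustration: uni2ascii uses its own options and tables, and its results may differ. The function name `approximate_ascii` is invented.

```python
import unicodedata


def approximate_ascii(text):
    # NFKD splits accented letters into a base letter plus combining marks
    # and maps compatibility variants (e.g. mathematical bold) to plain forms.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (general category Mn).
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


print(approximate_ascii("caf\u00e9"))             # cafe
print(approximate_ascii("\U0001D400\U0001D401"))  # AB (from bold A, bold B)
```

Characters that have no decomposition, such as ø or ß, pass through this sketch unchanged, so the result is not guaranteed to be pure ASCII.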
uni2ascii and ascii2uni are provided with standard Unix manual pages:
Both programs also provide a detailed summary of their command line options in response to the -h command line option.
If you need to convert between UTF-8 Unicode and other encodings, you may find enca, iconv, recode, and uniconv useful. If you need to convert between textual representations of numbers and machine representations, you may find the programs ascii2binary and binary2ascii helpful. If you need to find out more about what is in a Unicode file (e.g. if you don't know the writing system, don't have the necessary font, think that the Unicode may be ill-formed, or need to examine details of representation such as composition) you may find the Unicode Utilities suite of programs useful.
Language | C [basic programs], Tcl/Tk [GUI] |
Environment | POSIX |
License | GNU General Public License, version 3 |
Current version | 4.20 |
Last modified | 2019-06-28 |
Contact | Bill Poser |
File | Size (Bytes) | MD5 Sum |
---|---|---|
uni2ascii-4.20.tar.bz2 | 127,125 | a1b1df74cccd1fa997bad79c8c4ced68 |
uni2ascii-4.20.tar.gz | 160,182 | 096cf1b70a55c4796b136ff1a126a940 |
uni2ascii-4.20.zip | 174,602 | 3842bcc366ca5b2d98c63c289cc550a2 |
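A downloaded archive can be checked against the MD5 sums in the table with any MD5 tool. The following is a minimal Python sketch; the function name `md5_of_file` is invented, and it assumes the archive has already been saved locally.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file("uni2ascii-4.20.tar.bz2")` should return `a1b1df74cccd1fa997bad79c8c4ced68` if the download is intact.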
uni2ascii and ascii2uni have been compiled and tested under FreeBSD, GNU/Linux, macOS and SunOS. They should compile and run without modification in any POSIX-compliant environment.
ascii2uni contains a bug that affects impure mode conversions of standard hex (-X option). Version 3.9.2 fixes the bug for inputs within the BMP, that is, for hex values less than or equal to 0xFFFF. A more general fix is anticipated.