uni2ascii

This package provides conversion in both directions between UTF-8 Unicode and more than thirty 7-bit ASCII equivalents, including RFC 2396 URI format and RFC 2045 Quoted Printable format, the representations used in HTML, SGML, XML, OOXML, the Unicode standard, Rich Text Format, POSIX portable charmaps, POSIX locale specifications, and Apache log files, and the escapes used for including Unicode in Ada, C, Common Lisp, Java, Pascal, Perl, PostScript, Python, Scheme, and Tcl.

Such ASCII equivalents are useful when including Unicode text in program source, when debugging, and when entering text into web programs that can handle the Unicode character set but are not 8-bit safe. For example, Movable Type, the blog software, truncates posts as soon as it encounters a byte with the high bit set. However, if Unicode is entered in the form of HTML numeric character references, Movable Type will not garble the post.

It also provides ways of converting non-ASCII characters to similar ASCII characters, e.g. by stripping diacritics.

For example, here is the Chinese for "regular expression" in Unicode:

正規表達式

and here is the HTML hexadecimal numeric character reference output from uni2ascii:

&#x6B63;&#x898F;&#x8868;&#x9054;&#x5F0F;
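
The conversion in this example can be sketched in a few lines of Python. This is only an illustration of the output format, not the code uni2ascii itself uses; the function name to_html_hex_ncr and the choice to pad to at least four hex digits are assumptions made for the sketch.

```python
def to_html_hex_ncr(text: str) -> str:
    """Replace each non-ASCII character with an HTML hexadecimal
    numeric character reference; ASCII characters pass through."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):04X};"
        for ch in text
    )

print(to_html_hex_ncr("正規表達式"))
# prints: &#x6B63;&#x898F;&#x8868;&#x9054;&#x5F0F;
```

Because the result contains only 7-bit characters, it survives software that is not 8-bit safe.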

The package consists of two programs: uni2ascii and ascii2uni.

Here is a list of the ASCII representations of Unicode known to me with indications of their usage.

The Unicode escapes handled include:

HTML hexadecimal numeric character references (e.g. &#x00E9;)
HTML decimal numeric character references (e.g. &#0233;)
HTML character entities (e.g. &eacute;)
SGML hexadecimal numeric character references (e.g. \#x00E9;)
SGML decimal numeric character references (e.g. \#0233;)
\u-escaped hexadecimal, as used in Python and Java (e.g. \u00E9)
\u-escaped hexadecimal within the BMP and \U-escaped hexadecimal beyond the BMP (e.g. \u00E9 but \U00010024) as used in Tcl and Scheme
\u-escaped decimal (e.g. \u0233) as used in Rich Text Format
U+-escaped hexadecimal (e.g. U+00E9) as in the Unicode standard
U-escaped hexadecimal (e.g. U00E9)
u-escaped hexadecimal (e.g. u00E9)
U-escaped hexadecimal within angle brackets (e.g. <U00E9>) as used in POSIX locale specifications
\x-escaped hexadecimal (e.g. \x00E9) as used in Tcl for numbers as opposed to characters
\x-escaped hexadecimal with braces (e.g. \x{00E9}) as used in Perl
hexadecimal within single quotes with prefix X (e.g. X'00E9')
RFC 2396 URI format (e.g. %C3%A9)
RFC 2045 Quoted Printable, i.e. =-escaped hexadecimal UTF-8 (e.g. =C3=A9)
\-escaped octal UTF-8 (e.g. \303\251)
Hexadecimal UTF-8 with each byte enclosed in angle brackets (e.g. <C3><A9>)
Standard hexadecimal (e.g. 0x00E9)
Raw hexadecimal (e.g. 00E9)
Common Lisp hexadecimal format (e.g. #x00E9)
Perl v-prefixed decimal format (e.g. v233)
Hexadecimal numbers preceded by "$" (e.g. $00E9)
Hexadecimal numbers preceded by "16#" (e.g. 16#00E9) as in PostScript
Hexadecimal numbers preceded by "#16r" (e.g. #16r00E9) as in Common Lisp
Hexadecimal numbers preceded by "16#" and followed by "#" (e.g. 16#00E9#) as in Ada
OOXML hexadecimal numbers preceded by "_x" and followed by "_" (e.g. _x00E9_)
Hexadecimal numbers preceded by "%u" (e.g. %u00E9)
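
The formats in this list fall into two families: those that write out the code point as a number (U+00E9, \u00E9, and so on) and those that write out the individual bytes of the UTF-8 encoding (%C3%A9, =C3=A9, \303\251). The following Python sketch shows the difference for a handful of the formats. The function and key names are invented for illustration, and the sketch makes no claim about how uni2ascii is implemented.

```python
def code_point_forms(ch: str) -> dict:
    """Formats that spell out the Unicode code point itself."""
    cp = ord(ch)
    return {
        "html_hex": f"&#x{cp:04X};",
        "html_dec": f"&#{cp:04d};",
        "backslash_u": f"\\u{cp:04X}",
        "u_plus": f"U+{cp:04X}",
        "perl_braces": f"\\x{{{cp:04X}}}",
    }

def utf8_byte_forms(ch: str) -> dict:
    """Formats that spell out each byte of the UTF-8 encoding."""
    data = ch.encode("utf-8")
    return {
        "uri": "".join(f"%{b:02X}" for b in data),
        "quoted_printable": "".join(f"={b:02X}" for b in data),
        "octal": "".join(f"\\{b:03o}" for b in data),
        "angle_hex": "".join(f"<{b:02X}>" for b in data),
    }

print(code_point_forms("\u00e9")["u_plus"])  # prints: U+00E9
print(utf8_byte_forms("\u00e9")["uri"])      # prints: %C3%A9
```

The code point U+00E9 is a single number, whereas its UTF-8 encoding is the two bytes C3 A9, which is why the byte-oriented formats show two escapes for one character.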

Microsoft-style HTML character entities and numeric character references without the final semi-colon are converted with a warning message.
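
As a rough illustration of this lenient behaviour, the Python sketch below decodes numeric character references whether or not the final semicolon is present and prints a warning when it is missing. The regular expression, the function name decode_ncrs and the wording of the warning are assumptions for the sketch and are not taken from ascii2uni.

```python
import re
import sys

# Hexadecimal or decimal reference with an optional final semicolon.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+))(;?)")

def decode_ncrs(text: str) -> str:
    """Decode numeric character references, warning on a missing ';'."""
    def repl(m: re.Match) -> str:
        if not m.group(3):
            print(f"warning: no final semicolon in {m.group(0)!r}",
                  file=sys.stderr)
        if m.group(1):
            return chr(int(m.group(1), 16))
        return chr(int(m.group(2)))
    return _NCR.sub(repl, text)

print(decode_ncrs("caf&#xE9; and caf&#233 again"))
# prints: café and café again (with one warning on stderr)
```

Without the semicolon the end of the reference is ambiguous: in this sketch a reference such as &#xE9 followed directly by the letters "abc" would be read as the single hexadecimal number E9ABC.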

The package can also be used to convert from one type of ASCII representation to another by passing through Unicode. For example, the pipeline:

ascii2uni -a U | uni2ascii -a J

will convert from \u-escapes (e.g. \u00e9) to RFC 2396 URI format (e.g. %C3%A9).

ascii2uni -a H | uni2ascii -a D

will convert HTML hexadecimal numeric character references to decimal numeric character references.

ascii2uni -a H | uni2ascii -a H -a Q

will convert HTML hexadecimal numeric character references to HTML character entities where equivalent character entities exist, and

ascii2uni -a M | uni2ascii -a H

will convert SGML hexadecimal numeric character references to HTML hexadecimal numeric character references.
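
The idea behind these pipelines, decoding one ASCII representation into Unicode and then re-encoding the Unicode in another, can be sketched in Python as follows. The two helper functions are hypothetical stand-ins for the two stages of the first pipeline, and they handle only four-digit \u-escapes for non-surrogate characters in the BMP.

```python
import re

def u_escapes_to_unicode(text: str) -> str:
    r"""Stage 1: decode \uXXXX escapes (exactly four hex digits)."""
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

def unicode_to_uri(text: str) -> str:
    """Stage 2: percent-encode the UTF-8 bytes of non-ASCII characters."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in text
    )

print(unicode_to_uri(u_escapes_to_unicode(r"caf\u00e9")))
# prints: caf%C3%A9
```

Going through Unicode means that each representation needs only a decoder and an encoder, not a direct converter to every other representation.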

uni2ascii can also replace non-ASCII characters with approximate ASCII equivalents. For example, it can replace stylistic variants (e.g. bold-face) with their plain counterparts, or accented characters with their unaccented equivalents.
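
One common way to obtain this kind of approximation, shown here purely as an illustration, is Unicode compatibility decomposition (NFKD) followed by removal of combining marks. This is an assumed technique and not necessarily how uni2ascii does it; the function name approximate_ascii is invented for the sketch.

```python
import unicodedata

def approximate_ascii(text: str) -> str:
    """Decompose characters (NFKD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )

# "café" followed by "bold" written in mathematical bold letters
sample = "caf\u00e9 \U0001D41B\U0001D428\U0001D425\U0001D41D"
print(approximate_ascii(sample))
# prints: cafe bold
```

Characters that have no decomposition, such as U+00F8 (o with stroke), pass through this sketch unchanged, so a complete tool needs explicit mappings for them.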

Documentation

uni2ascii and ascii2uni are provided with standard Unix manual pages.

Both programs also provide a detailed summary of their command line options in response to the -h command line option.

Related Programs

If you need to convert between UTF-8 Unicode and other encodings, you may find enca, iconv, recode, and uniconv useful. If you need to convert between textual representations of numbers and machine representations, you may find the programs ascii2binary and binary2ascii helpful. If you need to find out more about what is in a Unicode file (e.g. if you don't know the writing system, don't have the necessary font, think that the Unicode may be ill-formed, or need to examine details of representation such as composition) you may find the Unicode Utilities suite of programs useful.

Details

Language	C [basic programs], Tcl/Tk [GUI]
Environment	POSIX
License	GNU General Public License, version 3
Current version	4.20
Last modified	2019-06-28
Contact	Bill Poser

Downloads

File	Size (Bytes)	MD5 Sum
uni2ascii-4.20.tar.bz2	127,125	a1b1df74cccd1fa997bad79c8c4ced68
uni2ascii-4.20.tar.gz	160,182	096cf1b70a55c4796b136ff1a126a940
uni2ascii-4.20.zip	174,602	3842bcc366ca5b2d98c63c289cc550a2

Packages

Arch Linux: uni2ascii
Debian: Debian package (stable); Debian package (testing); Debian package (unstable)
FreeBSD: Freshport
macOS: MacPorts
macOS: Fink
OpenPackage: OpenPackage
Redhat/Fedora: RPMs for a variety of architectures are available here.
Redhat/Fedora: A source RPM and a binary RPM for the i386 architecture are available here.
SUSE Linux: RPM
Ubuntu: Ubuntu

Environment

uni2ascii and ascii2uni have been compiled and tested under FreeBSD, GNU/Linux, macOS and SunOS. They should compile and run without modification in any POSIX-compliant environment.

Change Log

4.18 - 2011-05-15

Fixed bug in uni2ascii in which the substitution count was too high in certain cases, fixing Debian bug #626268.
Patched to handle situation in NetBSD which lacks getline.
Clarified semantics of pure option as converting characters in ascii range other than space and newline. Fixed bug in which this was not implemented correctly for UTF8 types.

4.17 - 2011-02-16

Added to uni2ascii the following conversions to nearest ascii equivalent: U+2022 bullet to 'o', U+00B7 middle dot to period, U+0085 next line to newline, U+2028 line separator to newline.

4.16 - 2010-12-12

The Q format works again in ascii2uni.
Added U+2033 DOUBLE PRIME to the characters converted to their closest ascii equivalent when using the e format in uni2ascii.

4.15 - 2010-08-29

Renamed endian.h to u2a_endian.h to eliminate conflict with external endian.h.
Removed copy of GNU getline from ascii2uni.c as it is in the standard library as of POSIX.1-2008.

Full Change Log

Roadmap

In a few cases there are ambiguities in parsing the desired strings out of the surrounding material. Depending on the case, either these need to be resolved or the problematic cases documented.

Bugs

ascii2uni contains a bug that affects impure mode conversions of standard hex (-X option). Version 3.9.2 fixes the bug for inputs within the BMP, that is, for hex values less than or equal to 0xFFFF. A more general fix is anticipated.
