Character Encoding

Character Encoding Schemes

Each letter, number, punctuation character and control character used on a computer has a binary value associated with it. Over the years, computer developers have defined a number of schemes for encoding characters, and unfortunately the schemes encode characters differently. The letters and numbers on a U.S. keyboard are not normally a problem in North America, because translating these characters between encoding schemes is straightforward. However, the other characters on the keyboard, and those not on a U.S. keyboard, can be problematic because sometimes the target scheme has no corresponding character. Examples of data that works in its own environment but can cause problems in other environments include:

  • "null" (a binary value of zero, used for initialization on a mainframe),
  • the "tab" character (e.g. written as a separator in a Word document and communicated in XML) and
  • accented letters (e.g. the Spanish eñe character "Ñ").

The following is a summary of some of the more common encoding schemes:

EBCDIC: Extended Binary Coded Decimal Interchange Code:

  • Is used on many mainframe computers.
  • Currently has a number of variants, each normally tailored to the country of use, e.g. EBCDIC-US has a "$" character and EBCDIC-Spain has a "Ñ". Numbers and English letters always use the same hexadecimal values, but the hex values used by the extra characters have different meanings depending on the variant. The default in IBM Enterprise COBOL in the U.S. is EBCDIC 1140.
  • Uses 8 binary bits for each character.
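
The variant behaviour described above can be observed with the EBCDIC codecs that ship with Python. This is only an illustrative sketch: the choice of code pages 037, 500 and 1140 and the helper name `ebcdic_bytes` are assumptions made for the example, not part of any AAMVA specification.

```python
# Code pages bundled with Python; the selection is an assumption for illustration.
EBCDIC_VARIANTS = ["cp037", "cp500", "cp1140"]

def ebcdic_bytes(text):
    """Return {codec name: hex string} showing how each variant encodes text."""
    return {name: text.encode(name).hex() for name in EBCDIC_VARIANTS}

# Letters and digits: every variant produces the same bytes (c1 f1).
print(ebcdic_bytes("A1"))

# A square bracket is one of the "extra" characters: cp037 and cp500
# place it at different hex values.
print(ebcdic_bytes("["))
```

Bytes written under one variant and read under another will therefore keep their letters and digits but may silently change their punctuation.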

ASCII: American Standard Code for Information Interchange:

  • Is used on UNIX and older Windows computers.
  • Has had a number of variants over the years. In the U.S., ASCII - ISO 8859-1 (Latin-1) is currently the most common. (In 2004 the 8859-1 working group stopped working on 8859-1 in order to concentrate on Unicode.)
  • Mostly uses 8 binary bits for each character (the original ASCII standard defines only the 7-bit range 00 to 7f). In all variants of ASCII, numbers and English letters map to the same hexadecimal values in the range 00 to 7f. Above 7f, characters have been assigned differently in different variants. ASCII 8859-1 uses the Unicode table U0000 Basic-Latin for hex values 00 to 7f, followed by Unicode Latin-1 for hex values 80 to ff, which supports characters from other West European languages.
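
To illustrate the point about values above 7f, the sketch below decodes the same bytes under three ISO 8859 variants using Python's built-in codecs. The three variants chosen (Latin-1, Cyrillic and Greek) and the helper name `decode_all` are assumptions for the example.

```python
# ISO 8859 variants bundled with Python; chosen only for illustration.
VARIANTS = ["iso8859_1", "iso8859_5", "iso8859_7"]

def decode_all(raw):
    """Return {codec name: decoded text} for the same raw bytes."""
    return {name: raw.decode(name) for name in VARIANTS}

# Bytes in the range 00 to 7f decode identically everywhere.
print(decode_all(b"A1"))

# Hex d1 is the letter Ñ in Latin-1 but a different letter in the other variants.
print(decode_all(b"\xd1"))
```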

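UCS: Universal Character Set:

  • Is used on newer Windows computers.
  • In its 2-byte form (UCS-2), uses 16 binary bits for each character. It runs from hex value 0000 to ffff.
  • Again it uses the Unicode tables. UCS is sometimes confused with Code Page 1252, which is an 8-bit Windows character set based on ISO 8859-1 rather than a 16-bit scheme.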

UTF-8: UCS/Unicode Transformation Format

  • Is used in XML and web-services.
  • Is currently the most popular variant of UTF.
  • Uses 8 binary bits for the characters in the range 00 to 7f, and 16, 24 or 32 bits for all other characters. UTF-8 is another scheme using Unicode.
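
The variable length of UTF-8 can be seen by encoding a few characters and counting the bytes. This is a minimal sketch; the sample characters and the helper name `utf8_length` are chosen only for illustration.

```python
def utf8_length(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

# An English letter, an accented letter, the euro sign and an emoji.
for ch in ["A", "\u00d1", "\u20ac", "\U0001F600"]:
    print(repr(ch), utf8_length(ch), "byte(s):", ch.encode("utf-8").hex())
```

The English letter takes one byte, the Ñ two, the euro sign three and the emoji four, so a field sized in characters may need more room when it is sized in bytes.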

Unicode:

  • Is not directly implemented on hardware; however, hardware may use a scheme which uses Unicode.
  • It has multiple tables that are allocated based on a range of hex values. Its large range of hex values provides support for most modern scripts and also supports scripts of ancient languages. The tables are identified by their starting hex value. Table U0000 - Basic-Latin runs from hex values 00 to 7f and covers the characters on an English keyboard. Table U0080 - Latin-1 runs from hex values 80 to ff and covers additional European characters. Near the top of the 16-bit range is table UFF00 - Halfwidth and Fullwidth Forms, which includes the Korean half-width jamo.
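
The difference between a character's Unicode value and the bytes an encoding scheme uses for it can be sketched with Python's standard `unicodedata` module. The helper names `describe` and `table_for` are hypothetical, and `table_for` only knows the two tables mentioned above.

```python
import unicodedata

def describe(ch):
    """Return the Unicode value and official name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def table_for(ch):
    """Name the table for a character, covering only the first two tables."""
    value = ord(ch)
    if value <= 0x7F:
        return "U0000 Basic-Latin"
    if value <= 0xFF:
        return "U0080 Latin-1"
    return "another table"

for ch in ["A", "\u00d1"]:
    print(describe(ch), "|", table_for(ch))
```

The value reported here is the same whichever scheme later stores the character; only the bytes on disk or on the wire differ.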

Standards Using Character Encoding Schemes

In the motor vehicle arena, the following standards may be used for communications. A description of some limitations with the character encoding is included.

AMIE: On the AAMVA network, messages are restricted to characters that can be used in the ASCII, EBCDIC-1140 and UTF-8 encoding schemes. These are sometimes referred to as “Printable Characters”. The limitation exists because different types of computers use different encoding schemes, so the only characters that can be used across the network are those common to all the computers connected to it. Other characters are not transferable between ASCII, UTF-8 and EBCDIC and must be excluded. The characters allowed are:

  • space
  • a to z
  • A to Z
  • 0 to 9
  • ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ \ _ { | } ~
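
A check against the list above could be sketched as follows. The names `AMIE_ALLOWED` and `disallowed_characters` are hypothetical and this is not an official AAMVA validation routine; it simply tests each character against the list as given.

```python
import string

# Punctuation copied from the allowed list above.
AMIE_PUNCTUATION = "!\"#$%&'()*+,-./:;<=>?@\\_{|}~"
AMIE_ALLOWED = frozenset(
    " " + string.ascii_letters + string.digits + AMIE_PUNCTUATION
)

def disallowed_characters(message):
    """Return the characters outside the allowed list, in order of first use."""
    found = []
    for ch in message:
        if ch not in AMIE_ALLOWED and ch not in found:
            found.append(ch)
    return found

print(disallowed_characters("SMITH, JOHN 123"))  # nothing to report
print(disallowed_characters("PE\u00d1A\tJR"))    # the accented letter and the tab
```

Running such a check before a message is sent would flag the accented letters, tabs and nulls described earlier.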

XML: XML normally uses the UTF-8 encoding scheme. UTF-8 uses the Unicode character tables, so it supports characters from many alphabets, including characters for Asian words. It supports the white space characters "space", "tab" and "paragraph" (carriage return-line feed). It cannot handle other binary data which is not mapped to a printable character.
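
The restriction on binary data can be sketched as a check against the character ranges that the XML 1.0 specification permits. The function name `xml10_illegal_characters` is hypothetical, and the ranges are taken from the XML 1.0 definition of a legal character rather than from this page.

```python
def xml10_illegal_characters(text):
    """Return (position, value) pairs for characters XML 1.0 does not permit."""
    def allowed(value):
        return (
            value in (0x09, 0x0A, 0x0D)      # tab, line feed, carriage return
            or 0x20 <= value <= 0xD7FF
            or 0xE000 <= value <= 0xFFFD
            or 0x10000 <= value <= 0x10FFFF
        )
    return [
        (position, f"U+{ord(ch):04X}")
        for position, ch in enumerate(text)
        if not allowed(ord(ch))
    ]

print(xml10_illegal_characters("A\tB\r\nC"))  # white space is accepted
print(xml10_illegal_characters("A\x00B"))     # a mainframe null is reported
```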

Another twist in XML is the escaping of the characters < > & ' and ". In an XML document these are represented as &lt; &gt; &amp; &apos; and &quot;. So the text Speeding > 5mph over "posted limit" becomes Speeding &gt; 5mph over &quot;posted limit&quot;. When XML data is processed (e.g. in a web service or a parser), the escape sequences may be converted to and from their XML form. An unfortunate situation existed in early releases of XML software, in that different software converted different lists of escape characters.
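
The conversion to and from the escaped form can be sketched with Python's standard `xml.sax.saxutils` module. By default that module converts only &, < and >, so the two quote characters are passed in explicitly; the wrapper names `to_xml_text` and `from_xml_text` are hypothetical. Note that each escape sequence ends with a semicolon.

```python
from xml.sax.saxutils import escape, unescape

# The library handles &, < and > itself; the quotes must be listed explicitly.
QUOTE_ESCAPES = {'"': "&quot;", "'": "&apos;"}
QUOTE_UNESCAPES = {entity: ch for ch, entity in QUOTE_ESCAPES.items()}

def to_xml_text(text):
    """Escape all five special characters for use in an XML document."""
    return escape(text, QUOTE_ESCAPES)

def from_xml_text(text):
    """Reverse to_xml_text."""
    return unescape(text, QUOTE_UNESCAPES)

original = 'Speeding > 5mph over "posted limit"'
escaped = to_xml_text(original)
print(escaped)  # Speeding &gt; 5mph over &quot;posted limit&quot;
print(from_xml_text(escaped) == original)
```

The fact that the quotes have to be requested separately is a small example of different software converting different lists of characters.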

ANSI X12 and EDIFACT: Conflicts between data and X12 separators can occur. X12 does not define any specific characters to be used as delimiters. Instead, delimiters are defined for each interchange by their first use in the interchange start segment. The asterisk is the preferred data element separator, but others could be specified. The paragraph mark (carriage return) is preferred as the data segment terminator. The only constraint on the choice of delimiter characters is that they are not to be used elsewhere in an interchange.

In practice, problems have been identified because some organizations have adopted X12 standards which use the asterisk as a separator (e.g. the Automated Commercial Environment's (ACE) truck e-manifest) and some motor vehicle agencies use the asterisk in their data (e.g. in the license number).
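
A conflict of this kind can be detected before an interchange is built by scanning the data for the chosen delimiters. The sketch below is illustrative only: the function name `delimiter_conflicts` is hypothetical, the default delimiters are the preferred asterisk and carriage return described above, and the license number is made up.

```python
def delimiter_conflicts(values, delimiters="*\r"):
    """Return (value, offending characters) for each value that contains a delimiter."""
    conflicts = []
    for value in values:
        hits = sorted({ch for ch in value if ch in delimiters})
        if hits:
            conflicts.append((value, hits))
    return conflicts

fields = ["D123*456", "SMITH"]          # made-up license number and surname
print(delimiter_conflicts(fields))      # the asterisk in the first field is reported
print(delimiter_conflicts(fields, "|~"))  # no conflict with different delimiters
```

Because X12 allows the delimiters to be chosen per interchange, one possible remedy is to pick characters that the scan shows are absent from the data.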

Line Sequential Data: This is common in legacy systems for identifying the end of a record or data field. The "paragraph mark" is used as a delimiter between records or fields, so the paragraph mark must not be included in any of the data.

ICAO 9303 format for Names: This is a standard used internationally on travel documents like passports. It uses the "<" character as a delimiter between names.
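
As a rough illustration of "<" acting as a delimiter, the sketch below splits a name field into its parts. It assumes the convention from ICAO Doc 9303, not stated on this page, that a double "<<" separates the primary identifier (surname) from the secondary identifiers (given names), a single "<" separates the parts within each, and trailing "<" characters are filler. The function name `parse_mrz_name` and the sample name are hypothetical.

```python
def parse_mrz_name(field):
    """Split a machine-readable name field into primary and secondary name parts."""
    trimmed = field.rstrip("<")                  # drop trailing filler
    primary, _, secondary = trimmed.partition("<<")
    return {
        "primary": [part for part in primary.split("<") if part],
        "secondary": [part for part in secondary.split("<") if part],
    }

print(parse_mrz_name("ERIKSSON<<ANNA<MARIA<<<<<<<<"))
```

Since "<" carries structural meaning here, a name containing that character could not be represented without being altered first.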

2D Bar Code PDF417: This is a requirement in the AAMVA DL-ID Card Standard. It uses a few binary control characters as delimiters.

Magnetic Stripe:  This is an option in the AAMVA DL-ID Card Standard.  It uses most of the non-letter and non-number characters as delimiters and control characters.


Contact: Mark Pritchard, (703) 908-5790