Multilingual Support - Concepts and Terminology

[27-Sep-99 / hh] Excellent illustrative summary:
http://www-4.ibm.com/software/developer/library/utfencodingforms/index.html
Emphasizes the distinction between representation and rendering (character and glyph).

Glossary: http://www-4.ibm.com/software/developer/library/glossaries/unicode.html

Citation from: http://lcweb.loc.gov/marc/marbi/1998/98-18.html

This note follows the terminology practice of international standards that refer to a sequence of eight bits as an octet rather than a byte, even though in today's world a byte is almost always an octet.

Unicode (trademark) is the coded character set now defined by The Unicode Standard, Version 2.0, but it should be understood to include later versions as they result from the process of maintaining the exact correspondence of character repertoire and code point assignments with ISO/IEC 10646. Though they are identical in those respects, Unicode differs from ISO/IEC 10646 in defining character semantics and properties to facilitate interoperability between conformant applications. These definitions, incorporated in The Unicode Standard, Version 2.0, are an integral part of the concept of Unicode.

ISO/IEC 10646, the Universal Character Set (UCS) standard, defines two forms of encoding. The more capacious requires 31 bits per character, permitting the definition of a very large repertoire. Because a 31-bit character occupies four octets, this form is known as UCS-4. The other form requires 16 bits (two octets) per character; hence it is called UCS-2. The 65,536 values that can be represented in UCS-2 are enough to encompass most of the characters used in contemporary languages, and UCS code values for them have been assigned in that range. The set of possible UCS-2 values therefore has another name, the Basic Multilingual Plane (BMP) of the UCS.
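
A minimal sketch of the two fixed-width forms, written for these notes and not taken from the cited text. The helper names (ucs2_octets, ucs4_octets) are hypothetical, and big-endian octet order is an assumption made only for the illustration.

```python
import struct

BMP_SIZE = 0x10000  # count of distinct 16-bit values

def ucs2_octets(ch):
    """Two-octet form (big-endian assumed); only defined for BMP characters."""
    code = ord(ch)
    if code >= BMP_SIZE:
        raise ValueError("character lies outside the BMP")
    return struct.pack(">H", code)

def ucs4_octets(ch):
    """Four-octet form (big-endian assumed); the top bit stays zero."""
    return struct.pack(">I", ord(ch))

print(ucs2_octets("A").hex())  # 0041
print(ucs4_octets("A").hex())  # 00000041
```

The same character value is simply padded to a different width; nothing else about it changes between the two forms.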

No UCS character assignments outside the BMP have been made. The character repertoires and code value assignments in the BMP and in Unicode are the same. In this sense Unicode and the UCS BMP are effectively synonymous. Unicode includes a stratagem, "surrogates," that can provide access to roughly a million non-BMP characters that may be assigned in the future. Such assignments are likely as coverage of ideographs becomes more comprehensive.
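
The "roughly a million" figure can be checked with a little arithmetic. The sketch below is not from the cited note: the surrogate ranges (D800-DBFF and DC00-DFFF) are taken from The Unicode Standard, and the function names are hypothetical.

```python
HIGH_START = 0xD800   # first high surrogate (per The Unicode Standard)
LOW_START = 0xDC00    # first low surrogate
RANGE_SIZE = 0x400    # 1,024 values in each surrogate range

def to_surrogates(code_point):
    """Split a non-BMP code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    offset = code_point - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)

def from_surrogates(high, low):
    """Recombine a surrogate pair into the code point it stands for."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)

# Every high surrogate can be paired with every low surrogate.
reachable = RANGE_SIZE * RANGE_SIZE
print(reachable)  # 1048576
```

Each pair carries 20 bits of payload (10 in each half), which is where the count of 1,048,576 comes from.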

Another ISO/IEC 10646 concept is the UCS Transformation Format (UTF). UTFs are alternative representations of UCS-4 and UCS-2. They are designed to enable communication protocols to transfer UCS data without confusion or loss. A feature of UTF representation is that not all characters require the same number of bits.

UTF-16 expresses a UCS character as a sequence of one or more 16-bit sequences. This format is a "transformation" only for characters that cannot be represented in UCS-2. For those that can, the UCS-2 and UTF-16 encodings are identical.
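
To see this with an actual codec, here is a small sketch using Python's built-in "utf-16-be" encoding. The helper name utf16_units is hypothetical, and the big-endian variant is chosen only to keep the output free of a byte-order mark.

```python
def utf16_units(ch):
    """Return the 16-bit values that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# A BMP character: one 16-bit value, equal to its UCS-2 code value.
print([hex(u) for u in utf16_units("\u20ac")])      # ['0x20ac']

# A non-BMP character: two 16-bit values.
print([hex(u) for u in utf16_units("\U00010000")])  # ['0xd800', '0xdc00']
```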

UTF-8 provides for safe transmission in 8-bit environments, such as the Internet. It expresses a character as one or more octets. ASCII characters require a single octet. Other BMP characters require two or three. An 8-bit ASCII character and its UTF-8 encoded value are identical.
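
The octet counts above can be observed directly with Python's built-in "utf-8" codec. This is only an illustrative sketch; the sample characters and the helper name utf8_length are choices made for these notes, not part of the cited text.

```python
def utf8_length(ch):
    """Number of octets UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

print(utf8_length("A"))        # 1  (ASCII)
print(utf8_length("\u00e9"))   # 2  (Latin small letter e with acute)
print(utf8_length("\u20ac"))   # 3  (euro sign, still inside the BMP)

# An ASCII character and its UTF-8 encoding are the same octet.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```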

UTF-7, devised to support 7-bit transfer protocols such as MIME, also expresses characters as sequences of octets but necessarily uses a more restrictive rule than UTF-8 about what each octet can contain.
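
A quick way to see the restriction is to encode a string containing a non-ASCII character and check that every resulting octet fits in seven bits. The sketch relies on Python's built-in "utf-7" codec; the sample text and the helper name is_seven_bit_clean are hypothetical.

```python
def is_seven_bit_clean(data):
    """True if no octet in data has its high bit set."""
    return all(octet < 0x80 for octet in data)

text = "Price: 10 \u20ac"

as_utf8 = text.encode("utf-8")
as_utf7 = text.encode("utf-7")

print(is_seven_bit_clean(as_utf8))      # False: the euro sign needs high-bit octets
print(is_seven_bit_clean(as_utf7))      # True
print(as_utf7.decode("utf-7") == text)  # True: nothing is lost in transfer
```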

USM-94. During its deliberation, the task force found it convenient to have a concise way to refer to the USMARC character repertoire and encoding that are currently in use; that is, to abbreviate "the ASCII and ANSEL character sets (except for certain ANSEL characters), special escape sequences for a limited set of subscripts, superscripts, and Greek letters, and the ISO 2022 (X3.41) escape sequences for Arabic, Cyrillic, Hebrew and CJK, and the character sets to which those sequences provide access." The term USM-94 has been coined for this purpose.