SURVEY OF CODE PAGE HISTORY

This document is sure to be incomplete and cannot be faultless. Nevertheless, I have tried to make it as useful as possible.

 

Introduction
The historic period of mainframes and minicomputers
Table of numbers of "non-English" characters in the alphabets of some selected languages
References
The Beginning of the Eight-Bit Era
Some examples of code tables (random sequence)
References
The DOS PC Period
Popular DOS Code Pages
Difference Between American PC-8 (CP437) and Other CPs
Consequences of Differences Between Individual CPs
Terminology problems
How to Check Correct Functioning of a National Alphabet Program in the DOS Operating System
Problems with Programs Expected to Work with Multiple Alphabets
References
ISO 8859-x
Language coverage by ISO-8859-x
References
Windows 3.1
Code pages
Known Windows Code Pages (some of them were apparently used much later than in the Windows 3.1 period)
Fonts
Consequences of Differences Between OEM and Windows Code Pages
How to Check Correct Functioning of a National Alphabet Program under the Windows Operating System
References
Windows 9x, Windows NT
Code Pages
Fonts
Font Usage Compatible with Windows 3.1
Relation to DOS
Possibility of code page identification on text entry
References
Unicode
Unicode Features
Inconveniences of Unicode
Basic Forms of Unicode Record
References

 


Introduction

In the beginning there was The Word. The Word was written in the seven-bit ASCII.

And the Word spread. From its native America to other parts of the world. To areas where the Latin alphabet was not used, to areas where letters were accompanied with diacritics forming new letters. The seven-bit world (where each letter was represented with a number between 0 and 127 in the computer) began to lose pace.

So the time came for the eighth bit of the Byte (until then neglected) to have its say. Originally used for a mere control of the transferred data with the help of parity, the eighth bit now extended the character set by 128 new positions.

These numbers were used to identify the characters of national alphabets and other graphic symbols such as currency signs or frames. It soon became obvious that, given the variety of languages and of the individual letters within them, even 256 characters would not be enough to cover all the letters of all the world's alphabets. The simplest and then sufficient solution was to use an individual character code for each individual language; all the letters of one language could always be placed within the 256 available positions.

However, this system has a number of fundamental flaws. Although it is possible to transfer documents within an environment using the same character set, a text written, for example, in Czech remains very difficult to transfer and display in, for example, a French environment: the number that stands for the Czech letter "š" may well stand for a completely different character in the French encoding.

In addition, virtually every company with influence in IT considered it an obligation to develop and promote its own encoding – the small Czech Republic, for example, used more or less six (!) different encodings for Czech, i.e. six different computer representations of real Czech letters. It is only the users who pay for this absolute lack of consideration on the part of the computer companies. The only way out of this situation is the purposeful promotion of international standards independent of computer companies. Even in the Czech Republic the situation is slowly but surely beginning to consolidate (perhaps also thanks to the Internet). The support for ISO-8859-2 is beyond discussion: this encoding must be accepted by any e-mail or WWW client claiming MIME compatibility. Thanks to the monopoly position of Microsoft in operating systems, the second most widely used representation will long remain windows-1250. The other encodings are slowly dying away, gradually replaced by the quicker and more stable Unix or by the ordinary-user-oriented MS Windows.

The fact that it is sometimes difficult to agree even on a single encoding of one language shows what problems must accompany transfers of documents between languages. Each language has its own set of characters, sometimes in several variants, and a document written in anything but the native language is most likely to be displayed wrongly. The ever-increasing need for electronic exchange of documents inevitably leads to the need for a global alphabet enabling easy exchange of documents. As the global use of English (and thus the elimination of national-alphabet problems) is still a question of the very distant future, another solution had to be found.

From www.cestina.cz – Unicode: A Way Out of the Chaos of Character Encoding, Jaromír Dolecek, ÚVT MU, 22 March 2000

The historic period of mainframes and minicomputers

The first computers one could encounter were happy to be able to work with numbers, a couple of symbols and the capital letters of the English alphabet. At that time there were many different ways of representing characters in a computer, most of them driven by the tiny memories of the machines of the time, which only allowed very short representations. The most frequent starting point was the then widespread five-bit punched tape.

With improved peripherals and increased memories, eight-bit representation of characters in memory began to be introduced. At first the character sets remained the same as before: capital letters of the English alphabet, digits, and selected mathematical symbols and punctuation marks. At that time there was a great difference between the world of mainframe computers, such as the IBM 360 series, and the rest of the systems. The former used the encoding known as EBCDIC, based on the punched card code. (The role of EBCDIC decreased with time and there is no point dealing with it any further.) The other computers mostly used a seven-bit code derived from the international telegraph alphabet. This encoding became the basis of present-day ASCII. As input and output peripherals developed, small letters of the English alphabet were introduced, and so all 128 characters of the code came into use. The eighth bit was still used for parity control of transfers and therefore did not carry meaning.

The ASCII (American Standard Code for Information Interchange, ANSI X3.4-1968) table was created on the basis of a clear logic:

Decimal values   Meaning

0 - 31           Service characters for communication control, the so-called control codes
32               Space
33 - 63          Mathematical signs and punctuation marks, digits, the dollar sign
64 - 95          Capital letters of the English alphabet with a couple of other characters: @, [, \, ], ^ and _
96 - 127         Small letters of the English alphabet with a couple of other characters: `, {, |, } and ~

 

The letters were encoded in alphabetical order, and a simple change of a single bit converted a small letter into a capital one and vice versa.
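A minimal C illustration of this single-bit property (the pattern itself, not any particular library routine):

      #include <stdio.h>

      /* In ASCII the case of a letter is selected by a single bit, bit 5
         (value 32): 'A' = 65, 'a' = 97. Toggling it switches the case. */
      int main(void)
      {
          printf("%c %c\n", 'A' ^ 32, 'z' ^ 32);   /* prints: a Z */
          return 0;
      }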

This character structure was not convenient for countries using characters not included in the table; a good example is Great Britain, which was missing the pound sign. Therefore the differentiation of individual national ASCII tables began: individual nations replaced some of the included characters with others they needed more. The typically replaced characters included, for example, #, $, @, [, ], | and their counterparts among the small letters – in all, 12 ASCII characters were open to modification. Residues of that period may still be found in printer setup manuals, which allow the use of some of the national variants of the seven-bit encoding. The nations at least partly satisfied with this solution were mostly West European: the French, the Germans, the Spaniards (2 versions), the English, the Danes (2 versions), the Swedes, the Italians and the Norwegians, plus currency symbols for Japan and Korea. Another frequently used option was overprinting of some characters: a base character printed first, with a diacritic struck over it in the same position.

ASCII and its national variants were internationally standardised in 1972 as ISO 646.

 

Table of numbers of "non-English" characters in the alphabets of some selected languages

(The table has been compiled on the basis of the document "Repertoire of characters used for writing the indigenous languages of Europe", CEN/TC304 N634, fourth draft.)

Alphabet       Number of needed "new" characters

Albanian       4
English        20 (including characters of other alphabets used in English words)
English only   1 (pound symbol)
Czech          30
Danish         24
Estonian       12
Finnish        10
French         30
Hebrew         39
Dutch          10
Croatian       10
Irish          10
Icelandic      20
Lithuanian     18
Latvian        26
Hungarian      18
German         7
Norwegian      12
Polish         18
Portuguese     26
Romanian       10
Russian        47
Greek          52
Slovak         24
Slovenian      6
Spanish        14
Swedish        10
Turkish        16

The numbers do not include the characters of neighbouring languages – for example, a couple of French characters may appear in German texts, etc. The table demonstrates that, among European languages using the Latin alphabet, the greatest numbers of characters are needed by French and Czech. Nevertheless, this is no reason for any nation to curse its predecessors, for no nation is alone in this. And the number of special characters is not a problem in itself; the largest problem is the very existence of the differences.

The only method available in the seven-bit environment was switching between the international and the national code table: two service characters, SI and SO ("shift in" and "shift out"), were available for the switching, and later escape sequences beginning with ESC were developed for this function. This solution was only temporary and was only used where the parity control bit had to be preserved. The development was quite chaotic and international standardisation lagged considerably behind practice. This is one of the reasons for the later great number of different code tables.

 

References

Repertoire of characters used for writing the indigenous languages of Europe

ISO 646 (Good old ASCII)

 


The Beginning of the Eight-Bit Era

Towards the end of the mainframe and minicomputer era the systems were adapted to work with eight-bit characters, equipped with both the relevant hardware and the appropriate operating systems. The adaptation did not happen overnight, for hard clearing of the eighth bit could be found in virtually every minicomputer program (even today the consequences can be felt in the unreliability of electronic transfer of pure eight-bit characters). The question remained what characters to place in the upper half of the code table. The easiest solution seemed to be to take the national code tables formerly switched in the seven-bit mode with the SI and SO characters. The first 32 positions of the upper half of the table were reserved for service characters; the reason was compatibility with the seven-bit representation in case the eighth bit was lost. Even the capacity released this way was insufficient to cover the whole of Europe, though, let alone some of the more exotic parts of the world.

 

Some examples of code tables (random sequence):

Latin 1

This table includes nearly all letters needed for West European languages, with the omission of only a few rarely used characters. Today this table is known under the standardised name ISO-8859-1. Like in the ASCII table, capital and small letters are distinguished by a single bit.

Latin 2

The table contains all letters needed for Central and East European alphabets based on Latin. Today the table is known under the standardised title ISO-8859-2.

Roman 8

A counterpart to Latin 1 produced by Hewlett-Packard. This company standard was (and may still be) used by HP in their computers and peripherals. Even recently the code table has been used in internal HP systems, causing incompatibility of electronic mail between HP and the rest of the world. Unlike Latin 1, its letter layout follows no visible system.

East 8

A counterpart to Latin 2 produced by HP for Central and Eastern Europe.

MacRoman

A counterpart to Latin 1 produced by Apple for Macintosh computers.

KOI8-R

A Russian alphabet table produced in Russia. This table represents a firm national standard for countries using the Russian alphabet (Cyrillic) and is still the basic charset of the Russian Internet. As in the ASCII table, capital and small letters are distinguished by a single bit. A counterpart of KOI8-R is used in Ukraine.

KOI8-CS

A table of the Czechoslovak alphabet standardised in the Czechoslovakia of the time. The table was used in the Czech minicomputers and peripheral devices produced then. Unlike the other tables, this one contains a special representation of the double character CH, which in Czech is considered a separate letter. Serious application of this system ended with the CP/M operating system. As in the ASCII table, capital and small letters are distinguished by a single bit.

 

It is clear that in that period there was no unity: there were many different company and national standards, and international standards were much delayed.

 

Operating system support for work with national alphabets and languages differed, and was sometimes very advanced. For example, the operating system for HP 1000 minicomputers enabled a multilingual environment (each user could theoretically work in a different language), with program code separated from libraries of localised texts. The support in the operating system libraries was very advanced, enabling, for example, the definition of the three-pass sorting algorithm required for alphabetical sorting in Czech.

 

 

References

Codepage & Co.

 

 


The DOS PC Period

Much of the following lacks logic; we can only hope that some logic exists, so well disguised that it is obvious only to its creators.

 

PC computers have worked with the eight-bit code table from the very beginning. Unfortunately, all that had been created and achieved with the minicomputers was later thrown away and re-created.

The ASCII table became the IBM PC company standard, extended by a total of 128 characters in the upper half, including:

·        Some letters of West European alphabets

·        Signs of some world currencies (Dollar, Cent, Yen, Florin)

·        Semi-graphical symbols for easy framing

·        Part of the Greek alphabet, frequently used in mathematical or physics texts

·        The most common mathematical symbols

The way those characters are positioned lacks logic: first accented capital and small letters, then semi-graphics, and then Greek letters and mathematical symbols. The character set began to be referred to as a Code Page. The Code Page identification shows no system or logic either; we can only suppose that it originated in the page numbering of some basic document. The basic American code page is identified as CP437, the West European code page is CP850, and so on.

 

Popular DOS Code Pages

Identification   Name                       MS-DOS/Windows 9x support for

437              PC-8                       USA, West European countries, Latin America (in many of them not all letters of the alphabet)
708              Arabic                     Arabic countries (semi-graphics structured differently from CP437; structuring of Arabic letters corresponds to ISO-8859-6)
720              PC-Arabic                  Arabic countries (just the major characters; preserves the semi-graphics of CP437)
737              PC-Greek                   Greece
774              ???
775              PC-Latin4 (BaltRim)        Estonia, Lithuania, Latvia
819              ISO-8859-1                 (no standard support)
850              PC-Latin1 (Multilingual)   Latin America, Western Europe, Australia, New Zealand, Albania. (System tables of MS-DOS and Windows, in addition, mistakenly list this CP for some other countries, such as Bulgaria, China, Poland, the Czech Republic, Slovakia, Ukraine, Belarus, Croatia...)
852              PC-Latin2 (Slavic)         Albania, Bosnia and Herzegovina, Croatia, the Czech Republic, Slovakia, Hungary (a non-Slavonic country), Poland, Romania (a non-Slavonic country), Russia (mistakenly – a different alphabet)
853              PC-Turkish                 (contains characters of other alphabets, limited semi-graphics)
855              PC-Cyrillic                Bulgaria, Macedonia, Russia, Yugoslavia (limited semi-graphics)
857              PC-Turkish                 Turkey (limited semi-graphics)
858              PC-Eur                     Latin America, Western Europe, Australia, New Zealand, Albania (identical with CP850 except for character 213, replaced with the sign of the Euro; no standard Microsoft support)
860              PC-Portuguese              Portugal
861              PC-Icelandic               Iceland
862              PC-Hebrew                  Israel
863              French Canada              Canada (French)
864              PC-Arabic                  Arabic countries (part of the semi-graphics, structuring different from CP437)
865              PC-Nordic                  Denmark, Norway
866              PC-Russian                 Russia (contains the same semi-graphics as CP437; there is a "Latvian" modification with limited semi-graphics and a "Ukrainian" modification)
867              MJK                        Czech and Slovak Republics (national CP, no standard Microsoft support, same as 895)
869              PC-Greek2                  Greece (limited semi-graphics, letters of CP437 in different positions)
874              PC-Thai                    Thailand
895              MJK                        Czech and Slovak Republics (national CP, no standard Microsoft support, same as 867)
896              Mazowia                    Poland (national CP, no standard Microsoft support)
932                                         Japan
936                                         China (simplified)
938                                         Taiwan
949                                         Korea
950                                         China, Taiwan (traditional)

 

Difference Between American  PC-8 (CP437) and Other CPs

The earlier situation repeats itself. The basic character table contains a great number of characters, but it can only be used unchanged in some parts of the world; elsewhere it must be modified. Again the decisive factors are the number of "new" characters to be squeezed into the CP and the intended range of international use.

Direct modification of CP437 resulted, for example, in the CPs for Canada (863), Portugal (860) and Northern Europe (861, 865). In those cases the number of "new" characters was so small that substituting a couple of unneeded CP437 characters was usually sufficient. The result followed the structure of the original CP437, including most semi-graphic characters; what usually changed were the positions of small and capital letters.

Multilingual CPs and non-Latin alphabets usually required more profound modifications: the characters that had to be sacrificed often included most of the Greek alphabet, most of the semi-graphics, and mathematical symbols. The positions of letters of other national alphabets also had to change: for example, 'A circumflex' occupies position 132 in CP863 but position 182 in CP850, because the character at position 132 in CP437 was 'a dieresis', which remained in that position in CP850. In the multilingual CP850 and CP852 identical characters occupy identical positions; similar but distinct characters, however, could not be placed in the same positions.

 

Many countries, despite the existence of applicable multilingual CPs, kept developing national and corporate standards. The reasons included the following:

·        Preservation of semi-graphical and mathematical symbols and Greek alphabet

·        Emergency legibility of the national alphabet on computers with unsuitable hardware (based on the visual similarity between some national letters and characters contained in CP437/PC-8)

·        Small or no support of hardware producers for the particular geographical region

·        Other – for example corporate – reasons

So for example Czechoslovakia used two more charsets, Hungary and Croatia at least one more, Poland about eight more character sets.

The consequences of the national and corporate “standards” were unfavourable:

·        There were considerable differences between the individual CPs (the same character had different internal values in different CPs) which led to the need for continuous conversions

·        Lack of functionality and the consequent abandonment of the already weak operating system support

·        Difficulties in international exchange of programs and texts

 

Consequences of Differences Between Individual CPs

The DOS operating system provides basic user and programming support for the conforming CPs, including:

·        User support, including software for simple country identification (based on the country codes used in international telephone dialling), loading of the relevant character set into the video card, and a translation program for the national keyboard layout. There are special versions of DOS for languages based on other ways of writing (Hebrew, Arabic) or on multi-byte characters (Korean, Chinese, Japanese).

·        Programming support includes DOS operating system functions, enabling:

o       Identification, and change, of the country code (the international dialling code)

o       Identification, and change, of the code page set for display (within the limits given for the particular country)

o       Acquisition of the small to capital letter translation table

o       Acquisition of the character table for use in filenames

o       Acquisition of the table showing alphabetical order of characters

o       Acquisition, and change, of information about the used keyboard structure code

o       Acquisition of information about display of figures, date and time


Unfortunately, this support is insufficiently documented and lacks some important functions, such as a translation table for capital-to-small letter translation or a table classifying letters and other characters. It is also unclear to what extent this support is implemented in Windows NT.
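As a sketch of what this support looks like from a program: DOS 3.3 and newer can report the active code page through interrupt 21h, function AX=6601h. The fragment below uses the int86 interface from dos.h as found in the Borland/Turbo C family and therefore needs a 16-bit real-mode DOS compiler; it is an illustration, not a portable program.

      #include <dos.h>
      #include <stdio.h>

      /* Query the active and the system code page via interrupt 21h,
         function AX=6601h (DOS 3.3 and newer). */
      int main(void)
      {
          union REGS r;
          r.x.ax = 0x6601;             /* function 66h, subfunction 01h: get code page */
          int86(0x21, &r, &r);
          if (r.x.cflag) {
              printf("code page query failed (DOS older than 3.3?)\n");
              return 1;
          }
          printf("active code page: %u\n", r.x.bx);
          printf("system (boot) code page: %u\n", r.x.dx);
          return 0;
      }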

A programmer may affect how well a program functions in another environment by:

·        Considering use outside the USA (modifiable translation and sorting tables, character classification tables). Generally, programs coming from outside the USA and the other English-speaking countries – primarily programs from Russia or Eastern Europe – are easier to adapt. The disadvantage is that each programmer usually creates his own standards. Another important aspect is deciding which components of the program are relevant and which are not:

o       Whether to use semi-graphics and mathematical symbols (semi-graphics will for example be problematic in countries using other than Latin alphabet)

o       Whether it is necessary to use single and double line crossings in frames (such crossings, for example, are not contained in the multilingual CPs, and can easily be substituted with a single, non-crossing line without any decrease in aesthetic quality)

o       Whether it is necessary to translate between small and capital letters

o       Whether it is necessary to distinguish between letters and other characters

o       Whether it is necessary to classify characters, and if yes, then how accurate the classification should be

o       Whether the program is to cooperate with others, or whether it can be operated in more than one language version

o       .…

·        Using the support built into the operating system. Some "shrewd" programs create their own font definitions in the video card and their own keyboard layout; the cooperation of such programs with the operating system is usually more negative than positive.

 

Terminology problems

 

There is considerable variety and inaccuracy in the use of the basic terms.

1.      Character sets are now called "Code Pages".

2.      ASCII is often mistakenly used for identification of pure text in any eight-bit character set (code page), even if this has nothing to do with the original American standard for seven-bit character encoding.

3.      In newer materials Microsoft prefer to use the notion OEM code page for DOS character set used in the particular country, though the OEM abbreviation should correctly be used for hardware component manufacturers.

 

How to Check Correct Functioning of a National Alphabet Program in the DOS Operating System

1.      Check whether all characters, above all letters, of the applied code page can be used with the program, and whether their representations are correct.

2.      If the program performs translations of small letters into capitals and vice versa under some conditions, check correctness of the translations.

3.      Check whether all letters can be put into filenames. There is a widespread rumour that filenames may only include some of the seven-bit ASCII characters. The truth, however, is that the operating system refuses in filenames only characters whose value is less than 32 decimal (below the space) and a couple of selected special-purpose signs, namely:
                            
."/\[]:|<>+=;,   
This behaviour also conditions the compatibility of the file system with Windows.

4.      If the program involves the sorted representation function (filenames, personal names etc.), check if all letters are put in their places, at least approximately, on the basis of the particular national custom. Common errors involve placement of nation-specific characters after the English ‚Z‘.

Problems with Programs Expected to Work with Multiple Alphabets

 

PC hardware does not enable changes of the displayed character repertoire within a single screen. This results in considerable limitations for programs that need to work with multiple alphabets. Typical examples include mail and Internet clients and text processors.

1.      The simplest way is to display only those characters whose representation is identical in the data code page and in the code page set in the display adapter. This solution, however, is the least convenient one, for it does not enable even informative legibility of the texts. Using this mode for text or data entry is absolutely impossible.

2.      A slightly better option is replacement of non-displayable characters with a pre-defined single character – this at least shows structuring of the text into individual words. The characters most frequently used for this purpose include ? (question mark) or _ (underline). Legibility, however, is not much better than in the previous case.

3.      The most common solution may be replacement of the non-displayable characters with the same characters in their basic form, i.e. without the diacritics, which in practice means translation into the seven-bit ASCII code. Even though this solution is “pure”, for the display is virtually returning into the period before national alphabets, it is insufficient in many cases. Languages using a great number of characters may demonstrate a considerably decreased legibility, for some words may have a different meaning with and without diacritics. Data entry could be performed this way, but with the above limitations.

4.      A much better option is to replace the unknown characters with similar available characters. For example, if the Czech 'N caron' (Ň) needs to be displayed, the Spanish 'N tilde' may be used. Even though this is not quite accurate, users are sure to understand it well, for the two letters are very similar and cannot appear together in a single language. The informative value of such a representation is much higher than in the previous cases. Such transliteration may even be used for characters of non-Latin alphabets. For data or text entry this method is much more convenient, for it offers a much higher probability of preserving the informative value of the entered text. (A sketch of substitutions 2 to 4 follows after this list.)

5.      The best option is the use of the graphic mode. The program operating in the graphic mode is able to draw anything on the screen or printer. Unfortunately, this mode is not ideal for DOS programs for the following reasons:

a.       It is much slower than text modes

b.      It cannot be used in individual windows of the Windows operating system

c.       Such programs must be constantly updated with regard to new graphic adaptors
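As a sketch of substitutions 2 to 4: the fragment below maps a few CP852 bytes onto something a CP437 display can show. The CP852 positions used are those quoted elsewhere in this text; the position 0xD5 for 'N caron' is an assumption, and the substitution table is deliberately tiny.

      #include <stdio.h>

      /* Show CP852 text on a display set to CP437. Bytes below 128 are
         identical in both code pages. A few Czech letters are substituted
         either with a similar CP437 character (option 4) or with the base
         ASCII letter (option 3); anything else becomes '_' (option 2). */
      static unsigned char to_cp437(unsigned char c)
      {
          if (c < 128) return c;      /* plain ASCII, identical everywhere */
          switch (c) {
          case 0x9F: return 'c';      /* c caron -> base letter c */
          case 0xD8: return 'e';      /* e caron -> base letter e */
          case 0xE7: return 's';      /* s caron -> base letter s */
          case 0xFD: return 'r';      /* r caron -> base letter r */
          case 0xA7: return 'z';      /* z caron -> base letter z */
          case 0xD5: return 0xA5;     /* N caron -> CP437 N tilde (option 4);
                                         0xD5 for N caron is an assumption */
          default:   return '_';      /* not displayable (option 2) */
          }
      }

      int main(void)
      {
          const unsigned char cp852[] = { 0xD5, 0x9F, 0xE7, 0xFD, 0xA7, '!', 0 };
          for (int i = 0; cp852[i]; i++)
              putchar(to_cp437(cp852[i]));
          putchar('\n');
          return 0;
      }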

 

References

 


ISO 8859-x

In 1987-88 the ISO (International Organization for Standardization) commission published the first eight code pages of the 8859 standard. Those pages have since been extended to the current number of 15. The development has been carried out by ECMA, the European Computer Manufacturers Association.

ISO-8859-x   Name             Year   Coverage and notes

1            Latin-1          1987
2            Latin-2          1987   Central and East Europe
3            Latin-3          1988
4            Latin-4          1988   Baltic; obsolete
5            Latin/Cyrillic   1988
6            Latin/Arabic     1987
7            Latin/Greek      1987
8            Latin/Hebrew     1988
9            Latin-5          1989   Counterpart to Latin-1; Icelandic characters replaced with Turkish ones
10           Latin-6                 Based on Latin-4 (incompatible); adds characters for Greenlandic Eskimo (Inuit) and Lappish (Sami)
11           Latin/Thai              Thailand
12           (unassigned)            May in future be used for Indian ISCII or Vietnamese
13           Latin-7                 Baltic Rim; Latvian characters not found in Latin-6
14           Latin-8                 Celtic; Gaelic and Welsh characters
15           Latin-9          1998   Based on Latin-1; adds French and Finnish characters missing from Latin-1 and the Euro currency symbol

 

These tables cover most of the world using individual letter-based scripts.

Language coverage by ISO-8859-x

Language             Covered by ISO-8859 part(s)

Albanian             1, 2, 9, 14, 15
Arabic               6
Basque               1, 9, 14, 15
Breton               1, 9, 14, 15
Bulgarian            5
Byelorussian         5
Catalan              1, 9, 14, 15
Cornish              1, 9, 14
Croatian             2
Czech                2
Danish               1, 4, 9, 10, 14, 15
Dutch                1, 9, 15
English              1 - 15 (all parts)
Esperanto            3
Estonian             4, 10, 13, 15
Faeroese             1, 10, 15
Finnish              (1), (4), (9), (10), 15
French               (1), (3), (9), (14), 15
Frisian              1, 9, 15
Galician             1, 9, 14, 15
German               1, 2, 3, 4, 9, 10, 14, 15
Greenlandic          1, 4, 9, 10, 14, 15
Greek                7
Hebrew               8
Hungarian            2
Icelandic            1, 10, 15
Irish Gaelic (new)   1, 9, 10, 14, 15
Irish Gaelic (old)   14
Italian              1, 3, 9, 14, 15
Latin                1, 2, 3, 4, 9, 10, 13, 14, 15 (parts 5 - 8, 11 and 12 uncertain)
Latvian              4, 13
Lithuanian           4, 10, 13
Luxemburgish         1, 9, 14, 15
Macedonian           5
Maltese              3
Manx Gaelic          1, 9, 14
Norwegian            1, 4, 9, 10, 13, 14, 15
Polish               2
Portuguese           1, 3, 9, 14, 15
Rhaeto-Romanic       1, 9, 14, 15
Romanian             (2)
Russian              5
Sámi                 4, 10
Scottish Gaelic      1, 9, 14, 15
Serbian              5
Slovak               2
Slovenian            2, 4, 10
Sorbian              2
Spanish              1, 9, 14, 15
Swedish              1, 4, 9, 10, 14, 15
Thai                 11
Turkish              (3), 9
Ukrainian            5
Welsh                14

A number in parentheses marks a part that covers the language only incompletely.

Most of the code pages are again compromises that do not always fully cover the needs. ISO-8859-1, for example, does not contain the Dutch 'ij', the French 'oe', the German quotation marks and some other characters; ISO-8859-5 in turn misses some of the characters needed for the Bulgarian and Ukrainian alphabets.

The character sets contained in the individual code pages often overlap. This is intentional: the pages are designed so that a character appearing in several of them always occupies the same position, which eliminates the need for translation between the individual ISO tables.

The ISO-8859-x character sets are mostly used on computers running the Unix operating system. They are indispensable to Internet users, for every client program must understand them as standards.

 

References

The ISO 8859 standard series (English, Francais, Deutsch, Espanol)

ISO 8859-11 Latin/Thai Character Set standard

ISO 8859 Quick Reference

 


Windows 3.1

Code pages

When the graphic interface of the Windows operating system was being prepared, the DOS code pages were replaced with new character sets, because semi-graphic characters lost their meaning for programs working in graphic mode. A modified ISO-8859-1 character set was chosen as the basic charset of the operating system; this charset is also called WinANSI. The difference from standard ISO-8859-1 lies mainly in the use of the characters in the range 128-159 (i.e. the control characters with the eighth bit set). Windows character sets are again called code pages. Although the numbering of Windows code pages, unlike that of the OEM pages, is sequential, the assignment of individual numbers to particular code pages again lacks logic (the USA and Western Europe come third in the sequence).

 

Known Windows Code Pages (some of them were apparently used much later than in the Windows 3.1 period)

Identification       Name                          Use for individual countries / languages

1250                 Central European, WinLatin2   Counterpart to ISO-8859-2: Poland, the Czech Republic, Slovakia, Hungary, Romania, Croatia, Slovenia, ... This table cannot be substituted with ISO Latin 2: the positions of some letters and signs differ, because Microsoft needed to keep some special-purpose characters, for example (C), (TM) and (R), in the same positions as in WinAnsi.
1251                 WinCyrillic                   Russia, Bulgaria, Serbia, Macedonia, ... The table differs both from ISO Cyrillic and from the Russian standard KOI8-R.
1252                 WinAnsi, WinLatin1            Counterpart to ISO-8859-1: Western Europe, Australia, New Zealand, America, ... In contrast to ISO Latin 1, the table has been extended with the French 'OE', 'Z caron', 'S caron' and some other characters.
1253                 Greek                         Greece. Differs from ISO-8859-7 in just a couple of characters.
1254                 Turkish                       Turkey. Differs from WinLatin1 as ISO-8859-9 differs from ISO-8859-1.
1255                 Hebrew                        Israel. The positions of all letters are compatible with ISO-8859-8.
1256                 Arabic                        Arabic countries. Preserves the positions of symbols and of some characters assigned to them by WinLatin1; the Arabic letters are compatible with ISO-8859-6 only in the first half of the alphabet.
1257                 Baltic                        Counterpart to ISO Latin 7; identical letter positions in the two tables.
1258                 Vietnamese                    Vietnam. The table is similar to WinLatin1 and differs from the Vietnamese standard VISCII.
932, 936, 949, 950                                 Some East Asian languages use code pages identical with the DOS code pages for the same languages.

 

Generally there is no unambiguous character conversion between the ISO and Windows charsets. For example, ISO-8859-3 is covered by several Windows character sets; on the other hand, some ISO-8859-3 characters have no counterpart in any Windows code page.

The basic Windows code page is fixed at installation and cannot be changed later by any standard procedure. For example, an operating system installed for use in the USA or Western Europe will have the basic code page 1252 defined on installation, an operating system for Russia will have 1251, and a system for the Czech Republic 1250. This fixed setting affects the definition of the system fonts.

Scripts using different ways of writing (Arabic, Hebrew, Korean, Japanese, Chinese) are not expected in the common versions of Windows and are only supported in the versions for the particular geographical regions.

 

Fonts

The Windows 3.1 graphic interface uses TrueType fonts (files with the extension .TTF), each containing a single code page. For programs using multiple code pages the individual fonts are distinguished by name: the font name is followed by a space and an abbreviation of the character set name. While this works with the fonts supplied by Microsoft, some other companies do not observe the principle or use customised identifications of their own. It is therefore not very reliable to use the font name suffix for code page identification; on the other hand, for Windows 3.1 fonts there is no other way. This presents difficulties for programs, which have to presuppose that the selected font is in the right code page. For that purpose the font selection dialogue displays test texts.

Code page   Abbreviation   Note

1250        CE             some companies use "EE"
1251        Cyr
1252        (none)         West
1253        Greek
1254        Tur
1255        Hebr
1256        Arab
1257        Balt
1258        Vietn ?

 

Windows 3.1 also uses fonts other than TrueType – for example fonts in VGA resolution, i.e. files with the extension .FON. These include fonts like Courier (not to be confused with Courier New, which is TTF), Small Fonts, Symbol, MS Sans Serif and MS Serif. Different rules apply to those fonts:

·        They exist only once in the system, in the basic code page fixed at installation, and therefore cannot easily be changed.

·        They are used by most programs in an identical manner for menu texts and other important text displays.

The practical and very inconvenient consequence is that programs made for a code page other than the currently defined one cannot be used. In an operating system installed for the USA it is easy to use a French program, for French programs expect code page 1252, which is installed in the US operating system; with a Czech or Russian program, however, the menus and other texts will be distorted, degraded, and sometimes completely illegible. If a TTF font were used for the menus and texts, everything would most likely display correctly.

System fonts of the FON type and their inconvenient features are still preserved in Windows 9x, and partly also in Windows NT.

 

Consequences of Differences Between OEM and Windows Code Pages

OEM (i.e. DOS) code pages differ considerably from Windows code pages: letters and signs sit in different positions, and the character repertoires also differ. For example, the German sharp s 'ß' is at position 225 in CP437, CP850 and CP852, but at position 223 in Windows CP1250 or CP1252.

Windows programs must support DOS code pages. The reason is simple: some data may come from programs running under DOS. A simple principle applies: every file that DOS programs are able to read or write must be translated from or into the OEM (i.e. DOS) code page. The opposite arrangement is impossible: while Windows knows which DOS code page applies and can work with it (Windows is nothing more than a superstructure over DOS), DOS has no idea of Windows encodings and DOS programs have no translation functions for Windows code pages.

Other situations requiring conversions – this time optional – are the following:

·        Loading of a user-specified text file (either an additional inquiry or an unambiguous identification is needed to tell whether the text is a DOS or a Windows one). A good example is the standard Write or WordPad program: when opening a file with the extension .TXT it is possible to select whether the text is a Windows or a DOS text.

·        Loading of text via the clipboard (Copy/Paste). The receiving program learns the code page of the loaded data and must respond to it.

 

The programming support in Windows is much wider and is described in the documentation of the relevant programming language. The basic translation functions between the OEM and Windows code pages are AnsiToOem, AnsiToOemBuff, OemToAnsi and OemToAnsiBuff. In addition, the national setting (the "locale") affects a number of other functions, for example isupper, isalpha, strupr, tolower and many others.
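A minimal sketch of such a conversion, using the Win32 successors of these functions (CharToOemA and OemToCharA from the standard Windows headers); the sample string is plain ASCII, so the round trip is trivially lossless:

      #include <windows.h>
      #include <string.h>
      #include <stdio.h>

      /* Round trip between the Windows (ANSI) and the OEM (DOS) code page.
         With plain ASCII the trip is lossless; national letters survive
         only if the OEM code page contains them. */
      int main(void)
      {
          char ansi[] = "Sample text in the Windows code page";
          char oem[sizeof ansi];
          char back[sizeof ansi];

          CharToOemA(ansi, oem);   /* ANSI -> OEM: data to be read by DOS programs */
          OemToCharA(oem, back);   /* OEM -> ANSI: data coming from DOS programs */

          printf("round trip %s\n", strcmp(ansi, back) == 0
                 ? "preserved the text" : "changed some characters");
          return 0;
      }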

Common mistakes include the belief that dialog windows perform OEM conversions automatically when the 'OEM Convert' option is selected. This is a wrong assumption: the option merely enables such a conversion to be made. The feature is used, for example, to keep filenames valid: filenames are internally converted into the OEM code page and only then passed on to DOS.

 

How to Check Correct Functioning of a National Alphabet Program under the Windows Operating System

1.      Check whether it is possible to load all characters, primarily letters, of the used code page into the program and whether their representations are correct.

2.      If the program, under certain conditions, performs mutual conversions between small and capital letters, check correctness of those conversions.

3.      Check whether all letters may be used in filenames.

4.      If the program includes the sorted representation function (filenames, personal names etc.), check if all letters are placed in their positions (approximately) according to national customs. A common error is placement of characters with diacritics to the end of the alphabet, after English ‚Z‘.

5.      Programs cooperating with DOS programs must be tested much more profoundly:

a.       National characters must be tested in all positions where user-defined data may appear, as it cannot be foreseen where the user might use a national character. For the user all letters must be equal (but some – the English ones – are more equal than others).

b.      It is necessary to test whether data are stored in the disk files correctly in the OEM encoding. For this purpose it is recommended to use one of the file managers known from the DOS environment, for example Norton Commander, Volkov Commander (free), Disk Navigator (shareware) etc.

c.       It is usually unnecessary to test all national characters; a few selected ones suffice, chosen so that a wrong translation is visible at first sight. The letters at positions 128 - 159 of an OEM code page, for example, are not always ideal, for in the case of a wrong translation some of them display as an empty space. Suitable test characters include the following:

 

Position   CP437            Counterpart in CP1252

161        í (i acute)      ¡ (exclamation mark upside down)
162        ó (o acute)      ¢ (cent)
137        ë (e dieresis)   ‰ (per mille)
138        è (e grave)      Š (S caron)
225        ß (sharp s)      á (a acute)

d.      If the DOS code page used is a multilingual one, it is necessary to test letters differing from CP437, as for example:

 

Position   CP850         Counterpart in CP1252

181        Á (A acute)   µ (Greek mu)
224        Ó (O acute)   Greek alpha

e.       In case of potential cooperation with Internet and translations into ISO character sets it is necessary to test also characters showing differences between ISO and Windows character sets.

f.        If the code page used is a non-Western-European one, it is necessary to include letters differing from CP850 in the test, as for example:

 

Position   CP852         Counterpart in CP1252   Counterpart in CP1250

216        ě (e caron)   Ø (O slash)             Ř (R caron)
231        š (s caron)   ç (c cedilla)           ç (c cedilla)
159        č (c caron)   Ÿ (Y dieresis)          ź (z acute)
253        ř (r caron)   ý (y acute)             ý (y acute)
167        ž (z caron)   § (paragraph)           § (paragraph)

g.       Similar suitable combinations may also be found from the other direction:

 

Position   CP1252        Correspondence in CP437 and CP850

253        ý (y acute)   ² (superscript 2)
225        á (a acute)   ß (sharp s)
154        š (s caron)   none (n/a)

 

 

Position   CP1252        Correspondence in CP437   Correspondence in CP850

243        ó (o acute)   ≤ (less than or equal)    ¾ (three quarters)

 

 

 

Position   CP1250        Correspondence in CP1252   Correspondence in CP852

236        ě (e caron)   ì (i grave)                ý (y acute)
154        š (s caron)   š (s caron)                Ü (U dieresis)
232        č (c caron)   è (e grave)                ź (z acute)
248        ř (r caron)   ø (o slash)                Ŕ (R acute)
158        ž (z caron)   none (n/a)                 × (multiplication sign)

 

 

References

 


Windows 9x, Windows NT

Code Pages

Windows 9x and NT externally use the same code page system as Windows 3.1; internally, however, most programs use the 16-bit Unicode encoding. The code pages are slightly modified Windows 3.1 versions, extended at least by the new Euro currency sign. Although the Euro sign was not included in the first versions of Windows 95 and Windows NT, it penetrated into them too, because newer versions of Internet Explorer or Office update the system fonts on installation. The Euro sign occupies the same position, 80 hex, in all Windows code pages.

 

Fonts

 

The graphic interfaces of Windows 9x and NT use new TrueType fonts that contain multiple code pages; internally these fonts are 16-bit encoded on the Unicode basis. In addition, with certain limitations, older fonts from Windows 3.1 may also be used. The new fonts carry a 64-bit field in the header recording which code pages the font contains. In the first version of Windows 95, designed for the American market, support for fonts in other code pages was limited; it had to be downloaded from www.microsoft.com and added to the installation. These new fonts are sometimes identified in newer documentation as "Open Type". Their full support only arrived with Windows 2000.

Font selection fields have been extended with a selection of the code page, identified as "script". This only helps programs that cannot find out the code page in any other way (unlike, for example, the procedures used by Word 97 and Word 2000). Good examples are the standard Notepad or Write programs.

Programmers may use an option designed specifically for them, enabling the selection of a font together with its code page. The parameters of the "ChooseFont" function have not formally changed; the only change is an extension of the values that may be set in the lfCharset member of the LOGFONT structure, where in the past the only information to be set was the distinction between symbol and alphabet fonts. The values defined for the individual code pages again lack any obvious logic.

 

Code page   Abbreviation   lfCharset (dec)   lfCharset (hex)   Note

1250        CE             238               EE
1251        Cyr            204               CC
1252        West           0                 00
1253        Greek          161               A1
1254        Tur            162               A2
1255        Hebr           177               B1
1256        Arab           178               B2
1257        Balt           186               BA
1258        Vietn ?        ?                 ?
???         Thai           222               DE
1361                       130               82                Korean Johab
current                    1                 01                implicit code page of the relevant Windows installation

Other values can be found in the file INCLUDE\WIN32\wingdi.h installed with the Borland C++ compiler, version 5.0 or newer.

More details can be found in the "Windows API" help file (win32.hlp) supplied with the C++ and Delphi compilers.

 

For the programmer these options mean, among other things, that for each font he must store, in addition to the name, size and bold/italic flags, also the script information: lfCharset.
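A minimal sketch in C of requesting a font in a particular code page through lfCharset (here code page 1250 via the wingdi.h constant EASTEUROPE_CHARSET; the choice of Arial and the height are arbitrary):

      #include <windows.h>

      /* Ask GDI for Arial in code page 1250. The system may satisfy the
         request from a multi-code-page TrueType font or through a
         FontSubstitutes alias such as "Arial CE". */
      int main(void)
      {
          LOGFONTA lf;
          ZeroMemory(&lf, sizeof lf);
          lstrcpyA(lf.lfFaceName, "Arial");
          lf.lfHeight  = -13;                 /* character height in logical units */
          lf.lfCharSet = EASTEUROPE_CHARSET;  /* 238 = EE hex, code page 1250 */

          HFONT font = CreateFontIndirectA(&lf);
          if (font != NULL) {
              /* ... select the font into a device context, draw text ... */
              DeleteObject(font);
          }
          return 0;
      }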

 

Font Usage Compatible with Windows 3.1

In Windows 9x and NT programs the old fonts behave as fonts of the 1252 code page. Similarly, badly written Windows 3.1 programs that set the lfCharset value to 0 will obtain only the 1252 code page from the standard fonts. Windows 3.1 programs with lfCharset set to 1 will automatically receive the code page implicit for the relevant system.

 

Luckily, there is a solution to these incompatibilities. Both Windows 9x and NT enable the definition of an alias for each known font, under which a new name and a new code page can be defined. The substitution may be defined in the following places:

·        For Windows 9x: in the file WIN.INI, section [FontSubstitutes]

·        For Windows NT: in the registration database key
HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes

 

An example of standard font substitutions in the system Windows 9x for Central Europe:

      [FontSubstitutes]

      Helv=MS Sans Serif

      Arial CE,238=Arial,238

      Arial CE,0=Arial,238

      Arial Cyr,204=Arial,204

Meaning of the above lines:

-          The font "MS Sans Serif" will in future be accessible under the new name "Helv" too.

-          The font "Arial" in code page 1250 (WinLatin2) will in future be accessible under the new name "Arial CE" too, whether requested in code page 1250 (WinLatin2) or 1252 (WinAnsi).

-          The font "Arial" in code page 1251 (WinCyrillic) will in future be accessible under the new name "Arial Cyr", expected by older programs.

 

Basic font substitutions for a particular country are defined automatically on installation of the Windows system. Substitutions can probably also be defined for other fonts and code pages; this, however, increases the number of installed fonts, which is limited in Windows 9x. For the automatic definition of further substitutions there is, for example, the program WPMCP (author: Jiří Kuchta).

 

Relation to DOS

Everything mentioned for Windows 3.1 also applies here.

 

Possibility of code page identification on text entry

There is a simple, even if somewhat laborious, way of automatically identifying the active code page on text entry: testing the current keyboard layout. The current keyboard layout can be detected programmatically and the language deduced from it; it is then sufficient to keep a table relating each language to its code page. A minimal sketch follows.
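The sketch below uses two real Win32 functions, GetKeyboardLayout and GetLocaleInfoA; asking the locale for its default ANSI code page replaces the hand-made table mentioned above:

      #include <windows.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Deduce the probable code page of the text being typed from the
         active keyboard layout of the current thread. */
      int main(void)
      {
          HKL layout = GetKeyboardLayout(0);           /* 0 = current thread */
          LANGID lang = LOWORD((DWORD_PTR)layout);     /* low word = language id */
          LCID locale = MAKELCID(lang, SORT_DEFAULT);

          char buf[8];
          if (GetLocaleInfoA(locale, LOCALE_IDEFAULTANSICODEPAGE,
                             buf, sizeof buf))
              printf("keyboard language %04x -> Windows code page %d\n",
                     lang, atoi(buf));
          return 0;
      }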

 

References

Windows NT: Code Pages and Unicode

 


Unicode

The reasons explained at the beginning of this essay led to the foundation of the "Unicode Consortium".

In 1991, after several years of informal collaboration, the Unicode Consortium was formally established. The purpose of the Consortium was to promote and further develop a 16-bit encoding for the characters of the major world languages, together with many historic and archaic characters. Its work has resulted in the Unicode standard, the basis for software internationalisation and localisation. The standard has become part of the wider standard ISO/IEC 10646:1993.

The Unicode standard defines a fixed-width 16-bit encoding scheme designed for the representation of characters in text. This international encoding contains the characters of the major world alphabets and frequently used technical signs. Unicode treats alphabet characters and other symbols in the same manner so that they may be used together. Unicode is modelled on ASCII but uses 16 bits for character identification in order to support multilingual texts. No escape sequences or other control codes are needed for any character of any language.

Unicode is designed to solve two problems common in multilingual computer programs: the availability of fonts for different character encodings, and the existence of several inconsistent character sets resulting from discrepancies between national and industrial character standards.

 

Unicode Features

 

Version 2.0 of the Unicode standard contains 38,885 characters of world alphabets – more than sufficient for normal communication, including some older forms of many languages. The languages currently encoded in Unicode include Russian, Arabic, Anglo-Saxon, Greek, Hebrew, Thai and Sanskrit. The unified Han subgroup contains 20,902 graphic symbols defined by the national and industrial standards of China, Japan, Korea and Taiwan. In addition, the Unicode standard includes mathematical operators and technical symbols (for example some geometrical shapes) and a couple of graphic symbols.

The Unicode standard includes the characters of all major international standards approved and published before 31 December 1990, above all the ISO International Register of Character Sets, the ISO/IEC 6937 and ISO/IEC 8859 families, and also ISO/IEC 8879 (SGML). Other primary sources include bibliographical standards used in libraries (for example ISO 5426 and ANSI Z39.64), the major national standards and various frequently used industrial standards (including the character sets of Adobe, Apple, Fujitsu, Hewlett-Packard, IBM, Lotus, Microsoft, NEC, WordPerfect and Xerox). Version 2.0 also includes the Hangul of the Korean national standard KS C 5601.

Inconveniences of Unicode

·        Longer text. Text converted from an eight-bit encoding into Unicode is twice as long, with no apparent increase in informative value. The result consumes more memory and its processing is also slower.

·        Incompatibility with the eight-bit environment. A Unicode text may "legally" contain bytes that are not usually present in a "normal" eight-bit text and that bear special meaning there – mainly the binary zero, which may appear in a Unicode text as the higher byte of a double-byte code. Existing program code for text processing therefore cannot be used and must be thoroughly rewritten.

"Special" characters (for example characters with diacritics) form a minority in most languages. For the majority of texts in those languages ASCII is sufficient, with occasional appearances of a few national characters. For such texts it is wasteful to use two bytes for each letter where one byte is fully sufficient. Moreover, not every medium supports binary transfer, and two raw bytes per character are not always a convenient form of record. In the course of time several standard forms of recording Unicode came into existence, each solving some of the problems of transferring and using Unicode.

Basic Forms of Unicode Record

 

UCS-2 – UCS-2 is the basic representation of Unicode characters: data are stored as a sequence of two-byte items. The end of a text string may be identified, for example, with the 16-bit NULL, 0x0000. Note that an eight-bit NULL (0x00) may appear as the higher or lower byte of a Unicode character number. The advantages of UCS-2 include constant character length and easy counting of the characters in a string; it is therefore ideal for the internal representation of Unicode characters in a program.

UTF-7 – UTF-7 is a form described in RFC 1642, a representation of Unicode characters intended primarily for electronic mail. Internet mail transport (see RFC 822) only guarantees seven-bit ASCII, with the MIME specification (RFC 2045 to RFC 2049) extending the support to selected eight-bit encodings. UTF-7 uses only ASCII values to record Unicode characters and is designed so that the encoded data remain partly readable for humans: the supported ASCII letters are not encoded and represent themselves, and the encoded text can pass through any system that handles plain ASCII. For the remaining characters an algorithm much resembling Base64 is used.

UTF-8 – This is the recommended way of representing ISO/IEC 10646 characters (UCS-2 and UCS-4), and it may also be used as a representation of Unicode. Its design had to satisfy the following requirements:

1.      Compatibility with older file systems. File systems usually do not allow a zero byte and the (back)slash in filenames.

2.      Compatibility with existing programs. The encoded form of a character should contain no ASCII byte unless the character itself is an ASCII character.

3.      Easy conversion from/into UCS.

4.      The first byte should determine the number of bytes that follow in a multiple-byte character record.

5.      The transformation format should not be wasteful in the number of bytes used for the representation.

6.      From any position in the data stream the beginning of the next character unit should be easy to find.

UTF-8 records UCS values in the range 0 to 0x7FFFFFFF using 1 to 6 bytes. The first byte determines the number of bytes used; all following bytes of a multi-byte sequence have the form 10xxxxxx. Any byte not beginning with the bits 10 therefore starts a new character.

It is very convenient that a normal ASCII text needs only a single byte per character, and any Unicode (16-bit) character may be recorded in UTF-8 with at most three bytes. European languages are, in addition, extremely lucky to have national characters with relatively low Unicode values, so their characters need at most two bytes in UTF-8.
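A minimal sketch of these rules for the 16-bit range used by Unicode at the time (1 to 3 bytes; the full scheme extends the same bit pattern up to 6 bytes):

      #include <stdio.h>

      /* Encode one UCS-2 value (0x0000 - 0xFFFF) into UTF-8.
         Returns the number of bytes written to out[0..2]. */
      static int ucs2_to_utf8(unsigned int c, unsigned char out[3])
      {
          if (c < 0x80) {                    /* 0xxxxxxx: plain ASCII */
              out[0] = (unsigned char)c;
              return 1;
          }
          if (c < 0x800) {                   /* 110xxxxx 10xxxxxx */
              out[0] = 0xC0 | (c >> 6);
              out[1] = 0x80 | (c & 0x3F);
              return 2;
          }
          out[0] = 0xE0 | (c >> 12);         /* 1110xxxx 10xxxxxx 10xxxxxx */
          out[1] = 0x80 | ((c >> 6) & 0x3F);
          out[2] = 0x80 | (c & 0x3F);
          return 3;
      }

      int main(void)
      {
          unsigned char buf[3];
          int n = ucs2_to_utf8(0x0161, buf); /* 0x0161 = s caron */
          for (int i = 0; i < n; i++)
              printf("%02X ", buf[i]);       /* prints: C5 A1 */
          printf("\n");
          return 0;
      }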

Unicode text recorded in UTF-8 may be handled like any other eight-bit text. Nothing needs special treatment, and no special programming code is needed for Unicode handling either (perhaps with the exception of the part of the program handling character display). UTF-8 is therefore establishing itself as the universal format for document exchange in Unicode. The UTF-8 format eliminates all the inconveniences of Unicode: it keeps compatibility with existing program code while enabling applications to use all the advantages of a universal international character set.

References

Graphic representation of the Roadmap to Plane 1 of ISO/IEC 10646 and Unicode

Unicode – A Way Out of the Chaos of Character Encoding (in Czech: Unicode – cesta z chaosu kódování znaků)

Unicode Glossary

Unicode - Technical Introduction

Unicode Consortium home page – http://www.unicode.org/

Unicode specification, description of individual characters, used ranges of values – http://www.unicode.org/unicode/standard/standard.html