Multilingual Support

Note: Before making changes to this page, please read the guidelines to SqueakCentralProjects. Thanks.

Goal
The goal is to provide multibyte character and multilingual handling support for Squeak. Because the XML specification requires ISO-10646 support, this project may become the basis for XML processors written in Squeak.

The experimental version is available at http://www.is.titech.ac.jp/~ohshima/squeak/.

Estimated completion date: March 31st, 2000

Status
[August 14] project declared to exist

Implementors: Yoshiki Ohshima
Q/A, Integration:

General Commentary
A list of items (most recent last) of the form: [date, author] comment. Anyone is invited to add to this list. The implementors are free to remove and summarize any items more than two weeks old.

[082599, Mark Wai] Based on past discussion in this thread, I am not yet convinced of, nor do I understand, which direction we want to take with multilingual support in Squeak. I suggest that we define a clear goal (e.g. how do we define multilingual support in an environment) before we bury our heads too deep in implementation.

[Aug 30 '99, Yoshiki Ohshima] As a "Central" Project, I think that the changes should be somewhat modest (although it may be too conservative to move to another "plane"). So what I'm thinking now is: Character, String, and Symbol are kept untouched, and multi-byte characters are added for literals. And to keep the "dot-identical" property of Squeak, the character appearance (glyph) should not depend on external information such as locale or environment settings. This means the internal encoding should carry that information, and it should be an aggregation of the domestic encodings.

However, the dominance of Unicode cannot be neglected, so I'm also thinking of including Unicode as one of the "encodings" and providing conversion from/to Unicode.

Any other ideas and comments?
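An internal encoding that carries its own character-set information can be pictured as a code paired with a tag naming the domestic encoding it came from. The following is only a minimal sketch in Python, not the project's implementation. The tag numbers, the 16-bit code width, and the function names are all hypothetical.

```python
# Hypothetical charset tags; the numbering is invented for illustration.
CHARSET_TAGS = {"ascii": 0, "jisx0208": 1, "gb2312": 2, "ksc5601": 3, "unicode": 255}
TAG_NAMES = {tag: name for name, tag in CHARSET_TAGS.items()}

CODE_BITS = 16  # assumption: every code within a charset fits in 16 bits


def make_char(charset, code):
    """Pack a charset tag and a code within that charset into one integer."""
    if not 0 <= code < (1 << CODE_BITS):
        raise ValueError("code does not fit in %d bits" % CODE_BITS)
    return (CHARSET_TAGS[charset] << CODE_BITS) | code


def charset_of(value):
    """Recover the charset name from a packed character value."""
    return TAG_NAMES[value >> CODE_BITS]


def code_of(value):
    """Recover the code within its charset from a packed character value."""
    return value & ((1 << CODE_BITS) - 1)
```

With this layout, make_char("jisx0208", 0x3021) and make_char("gb2312", 0x3021) are different values even though the two-byte code is the same, so a renderer could choose the glyph from the value alone, without consulting a locale.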

[1999-09-12 Todd Blanchard] I'm looking into using Squeak to do some multilingual stuff and I need more encoding support than just JIS. I think it would be worthwhile to steal a page from the NextStep and Java people and standardize on UTF-8 as the persistent storage format and use UCS2 as the in-memory format. UCS2 has the nice property of having fixed-size characters (2 bytes), which simplifies in-memory manipulations. UTF-8 has the advantage of being single-byte in the ASCII range, so reading ASCII in with a UTF-8 read routine works fine. If we use Unicode as a pivot, then resources for converting to/from various other encodings are readily available.

This also implies that we would want a small family of String subclasses, with String itself being an abstract class. I'm looking at the implementation of String to see how hard it would be to replace all current String instances with a subclass of String called ASCIIString. Can anybody tell me how big a "word" is in Squeak?
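The storage and in-memory split proposed above can be illustrated with a short Python sketch. It is not Squeak code, and the function names are made up for this example. It relies only on Python's standard codecs ("utf-8", "shift_jis", "euc_jp"). Note that UCS2 covers only the Basic Multilingual Plane, so the sketch rejects code points above 0xFFFF.

```python
from typing import List


def read_utf8(data: bytes) -> str:
    """Decode stored bytes; pure ASCII input decodes unchanged."""
    return data.decode("utf-8")


def to_ucs2_units(text: str) -> List[int]:
    """Return one 16-bit unit per character (fixed width, BMP only)."""
    units = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            raise ValueError("U+%X is outside the range of UCS2" % code_point)
        units.append(code_point)
    return units


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert between two encodings, using Unicode as the pivot."""
    return data.decode(source).encode(target)
```

For example, the text "日本語" occupies nine bytes when stored as UTF-8 but becomes three fixed-width units after to_ucs2_units(read_utf8(...)), and transcode(data, "shift_jis", "euc_jp") converts between two Japanese encodings without a dedicated Shift_JIS-to-EUC table.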

[23-Sep-99/ hh] Summary of discussion on the Squeak mailing list: Unicode support

[11-Oct-99/ hh] Screenshots of Multilingual Support - First Version

[29-Dec-99/ Paolo Bonzini] Hmmm. I'm a bit late, but regarding the note above on 'ASCIIString', I'd think it would be ok to have String implement a generic 8-bit encoding, and have a CharacterArray as the abstract superclass of String, UnicodeString, and JisString. Maybe MultiString as an abstract superclass of JisString, Big5String, etc.
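The hierarchy suggested in this last comment can be sketched as follows. This is an illustration in Python rather than Squeak, and it reflects one reading of the proposal. The class names come from the comment, but the max_code method and the specific code limits are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class CharacterArray(ABC):
    """Abstract superclass: an indexable sequence of character codes."""

    def __init__(self, codes):
        self._codes = list(codes)
        for code in self._codes:
            if not 0 <= code <= self.max_code():
                raise ValueError("code %d out of range for %s"
                                 % (code, type(self).__name__))

    @classmethod
    @abstractmethod
    def max_code(cls):
        """Largest code an element may hold (hypothetical helper)."""

    def __len__(self):
        return len(self._codes)

    def __getitem__(self, index):
        return self._codes[index]


class String(CharacterArray):
    """Generic 8-bit encoding."""

    @classmethod
    def max_code(cls):
        return 0xFF


class UnicodeString(CharacterArray):
    """Assumption: 16-bit units, as in the UCS2 proposal."""

    @classmethod
    def max_code(cls):
        return 0xFFFF


class MultiString(CharacterArray):
    """Abstract superclass for multi-byte domestic encodings."""


class JisString(MultiString):
    @classmethod
    def max_code(cls):
        return 0xFFFF  # assumption: two-byte codes


class Big5String(MultiString):
    @classmethod
    def max_code(cls):
        return 0xFFFF  # assumption: two-byte codes
```

Because MultiString does not define max_code, it stays abstract and cannot be instantiated, while existing 8-bit code would keep working against String unchanged.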