Multilingual Support - Notion of a word

[23-Sep-1999 / hh] The following email states that the notion of a word might not be obvious in all languages. (not yet summarized)

Title Unicode support
Author Todd Blanchard

Subject: Re: RE: Unicode support
Date: Wed, 22 Sep 1999 13:57:02 -0700
From: Todd Blanchard

> > Uh, some languages do not have delimiters between words at all.
> Neat. Hebrew, of course, is not one of them. (What languages apart from
> ideographic languages don't use delimiters?) Further, such a class doesn't
> need the notion of tokenization, by definition, I suppose, unless there are
> END-OF-TOKEN forms of letters, in which case the model of tokens suggested would
> not be applicable. They would, of course, resolve that issue in a subclass
> implementation that ignores the token parameter.

German has the lovely habit of running multiple words together to
make bigger and bigger words. You often want to navigate on the
constituent words in the superword. There are software text editors
that do this correctly. Some East Asian languages also have
different ways of busting up things into words that don't relate to
whitespace.

> > But rather than get pedantic about that - let's divorce "tokenization"
> > from the concept of word-spotting.
>
> Why? I think the point that was made here by others, and with which I agree, is that
> tokenization is an appropriate string operation, and semantic
> "word-spotting" is probably not.

I don't agree.

String is a mechanism for representing language and languages
typically have words. Tokens are something else - more
arbitrary. We got here because of this:

> A newbie recently asked how to compute the equivalent of:
>
> word 4 of line 7
>
> and
>
> set word 4 of line 7 to "foobar"

Which I think is certainly an appropriate operation for String.

OTOH, this operation cannot always be as simple-minded as
delimiter-based tokenization.

Although - by your definition of what a string is - perhaps that's
not appropriate.
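
For illustration only, here is a minimal Python sketch of the "simple-minded" delimiter-based reading of "word 4 of line 7". The function names `get_word` and `set_word` are hypothetical, and splitting on whitespace is an assumption that holds only for languages that delimit words that way.

```python
def get_word(text, line_no, word_no):
    """Return word `word_no` of line `line_no` (both 1-based), splitting on whitespace."""
    line = text.splitlines()[line_no - 1]
    return line.split()[word_no - 1]


def set_word(text, line_no, word_no, new_word):
    """Return a copy of `text` with one word replaced.

    Whitespace inside the edited line is normalized to single spaces.
    """
    lines = text.splitlines()
    words = lines[line_no - 1].split()
    words[word_no - 1] = new_word
    lines[line_no - 1] = " ".join(words)
    return "\n".join(lines)


sample = "the quick brown fox\njumps over the lazy dog"
print(get_word(sample, 2, 4))
print(set_word(sample, 2, 4, "foobar"))

# A whitespace splitter sees a German compound as one word and
# cannot navigate to its constituent words.
print("Donaudampfschiff fährt".split())
```

The last line shows the limitation under discussion: the compound stays a single token, so delimiter-based splitting alone does not give word-level navigation inside it.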

>
> > hebr3 "This is a fine mess" hebr2 hebr1 hebr0
> > if you iterate through the tokens, in what order would you expect to get
> > tokens?
>
> You are imposing a particular sequencing of information onto the string based
> upon the semantics of an underlying language, and then asking me to describe the
> tokens that are derived. My suggestion is that if you want the tokens to be
> semantically meaningful, then your program (or a subclass of string) must first
> organize the sequence of characters so that the purely mechanical,
> non-semantic sequencing will yield a meaningful result. Thus, the answer is
> this: they would be the tokens, read left to right or right to left in sequence, as
> defined by the delimiters.

Which is probably useless for anything but single-direction languages.

> It is not the duty of the String object to understand the semantics of the
> underlying language in which characters are represented, but only to provide
> underlying operations in which most reasonable operations (including
> semantics-based operations) might be accomplished.

Well, which is it? String is either a class for representing chunks
of languages, or it's a mechanism for representing arrays of
characters (whatever those are - a whole other topic).
I think String is implemented as the latter but used as the former,
and we English speakers are lucky in that these just happen to
coincide. Unfortunately, the coincidence is a rather lucky fluke
with English and not something you can rely on globally.

eTranslate, Inc. The Power of Language
Todd Blanchard main +1.415.487.7850
Chief Technology Architect fax +1.415.371.0010
http://www.etranslate.com/

From: "Peter William Lount"
Subject: Re: Unicode support
Date: Tue, 21 Sep 1999 20:03:11 -0700

Arrays are not suitable for GeneralStrings because they are not easily
expandable in size. Strings must be growable in size.

OrderedCollection would be a better basis for a GeneralString, or even
for a GeneralString superclass.

>Am I misunderstanding Peter's suggestions, or did he
>mean to say 4-bytes per CHARACTER in a string,
>as opposed to 1 byte per string. As understood,
>all string objects are themselves objects: it is only
>their contents that are byte data.

Yes, I clearly meant to say 4 bytes per Character object in a string, contrasted with 1 byte per byte character code in a string. The 4 bytes are for the object pointer to the character objects (or any other object that understands the character protocols, which can make things interesting). The character objects themselves would take up the normal space that any object takes with its instance variables.

It's interesting that Mike Klein mentions "words". One of the reasons to move to a full object model for strings is that it would allow "objects" of many kinds to be put into strings as long as they understand "character object protocols". For example, a "word object" that represents "hi", a character object that represents the "space" character, and a word object that represents "there" would occupy only three 32-bit pointers, or 12 bytes, in an object-based string. In a byte string these would occupy 8 characters. Byte encoding wins the space race with short strings. But as you add words to the "symbol set" or "dictionary of words", the efficiency of storing "words" in strings would mean that "generic strings" could use less storage than "byte encoded strings". The storage space saved could be quite large.
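
The arithmetic above can be checked with a short Python sketch. It assumes 4-byte (32-bit) object pointers and a single-byte character encoding, as in the message. The function names are hypothetical, and the cost of the shared word objects themselves is deliberately left out.

```python
POINTER_BYTES = 4  # assumption: 32-bit object pointers, as in the message


def byte_string_size(text):
    """Bytes used by a byte string: one byte per character code."""
    return len(text)


def word_string_size(elements):
    """Bytes used by an object string: one pointer per word or character object."""
    return POINTER_BYTES * len(elements)


# The "hi there" example: the byte string is smaller.
print(byte_string_size("hi there"))            # 8
print(word_string_size(["hi", " ", "there"]))  # 12

# With longer words the pointer-based form is smaller.
sentence = ["internationalization", " ", "considerations"]
print(byte_string_size("".join(sentence)))     # 35
print(word_string_size(sentence))              # 12
```

Any real saving depends on how often each word object is shared between strings, since each distinct word still has to be stored once somewhere.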

In a sense, strings or objects that behave as strings or characters could be nested within strings. One of the protocols that "character objects" would need to respond to is "bottom out in characters" - that is, replace yourself with the characters that represent you (thus eliminating the nesting and flattening out the string). The concept of a word is just a group of characters that may or may not have some meaning to us humans.
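
One way to picture the "bottom out in characters" protocol is the Python sketch below. The class names `Char` and `Word`, the method name `characters`, and the helper `word` are all hypothetical. The only idea taken from the message is that every element can replace itself with the characters that represent it.

```python
class Char:
    """A single character object."""

    def __init__(self, ch):
        self.ch = ch

    def characters(self):
        # A character bottoms out in itself.
        return [self.ch]


class Word:
    """A nested element holding characters, words, or larger groups."""

    def __init__(self, *elements):
        self.elements = list(elements)

    def characters(self):
        # Recursively flatten every nested element into plain characters.
        out = []
        for element in self.elements:
            out.extend(element.characters())
        return out


def word(text):
    """Build a Word from a plain Python string (convenience helper)."""
    return Word(*(Char(c) for c in text))


phrase = Word(word("hi"), Char(" "), word("there"))
print(len(phrase.elements))           # 3 top-level elements
print("".join(phrase.characters()))   # hi there
```

Because `Word` can hold other `Word` objects, the same flattening works unchanged for phrases, sentences, or paragraphs nested to any depth.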

Simply put, by using "word objects" a generic string object can be made much more space efficient than any potential character-based string - byte or object oriented. Thanks for reminding me of this one, Mike.

Furthermore, word objects work fine with any human language that forms "characters" into words. Why stop at words when clumps of words or phrases and entire sentences and paragraphs could be nested and nested and ....

Another advantage of storing words in strings is that they are already in "token" format and might provide some time savings for parsing and lexical processing.

Yet another idea for words is the "dynamic words" I mentioned in an earlier message on this topic. A placeholder "word" object could be put into a string. This placeholder object is actually a "variable word" that is linked to some "source" which supplies the "characters" that make up the "dynamic variable word". Say, a "total amount" for an "invoice". When the string is displayed, it shows the "current total amount" as number characters provided by the "accounting source object" when the "dynamic variable word" is asked for its characters. Formatting information could also be present in an "environment" that is passed into the "display" method. This would allow for different formatting based on "preferences or localizations based on the country or language choices of the user". If the user clicks on the "characters" that make up this "dynamic variable word", the system could find that they are really clicking on the "total amount" word object, which is linked into the accounting objects in the system. This allows the user and the system to quickly get to the objects behind the "graphical user interface" and activate appropriate windows or whatever....
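
A rough Python sketch of such a "dynamic variable word" follows. Everything named here (`DynamicWord`, `Invoice`, `render`, and the `format_number` key of the environment) is an assumption made for illustration. The message specifies only a source that supplies the characters and an environment that controls formatting.

```python
class DynamicWord:
    """Placeholder whose characters come from a source at display time."""

    def __init__(self, source, name):
        self.source = source  # zero-argument callable supplying the current value
        self.name = name      # lets a user interface find the object behind the text

    def characters(self, environment):
        # The environment decides how the current value is formatted.
        return environment["format_number"](self.source())


class Invoice:
    """Stand-in for the accounting source object."""

    def __init__(self, lines):
        self.lines = list(lines)

    def total(self):
        return sum(self.lines)


def render(elements, environment):
    """Flatten plain strings and dynamic words into one display string."""
    parts = []
    for element in elements:
        if isinstance(element, DynamicWord):
            parts.append(element.characters(environment))
        else:
            parts.append(element)
    return "".join(parts)


def us_number(n):
    return f"{n:,.2f}"


def german_number(n):
    # Swap the roles of "," and "." in the US-style rendering.
    return us_number(n).replace(",", "X").replace(".", ",").replace("X", ".")


invoice = Invoice([10.5, 20.0])
text = ["Total: ", DynamicWord(invoice.total, "total amount")]

print(render(text, {"format_number": us_number}))      # Total: 30.50
invoice.lines.append(1000.0)                           # the source changes
print(render(text, {"format_number": german_number}))  # Total: 1.030,50
```

The string itself is never edited: the second rendering picks up the new total and the different number format purely through the source and the environment.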

There are also advantages to storing string information in a hierarchical format in a general object based string. This nesting is what XML and HTML essentially do. The string form is their flattened form, while a hierarchy form exists that lets you manipulate the structure represented by the "flattened string". Some food for thought as this could impact Smalltalk's success as an internet solution tool.

Actually, reflecting on this it might even be more space efficient for a PDA minimal footprint version of Smalltalk to use object based strings with "character and word objects" rather than just a byte encoded string approach. We will have to get our calculators out to test this idea.

Certainly Perl is powerful and is often touted as a "powerful string manipulation language". Its name is from "Practical Extraction and Report Language". Extraction of what? Text. Almost all of the Web technologies like HTTP, HTML, SGML, XML, FTP, SMTP, etc. are "text" or "string" based technologies. String manipulation languages and sublanguages like Perl and its "regular expressions" are very effective in dealing with these Internet technologies. I would consider these essential to any powerful string object.
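
As a small illustration of regular expressions applied to a text-based Internet protocol, the Python sketch below pulls the status line and headers out of an HTTP response. The sample response text is invented for the example, and the patterns are a simplification rather than a complete HTTP parser.

```python
import re

# Invented sample response; real responses carry more headers and a body.
response = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 1234\r\n"
    "\r\n"
)

# Status line: protocol version, three-digit code, reason phrase.
status = re.match(r"HTTP/(\d\.\d) (\d{3}) ([^\r\n]*)", response)
print(status.groups())

# Header lines: "Name: value", one per line.
headers = dict(
    re.findall(r"^([\w-]+):[ \t]*([^\r\n]*)", response, flags=re.MULTILINE)
)
print(headers["Content-Type"])
```

Two short patterns are enough to turn the flat text into structured values, which is the kind of conciseness the message attributes to Perl.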

Text parsing is also an area that can assist with Web technologies.

By strengthening and expanding Smalltalk's abilities to work concisely with
"text" information we can improve its success and usefulness in
implementing web and internet solutions. One of the reasons for Perl's
success is concise string manipulation.

All the best,

Peter William Lount
peter@smalltalk.org
http://www.smalltalk.org

p.s. Mike, what is WordNet?

p.p.s. From the Perl 1.0 man page.

NAME
perl - Practical Extraction and Report Language

DESCRIPTION
Perl is a interpreted language optimized for scanning arbitrary text
files, extracting information from those text files, and printing reports
based on that information. It's also a good language for many system
management tasks. The language is intended to be practical (easy to use,
efficient, complete) rather than beautiful (tiny, elegant, minimal). It
combines (in the author's opinion, anyway) some of the best features of C,
sed, awk, and sh, so people familiar with those languages should have
little difficulty with it. (Language historians will also note some
vestiges of csh, Pascal, and even BASIC-PLUS.) Expression syntax
corresponds quite closely to C expression syntax. If you have a problem
that would ordinarily use sed or awk or sh, but it exceeds their
capabilities or must run a little faster, and you don't want to write the
silly thing in C, then perl may be for you. There are also translators to
turn your sed and awk scripts into perl scripts. OK, enough hype.