Re: Arabic cursive in Unicode



Peter T. Daniels wrote:
Mike Wright wrote:
Peter T. Daniels wrote:
Mike Wright wrote:
Peter T. Daniels wrote:
Andreas Prilop wrote:
On 21 Nov 2006, Peter T. Daniels wrote:

I'm sorry, but I have no idea what you mean. "Presentation form" is not
a term used in the study of writing systems, or of Arabic, or in
typography.
"I'm sorry, officer, but I have no idea what you mean. 'Speed limit' is not
a term used in the study of writing systems, or of Arabic, or in
typography."
And if Netscape 3.04 would run on this Windows XP machine, I'd still be
using it.

Did you fail to notice that the original question was about the
typography of Arabic writing?

It turned out to be some Unicode nonsense. Unicode clearly got off to
an unfortunate start, and a lot of the difficulties with Unicode result
from trying to work around the messes it got stuck with at the start.
In the case of Arabic, the solution seems entirely sensible.
What I'm gathering from this thread is that there are two completely
separate groups of characters: the ones that you below call "logical
characters" and the ones I would call "allographs."

But one set of the allographs -- what Arabic grammars call "independent
forms" (i.e. unconnected on either side) -- should be identical to the
"logical characters."
In a way, the "logical characters" have no need of any concrete form. A
"logical character" can be thought of as an abstraction that represents
the set of all forms of that "character". Isn't that what a grapheme is?
I guess that "glyph" is really too general a term for positional
variants, and allograph would be the correct term.

Yet you tell me they've given them a concrete form ...

Yes, for reference. There is no reason, however, that a font could not omit glyphs in the 0600 range. And the Mac Character Palette could just as well display only the character name in that range--or it could display the isolated forms drawn from the Presentation Forms ranges, even if no fonts had glyphs in the 0600 range.

It's obvious that this is not what is actually happening with the Character Palette, but there's no reason why it couldn't be done that way. After all, none of my installed fonts have Byzantine Musical Symbols, but when I click on placeholder squares in the 1D000-1D0FF range, I see names to go with each code number, like 1D0A6 BYZANTINE MUSICAL SYMBOL MARTYRIA TRITOS ICHOS.

Given how badly the Unicode names suck, though, glyphs for reference are probably much more useful than names.

If you tell someone to write the character /ba:?/, which allograph
should they write? It depends on the context, doesn't it? Likewise, the
code 0628 does not represent an allograph, any more than does the sound
/ba:?/.

They will, of course, write the independent (isolated) form, a bowl
balanced on a dot.

Of course? Really? Regardless of context? Even if you say: "To write the word for 'sea', write /ba:?/, /Ha:?/, /ra:?/."? I think they would be making a big mistake if they didn't take context into account.

I see, by the way, that _The World's Writing Systems_ shows all four variants in all of its tables. (Not to imply that the folks behind those articles, or the book itself, are authorities.)

Of course, for reference purposes, a listing of the names of the
graphemes might show just one of the allographs. And, since every
grapheme has at least an "independent form" (which Unicode refers to as
"isolated"), that is the allograph that makes sense. The same goes for a
listing of Unicode codes. It would not be wrong, however, to show *all*
of the allographs for each character--depending on the purpose of the
listing.

It would be impossible to show _all_ the allographs (for the same
reason you can't list all the allophones -- they involve individual
variation).

Different glyphs may be required for initial, medial, final, and
isolated forms of each character. A font must store all those glyphs,
and software must be able to specify the appropriate glyphs based on
context.

However, it doesn't make sense for the user to have to learn to type up
to four different glyphs for each logical character. It's simpler--and
faster--for the user to just type the logical characters and for the
software to figure out the appropriate glyphs based on context.

Also, from the standpoint of software, it's probably best to store the
characters in a glyph-independent format. This makes it possible for
software to quickly switch between a typewriter style, stringing one
glyph after another on a line, and a style that's a bit like
handwriting, with a more diagonal stacking of glyphs within words
(assuming a font that supports the difference).

Anything that requires parsing of Arabic strings should operate on the
logical characters, rather than on glyphs that might represent two or
more characters, so it makes more sense to store the logical characters
from this standpoint, as well.
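As a rough illustration of the parsing point, here is a minimal Python sketch. It assumes that Unicode compatibility normalization (NFKC) is an acceptable way to get from presentation-form codes back to the logical characters; the helper name to_logical is made up for this example. A search for alif fails on the displayed laam-alif ligature glyph but succeeds once the string is reduced to logical characters.

```python
import unicodedata

def to_logical(text):
    # Hypothetical helper: NFKC folds Arabic presentation forms
    # (including ligatures) back to their base letters.
    return unicodedata.normalize("NFKC", text)

ALEF = "\u0627"        # ARABIC LETTER ALEF
displayed = "\uFEFB"   # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

logical = to_logical(displayed)

print(ALEF in displayed)  # the ligature glyph hides the alif
print(ALEF in logical)    # the logical string exposes it
print([f"{ord(c):04X}" for c in logical])
```

The ligature is one code in the displayed string but two in the logical string, which is the reason searching and sorting are simpler on the 0600-range characters.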
You haven't defined which of your levels is "presentation characters,"
and "logical characters" is a very strange term.

Can you do it with "grapheme" and "allograph"?
Is it correct to say that we speak allophones, not phonemes, and that we
write allographs, not graphemes?

Yes

If so, then "grapheme" works for "logical character", and this is what
is covered by the 0600 set. (I was just following the original poster in
using that term.) "Allograph" works for the actual forms, which are
covered by the two "Presentation" sets.

So (I'm looking at Unicode Version 1 Manual vol. 1, pp. 218 and 552)
0600 0628 and 0600 FE8F are identical in form, different in function.
Was that an efficient way to do it?

Yes. A typist using Arabic-QWERTY input on the Mac can type the <b> key on the keyboard and the software will see it as 0628. Then it will cause the display of one of the four glyphs, FE8F, FE90, FE91, or FE92, based on context. It's fast and efficient.
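The selection step can be sketched in a few lines of Python. This is only a toy, not how any real rendering engine works: the FORMS table covers just four letters, the function names are invented for the example, and the joining rule is reduced to "a letter with an initial form can connect to whatever follows it".

```python
# Hypothetical mini-table: logical code -> (isolated, final, initial, medial).
# None marks a form the letter does not have (it never joins to the left).
FORMS = {
    0x0627: (0xFE8D, 0xFE8E, None, None),      # ALEF
    0x0628: (0xFE8F, 0xFE90, 0xFE91, 0xFE92),  # BEH
    0x062D: (0xFEA1, 0xFEA2, 0xFEA3, 0xFEA4),  # HAH
    0x0631: (0xFEAD, 0xFEAE, None, None),      # REH
}

def joins_forward(cp):
    """True if this letter can connect to the letter that follows it."""
    entry = FORMS.get(cp)
    return entry is not None and entry[2] is not None

def shape(text):
    """Map logical characters to presentation-form codes by context."""
    cps = [ord(c) for c in text]
    out = []
    for i, cp in enumerate(cps):
        if cp not in FORMS:
            out.append(cp)  # not in the toy table: pass through unchanged
            continue
        isolated, final, initial, medial = FORMS[cp]
        prev_joins = i > 0 and joins_forward(cps[i - 1])
        next_joins = (joins_forward(cp) and i + 1 < len(cps)
                      and cps[i + 1] in FORMS)
        if prev_joins and next_joins:
            out.append(medial)
        elif prev_joins:
            out.append(final)
        elif next_joins:
            out.append(initial)
        else:
            out.append(isolated)
    return "".join(chr(cp) for cp in out)

# The word for 'sea': BEH, HAH, REH typed as logical characters.
print([f"{ord(c):04X}" for c in shape("\u0628\u062D\u0631")])
```

The typist supplies only 0628, 062D, 0631; the choice among initial, medial, and final is made entirely from the neighbors.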

Otherwise, the typist would have to take responsibility for producing the appropriate contextual variant. It would be necessary to use four different keystrokes or keystroke combinations for each grapheme. The chances of typos would be greatly increased, it would be harder to learn to type in Arabic, and typing would probably be much slower than it is now. Try using a standard US keyboard input method to type some text where three out of four characters require the shift key, the option key, or the command key--in no particular sequence. I don't think you'd find it efficient.

You might think that they could just use the codes for the various isolated forms in the Presentation Forms sets for input and storage, but there are likely to be some inefficiencies associated with that.

From a programming standpoint, it's very easy to look at a character code and know that, because it is greater than 05FF and less than 0700, it represents a grapheme that needs to be converted for display. Otherwise, some kind of table of which codes are doing double duty as grapheme and allograph would be required, and it might require constant reference to that table while processing text. It might even require the maintenance of separate graphemic and allographic strings in memory, which could be orders of magnitude less efficient in terms of memory usage than maintaining a separate set of codes for graphemes. (Whether to place actual glyphs at those codepoints in a particular font is a font designer decision, and has nothing to do with Unicode as such.)
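The range test described above is a one-liner. A minimal sketch in Python, with a made-up function name; note that it is only a first-pass filter, since the 0600 range also holds things like U+060C ARABIC COMMA that have no contextual forms.

```python
def needs_shaping(ch):
    # Greater than 05FF and less than 0700: a logical Arabic character
    # that still has to be converted to a contextual glyph for display.
    return 0x05FF < ord(ch) < 0x0700

for ch in ("\u0628", "\uFE8F", "b"):
    print(f"{ord(ch):04X}", needs_shaping(ch))
```

No lookup table is consulted: the code number alone says whether the character is graphemic (0628) or already allographic (FE8F).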

In programming, comprehensibility commonly turns out to be much more important to overall efficiency than either economy of storage or apparent simplicity of structure. In the long run, ambiguity is dangerous and leads to errors--which can be even worse than mere "inefficiency". Separating the graphemic from the allographic reduces ambiguity, increases clarity, and is, thus, more efficient in the long run.

If I had to write text-handling software with Arabic input, storage, and display capabilities, I'd prefer the current Unicode approach over anything else I can come up with.

Do <A> and <a> have some platonic representation of "first roman
letter" with the appropriate variant chosen contextually?

I had to jump ahead a couple hundred pages in Gleason to find out about this. He says (p. 410) that "capitalization is in many ways comparable to a 'suprasegmental' phoneme." Based on his treatment of Greek Σ, it appears that we can say that A consists of two graphemes, <a> and <capitalization>. Does this mean that there is no <A> grapheme? Or can one grapheme consist of two other graphemes? I'm confused. (The sci.lang Introduction to Descriptive Linguistics course is slow, but it's free.)

So, do we say that ﺑ is <ﺏ> plus <initial>, ﺒ is <ﺏ> plus <medial>, and ﺐ is <ﺏ> plus <final>? Do we avoid saying that ﺏ is <ﺏ> plus <isolated> by providing a general rule that the citation form is the isolated form? This seems like a reasonable approach in linguistics, but not necessarily in computer science.
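As it happens, the Unicode character database records something very close to this "grapheme plus position" analysis: each presentation form carries a compatibility decomposition made of a position tag and the base letter. A small Python sketch using the standard unicodedata module shows the raw entries (the loop and its formatting are just for illustration):

```python
import unicodedata

# The four presentation forms of BEH, in code order.
for cp in (0xFE8F, 0xFE90, 0xFE91, 0xFE92):
    # decomposition() returns the raw database field,
    # e.g. a tag such as <initial> followed by the base code.
    print(f"{cp:04X}", unicodedata.decomposition(chr(cp)))
```

Unlike the linguistic convention suggested above, the database tags the isolated form explicitly rather than treating it as the untagged default.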

The "Presentation Forms-B" set seems to be mostly the standard
positional allographs required for the basic Arabic alphabet. There are
a few exceptions, such as variations on alif and laam-alif. I tend to
think of laam-alif as a ligature.

The "Presentation Forms-A" set seems to cover the allographs of the
graphemes of non-Arabic languages, as well as a number of ligatures. For
example:

FBA0 ARABIC LETTER RNOON ISOLATED FORM
FBD4 ARABIC LETTER NG FINAL FORM
FBB1 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE FINAL FORM
FC0B ARABIC LIGATURE TEH WITH JEEM ISOLATED FORM
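Anyone who wants to check name/code pairs like these without a printed chart can go from name to code with Python's standard unicodedata module. A minimal sketch; the list of names is just the four examples above:

```python
import unicodedata

NAMES = [
    "ARABIC LETTER RNOON ISOLATED FORM",
    "ARABIC LETTER NG FINAL FORM",
    "ARABIC LETTER YEH BARREE WITH HAMZA ABOVE FINAL FORM",
    "ARABIC LIGATURE TEH WITH JEEM ISOLATED FORM",
]

# lookup() raises KeyError for a name the database does not know.
codes = {name: ord(unicodedata.lookup(name)) for name in NAMES}

for name, cp in codes.items():
    print(f"{cp:04X} {name}")
```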

I hadn't realized how many non-Arabic characters there are. I'd love to
see a single listing of what sounds they represent in various languages.
The Unicode names do tend to suck. I have no idea what "RNOON" or
"PEHEH" might refer to.

See The World's Writing Systems for all the languages that use Arabic
script nowadays, with a couple extra.

Right. Once I knew to look under Sindhi for FBA0 and FC0B, I was able to spot them. It's rather inconvenient, though.

So, can you add a single chart with Unicode numbers to the next edition? (And will there be a good upgrade price for current owners?)

--
Mike Wright
http://www.raccoonbend.com