Re: Arabic cursive in Unicode



Danny wrote:
Ruud Harmsen wrote:
21 Nov 2006 20:33:38 -0800: "Peter T. Daniels"
<grammatim@xxxxxxxxxxx>: in sci.lang:

I mean exactly what I think you think I mean :) I mean the four (or
two, or one) variants of a letter (or ligature) that are displayed
according to its position within a word. I'm pretty sure the term is
the one used by Unicode - I was reminded to use it by Andreas' post.
I'm sorry, but I have no idea what you mean. "Presentation form" is not
a term used in the study of writing systems, or of Arabic, or in
typography.
But it is in Unicode. http://rudhar.com/lingtics/uniclnks.htm
0600 is Arabic, FB50 is Arabic Presentation Forms-A and FE70 is Arabic
Presentation Forms-B. The question may be: should these be used, and
if so, how?

As far as I can see, the answer is that the logical (0600) characters
should be used for data storage (e.g. a digital file) while the
presentation forms should be used for display (print or screen). The
logical characters only have a visual form for convenience.

As you say, the 0600 range of codes represents abstract "characters", not concrete glyphs. In Arabic, glyph forms are context-dependent. The "Presentation forms" provide glyphs for pretty much the full variety of expected contexts.
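
To make the character/glyph relationship concrete, here is a small Python sketch using the standard unicodedata module. It assumes only what the Unicode Character Database records: each Presentation Forms-B code point has a compatibility mapping back to its 0600-range letter, so NFKC normalization recovers the logical character. The choice of BEH as the example letter is mine.

```python
import unicodedata

# The four Presentation Forms-B code points for BEH
# (isolated, final, initial, medial), chosen here as an example.
beh_forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

for ch in beh_forms:
    # NFKC folds a presentation form back to its logical character.
    logical = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}"
          f" -> U+{ord(logical):04X} {unicodedata.name(logical)}")
```

All four lines should end in the same logical letter, U+0628, which is the sense in which the 0600 code is the "character" and the FE70-range codes are its glyph variants.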

Input methods must deal with this. For example, on the Mac, as you type an Arabic character, you see an isolated form or a final form.

If the preceding character is non-alphabetic, you see the isolated form. Otherwise, you see the final form, which connects to the preceding character if appropriate.

When another alphabetic character is typed, the previous final form changes automatically to a non-final form (if such a form exists for that character). The newly-typed character will be a final form.

In TextEdit on the Mac, the glyphs that appear on screen are from the Presentation Forms-B list.

However, when the text is stored as plain text, what is stored is in the 0600 range. This means that the display software has to look at the context of the "logical character codes" and select the appropriate "presentation glyphs" every time the text has to be re-displayed.
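
As a rough illustration of that selection step, here is a minimal Python sketch. It is not how TextEdit or any real rendering engine does it; the function names build_form_table and shape are hypothetical. It assumes a letter "joins forward" if Unicode lists an initial form for it and "joins back" if it has a final form, and it ignores diacritics, tatweel and the lam-alef ligatures, all of which a real shaper must handle.

```python
import unicodedata

FORM_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")

def build_form_table():
    """Map logical letter -> {form name -> Presentation Forms-B character}."""
    table = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter entries such as '<initial> 0628'; skip ligatures.
        if len(parts) == 2 and parts[0] in FORM_TAGS:
            logical = chr(int(parts[1], 16))
            table.setdefault(logical, {})[parts[0].strip("<>")] = chr(cp)
    return table

def shape(text, table):
    """Replace each 0600-range letter with a context-dependent form."""
    def joins_back(ch):      # can connect to the preceding letter
        return "final" in table.get(ch, {})

    def joins_forward(ch):   # can connect to the following letter
        return "initial" in table.get(ch, {})

    out = []
    for i, ch in enumerate(text):
        forms = table.get(ch)
        if forms is None:            # not an Arabic letter we know: pass through
            out.append(ch)
            continue
        before = i > 0 and joins_forward(text[i - 1]) and joins_back(ch)
        after = i + 1 < len(text) and joins_forward(ch) and joins_back(text[i + 1])
        if before and after:
            tag = "medial"
        elif before:
            tag = "final"
        elif after:
            tag = "initial"
        else:
            tag = "isolated"
        out.append(forms.get(tag, ch))
    return "".join(out)

table = build_form_table()
stored = "\u0643\u062A\u0627\u0628"   # KAF TEH ALEF BEH, in logical order
shaped = shape(stored, table)
for ch in shaped:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Both strings stay in logical (typing) order; right-to-left layout is a separate step. Note that the last letter comes out isolated rather than final, because ALEF has no form that connects to a following letter.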

It would be up to the software whether to give the user the option of displaying the ligatures from the Presentation Forms-A list. I suspect that this kind of option would only be available in specialized Arabic-language input software. On the Mac, the Character Palette can be used to input glyphs.

It's worth mentioning that not all fonts handle the Presentation Forms-A ligature glyphs as I would have expected. For example, Unicode character FC0B "Arabic Ligature Teh With Jeem Isolated Form" shows the "Teh" inverted above the "Jeem" in the Al Bayan font family, but not in the Geeza Pro family, which shows the same thing you'd get from the Teh-Jeem combination in Presentation Forms-B.
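
If you want to check what a given ligature code point is supposed to stand for, independent of how any font draws it, a short Python sketch like the following will do. It only reads the Unicode Character Database through the standard unicodedata module and says nothing about Al Bayan or Geeza Pro.

```python
import unicodedata

lig = "\uFC0B"
print(f"U+{ord(lig):04X} {unicodedata.name(lig)}")

# NFKC expands the ligature into the logical letters it represents.
letters = unicodedata.normalize("NFKC", lig)
for ch in letters:
    print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The expansion is the two 0600-range letters Teh and Jeem, so whether they are stacked or joined side by side is purely a decision of the font.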

If you are dealing with keyboard input that is not handled by your OS, what you probably need is a set of context rules for converting a string of 0600-range codes into FE70-range codes for display. If you are just storing unmodifiable strings of canned text for display, then you could probably store the FE70-range codes, once you figure out which ones are appropriate.

Where is your Arabic text coming from in the first place? You say elsewhere: "It wouldn't be practical to get a native speaker of every language we want to support." How much text are you talking about?

--
Mike Wright
http://www.raccoonbend.com