Re: Arabic cursive in Unicode
- From: Mike Wright <news@xxxxxxxxxxxxxxx>
- Date: Fri, 24 Nov 2006 18:37:07 -0600
Peter T. Daniels wrote:
Mike Wright wrote:
Peter T. Daniels wrote:
Mike Wright wrote:
Peter T. Daniels wrote:
Andreas Prilop wrote:
On 21 Nov 2006, Peter T. Daniels wrote:
I'm sorry, but I have no idea what you mean. "Presentation form" is not
a term used in the study of writing systems, or of Arabic, or in
typography.
"I'm sorry, officer, but I have no idea what you mean. 'Speed limit' is not
a term used in the study of writing systems, or of Arabic, or in
typography."
And if Netscape 3.04 would run on this Windows XP machine, I'd still be
using it.
Did you fail to notice that the original question was about the
typography of Arabic writing?
It turned out to be some Unicode nonsense. Unicode clearly got off to
an unfortunate start, and a lot of the difficulties with Unicode result
from trying to work around the messes it got stuck with at the start.
In the case of Arabic, the solution seems entirely sensible.
What I'm gathering from this thread is that there are two completely
separate groups of characters: the ones that you below call "logical
characters" and the ones I would call "allographs."
But one set of the allographs -- what Arabic grammars call "independent
forms" (i.e. unconnected on either side) -- should be identical to the
"logical characters."
In a way, the "logical characters" have no need of any concrete form. A
"logical character" can be thought of as an abstraction that represents
the set of all forms of that "character". Isn't that what a grapheme is?
I guess that "glyph" is really too general a term for positional
variants, and allograph would be the correct term.
Yet you tell me they've given them a concrete form ...
Yes, for reference. There is no reason, however, that a font could not omit glyphs in the 0600 range. And the Mac Character Palette could just as well display only the character name in that range--or it could display the isolated forms drawn from the Presentation Forms ranges, even if no fonts had glyphs in the 0600 range.
It's obvious that this is not what is actually happening with the Character Palette, but there's no reason why it couldn't be done that way. After all, none of my installed fonts have Byzantine Musical Symbols, but when I click on placeholder squares in the 1D000-1D0FF range, I see names to go with each code number, like 1D0A6 BYZANTINE MUSICAL SYMBOL MARTYRIA TRITOS ICHOS.
Given how badly the Unicode names suck, though, glyphs for reference are probably much more useful than names.
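For what it's worth, the name-without-glyph idea is easy to try. Here is a minimal sketch using Python's standard unicodedata module; the helper name describe and the placeholder text are just made up for illustration, and the names you get depend on the Unicode data shipped with your Python.

```python
import unicodedata

def describe(cp: int) -> str:
    """Return 'XXXX NAME' for a code point, with a placeholder if it has no name."""
    # The second argument to unicodedata.name() is returned when no name exists.
    name = unicodedata.name(chr(cp), "<no name in this Unicode data>")
    return f"{cp:04X} {name}"

# No font is consulted here: only the character database is used.
print(describe(0x1D0A6))
print(describe(0x0628))
```

Nothing in this touches a font, which is the point: a palette could show the name alone when no glyph is available.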
If you tell someone to write the character /ba:?/, which allograph
should they write? It depends on the context, doesn't it? Likewise, the
code 0628 does not represent an allograph, any more than does the sound
/ba:?/.
They will, of course, write the independent (isolated) form, a bowl
balanced on a dot.
Of course? Really? Regardless of context? Even if you say, "To write the word for 'sea', write /ba:?/, /Ha:?/, /ra:?/"? I think they would be making a big mistake if they didn't take context into account.
I see, by the way, that _The World's Writing Systems_ shows all four variants in all of its tables. (Not to imply that the folks behind those articles, or the book itself, are authorities.)
Of course, for reference purposes, a listing of the names of the
graphemes might show just one of the allographs. And, since every
grapheme has at least an "independent form" (which Unicode refers to as
"isolated"), that is the allograph that makes sense. The same goes for a
listing of Unicode codes. It would not be wrong, however, to show *all*
of the allographs for each character--depending on the purpose of the
listing.
It would be impossible to show _all_ the allographs (for the same
reason you can't list all the allophones -- they involve individual
variation).
Different glyphs may be required for initial, medial, final, and
isolated forms of each character. A font must store all those glyphs,
and software must be able to specify the appropriate glyphs based on
context.
However, it doesn't make sense for the user to have to learn to type up
to four different glyphs for each logical character. It's simpler--and
faster--for the user to just type the logical characters and for the
software to figure out the appropriate glyphs based on context.
Also, from the standpoint of software, it's probably best to store the
characters in a glyph-independent format. This makes it possible for
software to quickly switch between a typewriter style, stringing one
glyph after another on a line, and a style that's a bit like
handwriting, with a more diagonal stacking of glyphs within words
(assuming a font that supports the difference).
Anything that requires parsing of Arabic strings should operate on the
logical characters, rather than on glyphs that might represent two or
more characters, so it makes more sense to store the logical characters
from this standpoint, as well.
You haven't defined which of your levels is "presentation characters,"
and "logical characters" is a very strange term.
Can you do it with "grapheme" and "allograph"?
Is it correct to say that we speak allophones, not phonemes, and that we
write allographs, not graphemes?
Yes
If so, then "grapheme" works for "logical character", and this is what
is covered by the 0600 set. (I was just following the original poster in
using that term.) "Allograph" works for the actual forms, which are
covered by the two "Presentation" sets.
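As a rough illustration of that split (a sketch, not a claim about how any particular system does it): Unicode compatibility normalization (NFKC) folds the presentation-form codes back onto the 06xx letters, so all four allographs of beh come out as the one grapheme code. It uses Python's standard unicodedata module; the helper name to_graphemes is invented here.

```python
import unicodedata

def to_graphemes(text: str) -> str:
    """Fold presentation-form (allograph) codes back to the base letters."""
    return unicodedata.normalize("NFKC", text)

# Isolated, final, initial, and medial forms of beh.
beh_forms = "\uFE8F\uFE90\uFE91\uFE92"
print([f"{ord(c):04X}" for c in to_graphemes(beh_forms)])
# ['0628', '0628', '0628', '0628']
```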
So (I'm looking at Unicode Version 1 Manual vol. 1, pp. 218 and 552)
0600 0628 and 0600 FE8F are identical in form, different in function.
Was that an efficient way to do it?
Yes. A typist using Arabic-QWERTY input on the Mac can type the <b> key on the keyboard and the software will see it as 0628. Then it will cause the display of one of the four glyphs, FE8F, FE90, FE91, or FE92, based on context. It's fast and efficient.
Otherwise, the typist would have to take responsibility for producing the appropriate contextual variant. It would be necessary to use four different keystrokes or keystroke combinations for each grapheme. The chances of typos would be greatly increased, it would be harder to learn to type in Arabic, and typing would probably be much slower than it is now. Try using a standard US keyboard input method to type some text where three out of four characters require the shift key, the option key, or the command key--in no particular sequence. I don't think you'd find it efficient.
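The selection step described above can be sketched in a few lines of Python. This is a simplification under stated assumptions, not how any real text engine works: it builds its table from the compatibility decompositions in the standard unicodedata module, treats a letter as joining forward only if it has an initial form in the FE70-FEFF block, and ignores vowel marks, ligatures such as laam-alif, and joiner controls. The names FORMS, build_table, TABLE, and shape are made up for this example.

```python
import unicodedata

FORMS = ("<isolated>", "<final>", "<initial>", "<medial>")

def build_table():
    """Map (base letter, form tag) -> presentation-form character (FE70-FEFF)."""
    table = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter forms only; ligatures and spacing marks have
        # more than one code point after the tag.
        if len(parts) == 2 and parts[0] in FORMS:
            table[(chr(int(parts[1], 16)), parts[0])] = chr(cp)
    return table

TABLE = build_table()

def shape(word: str) -> str:
    """Pick a positional form for each stored letter, based on its neighbours."""
    out = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        # A join needs the earlier letter to connect forward (it has an
        # initial form) and the later letter to connect backward (final form).
        joins_prev = (prev is not None and (prev, "<initial>") in TABLE
                      and (ch, "<final>") in TABLE)
        joins_next = (nxt is not None and (ch, "<initial>") in TABLE
                      and (nxt, "<final>") in TABLE)
        if joins_prev and joins_next:
            form = "<medial>"
        elif joins_prev:
            form = "<final>"
        elif joins_next:
            form = "<initial>"
        else:
            form = "<isolated>"
        out.append(TABLE.get((ch, form), ch))  # leave unknown characters alone
    return "".join(out)

# The word for 'sea': beh, hah, reh, stored as 0628 062D 0631.
print(" ".join(f"{ord(c):04X}" for c in shape("\u0628\u062D\u0631")))
# FE91 FEA4 FEAE
```

The typist only ever produces 0628, 062D, 0631; the choice among the positional forms is made entirely from context.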
You might think that they could just use the codes for the various isolated forms in the Presentation Forms sets for input and storage, but there are likely to be some inefficiencies associated with that.
From a programming standpoint, it's very easy to look at a character code and know that, because it is greater than 05FF and less than 0700, it represents a grapheme that needs to be converted for display. Otherwise, some kind of table of which codes are doing double duty as grapheme and allograph would be required, and it might require constant reference to that table while processing text. It might even require the maintenance of separate graphemic and allographic strings in memory, which could be orders of magnitude less efficient in terms of memory usage than maintaining a separate set of codes for graphemes. (Whether to place actual glyphs at those codepoints in a particular font is a font designer decision, and has nothing to do with Unicode as such.)
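To make the range test concrete, here is a small sketch. The first function is exactly the comparison described above; the second uses the block ranges for the two Presentation Forms sets (FB50-FDFF and FE70-FEFF), which are an addition not spelled out in this post. Both function names are invented for illustration.

```python
def is_arabic_grapheme(cp: int) -> bool:
    """True if the code is greater than 05FF and less than 0700."""
    return 0x05FF < cp < 0x0700

def is_presentation_form(cp: int) -> bool:
    """True if the code falls in Presentation Forms-A or Presentation Forms-B."""
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF

for cp in (0x0628, 0xFE91, 0x0041):
    print(f"{cp:04X}", is_arabic_grapheme(cp), is_presentation_form(cp))
# 0628 True False
# FE91 False True
# 0041 False False
```

No table lookup is needed to tell the two kinds of code apart; a pair of comparisons does it.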
In programming, comprehensibility commonly turns out to be much more important to overall efficiency than either economy of storage or apparent simplicity of structure. In the long run, ambiguity is dangerous and leads to errors--which can be even worse than mere "inefficiency". Separating the graphemic from the allographic reduces ambiguity, increases clarity, and is, thus, more efficient in the long run.
If I had to write text-handling software with Arabic input, storage, and display capabilities, I'd prefer the current Unicode approach over anything else I can come up with.
Do <A> and <a> have some platonic representation of "first roman
letter" with the appropriate variant chosen contextually?
I had to jump ahead a couple hundred pages in Gleason to find out about this. He says (pg. 410) that "capitalization is in many ways comparable to a 'suprasegmental' phoneme." Based on his treatment of Greek Σ, it appears that we can say that A consists of two graphemes <a> and <capitalization>. Does this mean that there is no <A> grapheme? Or, can one grapheme consist of two other graphemes? I'm confused. (The sci.lang Introduction to Descriptive Linguistics course is slow, but it's free.)
So, do we say that ﺑ is <ﺏ> plus <initial>, ﺒ is <ﺏ> plus <medial>, and ﺐ is <ﺏ> plus <final>? Do we avoid saying that ﺏ is <ﺏ> plus <isolated> by providing a general rule that the citation form is the isolated form? This seems like a reasonable approach in linguistics, but not necessarily in computer science.
The "Presentation Forms-B" set seems to be mostly the standard
positional allographs required for the basic Arabic alphabet. There are
a few exceptions, such as variations on alif and laam-alif. I tend to
think of laam-alif as a ligature.
The "Presentation Forms-A" set seems to cover the allographs of the
graphemes of non-Arabic languages, as well as a number of ligatures. For
example:
FBA0 ARABIC LETTER RNOON ISOLATED FORM
FBD4 ARABIC LETTER NG FINAL FORM
FBB1 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE FINAL FORM
FC0B ARABIC LIGATURE TEH WITH JEEM ISOLATED FORM
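Entries like these can be tied back to their underlying letters with a short script. This is only a sketch using Python's standard unicodedata module: it reads each code's name and compatibility decomposition and names the base letters. The helper name chart_row and the "<unnamed>" placeholder are invented for this example, and the output depends on the Unicode data bundled with your Python.

```python
import unicodedata

def chart_row(cp: int):
    """Return (code, name, form tag, names of the underlying letters)."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<unnamed>")
    parts = unicodedata.decomposition(ch).split()
    # The decomposition looks like '<isolated> 062A 062C': a tag, then codes.
    form = parts[0] if parts and parts[0].startswith("<") else ""
    bases = [unicodedata.name(chr(int(p, 16)), "<unnamed>")
             for p in parts if not p.startswith("<")]
    return f"{cp:04X}", name, form, bases

for cp in (0xFBA0, 0xFBB1, 0xFC0B):
    print(chart_row(cp))
```

For a ligature such as FC0B the last field lists both component letters, which is one way to tell ligatures from plain positional forms.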
I hadn't realized how many non-Arabic characters there are. I'd love to
see a single listing of what sounds they represent in various languages.
The Unicode names do tend to suck. I have no idea what "RNOON" or
"PEHEH" might refer to.
See The World's Writing Systems for all the languages that use Arabic
script nowadays, with a couple extra.
Right. Once I knew to look under Sindhi for FBA0 and FC0B, I was able to spot them. It's rather inconvenient, though.
So, can you add a single chart with Unicode numbers to the next edition? (And will there be a good upgrade price for current owners?)
--
Mike Wright
http://www.raccoonbend.com