Re: Chinese character & pinyin frequency analysis



"LEE Sau Dan" <danlee@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message news:87wstnq5ox.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
"Richard" == Richard Wordingham <jrw0602@xxxxxxxxxxx> writes:

>> It's a file or workflow management issue. We always have lots of
>> files and copies (real ones or virtual) around. Having them
>> isn't a problem. Not managing them properly is.

>> Nobody has ever complained that after writing a C program, one
>> has to compile it and that generates a second file---the
>> executable. (Intermediate-level programs may need to use a
>> third, intermediate file---object code, too.) Why?

Richard> Because there is little temptation for maintenance to be
Richard> done on the object or executable files.

Then, remove that temptation in a similar way! :)

But an HTML final product is usually relatively easy to maintain. I don't have the option of generating the HTML on the fly, which is the nearest analogy I can think of to having binary executable code as the final product.

>> If you're using the native Win32 port of Emacs, then Unix font
>> specs. are irrelevant. The Win32 port should be using native
>> Windows fonts. You may need to consult the manual, though, to
>> find out the details.

Richard> Does one exist for the Win32 port?

You may try the Emacs wiki when looking for answers. Or try googling.

I've been googling, but I'll give the wiki a go.

Richard> There also definitely seems to be a problem with
Richard> proportional fonts.

No problem for me. (Emacs 21.4.1 on Linux)

Richard> The word น้ำ is unrecognisable after typing it,

It displays well. And when I slide the cursor across it, the cursor
does go up to the tone mark appropriately!

That's quite different to the Windows port (of Emacs 22.1.1), which emulates a cell-based display device when one is editing a line. I think the problem here is that Emacs is generating the word in two halves - <no nu, mai tho> and <sara am>. Now, Microsoft-approved fonts have two general (though I think not universal) features:

1) They rearrange the text (with help from Uniscribe, at least, if you have 'complex scripts' enabled) as <no nu, nikhahit, mai tho> and <sara aa>. The effect is subtle, though emulating it might help with the next problem.

2) Windows XP fonts primarily intended for use with Thai have an error indication method for sara am at the start of a 'word': they replace it with a black oblong that lurks in the private use area. I think that what happens when I enter sara am is that it is rendered as an isolated word and falls foul of this error indication. You might complain that Microsoft is not treating it in accordance with its Unicode classification as a free-standing uncased letter (Lo), but the THAI CHARACTER NIKHAHIT itself was once classified as a letter rather than a non-spacing mark.
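
For what it's worth, the rearrangement in point 1 can be sketched in a few lines of Python. This is only my guess at the logical effect (split sara am into nikhahit + sara aa and move the nikhahit in front of any tone mark); it is not how Uniscribe or any font actually implements it, and the function name and the tone-mark set are my own assumptions.

```python
# Sketch only: emulate the reordering of sara am described above.
# reorder_sara_am and TONE_MARKS are illustrative names, not a real API.
SARA_AM = "\u0e33"    # THAI CHARACTER SARA AM
NIKHAHIT = "\u0e4d"   # THAI CHARACTER NIKHAHIT
SARA_AA = "\u0e32"    # THAI CHARACTER SARA AA
# Assumed set of tone marks: mai ek, mai tho, mai tri, mai chattawa.
TONE_MARKS = {"\u0e48", "\u0e49", "\u0e4a", "\u0e4b"}


def reorder_sara_am(text):
    """Replace each sara am by nikhahit + sara aa, with the nikhahit
    placed before any tone marks that immediately precede the sara am."""
    out = []
    for ch in text:
        if ch == SARA_AM:
            # Walk back over tone marks already emitted.
            i = len(out)
            while i > 0 and out[i - 1] in TONE_MARKS:
                i -= 1
            out.insert(i, NIKHAHIT)
            out.append(SARA_AA)
        else:
            out.append(ch)
    return "".join(out)


# <no nu, mai tho, sara am>
word = "\u0e19\u0e49\u0e33"
reordered = reorder_sara_am(word)
print([f"U+{ord(c):04X}" for c in reordered])
```

For the word above this gives <no nu, nikhahit, mai tho, sara aa>, i.e. the two halves quoted in point 1 run together.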

Richard> though repasting the line does cause it to display
Richard> properly.

Try Ctrl-L.

Thanks. That's useful advice.

The use of fixed width fonts seems to have its own potential nightmares. There are some pretty weird column widths being reported by wcwidth() - Solaris 2.10's UTF-8 locales report lower case vowels with caron as being 2 columns wide! Caron is the only accent that has this effect. I can only assume it is because they're being taken from a font intended for Chinese use of pinyin! I strongly suspect the corresponding wcswidth() does *not* respect canonical equivalence!
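
To illustrate the kind of thing I suspect, here is a rough Python sketch using the East Asian Width property from the standard unicodedata module. It is not Solaris's wcwidth(); naive_width and its ambiguous_wide switch are made up for the example. The point is only that a per-code-point width function which treats 'ambiguous' characters as wide gives different answers for canonically equivalent strings.

```python
import unicodedata


def naive_width(s, ambiguous_wide):
    """Hypothetical per-code-point column count (not any real wcswidth):
    combining marks take 0 columns, wide/fullwidth take 2, ambiguous
    characters take 2 only if ambiguous_wide is set, everything else 1."""
    total = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # non-spacing mark: no column of its own
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F") or (eaw == "A" and ambiguous_wide):
            total += 2
        else:
            total += 1
    return total


precomposed = "\u01ce"   # LATIN SMALL LETTER A WITH CARON
decomposed = "a\u030c"   # a + COMBINING CARON

# The two spellings are canonically equivalent...
assert unicodedata.normalize("NFC", decomposed) == precomposed

# ...but the precomposed letter is East Asian 'ambiguous' width,
# while the plain base letter 'a' is narrow.
print(unicodedata.east_asian_width(precomposed))   # A
print(naive_width(precomposed, ambiguous_wide=True))
print(naive_width(decomposed, ambiguous_wide=True))
```

With ambiguous_wide set, the precomposed form counts as 2 columns and the decomposed form as 1, which is the sort of disagreement I would expect unless the text is normalised first.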

I've a suspicion as to what is fouling up my entry of Khmer and Lao. Windows XP insists that I associate a locale with an input method, and for these languages I used Catalan and Latvian respectively. It seems that something somewhere that Emacs is heeding doesn't believe that Latvian text can include letters of the Lao script or that Catalan text can include Khmer letters. Certainly ASCII characters for Lao and Khmer are getting in through the input editor. Possibly the solution is to define Emacs input editors for those scripts. (I'm going to have to automate my keyboard definitions - I'm heading for having a Microsoft keyboard layout, a web page and now an Emacs definition!)

Given that there is an Emacs tutorial (C-u C-h T) in Thai pretty early
in Emacs 20 (when the other 'translations' are English and Japanese
only), I believe there are serious Thai users of Emacs. So, the
support can't be that inadequate.

I'm now totally baffled as to what is happening with fonts. I found out how to get proportional fonts in the menu, and tried out Code2000 just to get the pasted-in Khmer and Lao to display. Suddenly Thai input worked tolerably - combining characters don't appear until I use Ctrl-L, but thereafter I can move the cursor through them without trashing the display. I think the font's GPOS table isn't being used until I hit Ctrl-L - I see occasional traces of combining characters off to the right of the base character. I'm not getting any ligation in 'Ca‍esar' (though I do with Code2000 in Notepad), but at least ZWJ is not displaying as a space.

Richard.
