Re: Chinese character & pinyin frequency analysis



"LEE Sau Dan" <danlee@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
"Richard" == Richard Wordingham <jrw0602@xxxxxxxxxxx> writes:

The real question is: why bother doing it manually? And why
bother doing it at all?

Richard> 1. Older text editors aren't much good at mixing fonts.

Then, upgrade your editor. If you're serious enough to use Chinese
characters, you should be using one that does it properly.

Microsoft recently upgraded Notepad, at least for Windows XP users. It now searches one's fonts for characters not in the font you are using, so that problem has *now* largely gone away.

It wasn't a problem with Chinese characters, though Windows Vista apparently includes some horrible bodges to get round the TrueType limit of 64K glyphs per font. The problem I had was with mixing Khmer and IPA.

Richard> 2. Some of the time I use these codes (jargon: character
Richard> entities) to check the interpretation of control-like
Richard> characters, such as ligature controls, or to effect a
Richard> choice of normalisation. These would not readily show up
Richard> in most editors.

Such things should be done by programs (and programmers debugging
their programs). You don't use a hex-editor to create/check your
files, do you? Then, why check the unicode?

So which editor do you suggest? And sometimes I do resort to hex dumps to find out what characters are in a piece of text, though Windows 2002 and later offers an alternative method - so long as one hasn't had to resort to a hack font. And can I be sure an editor will not normalise my input?

I have, very occasionally, resorted to fixing Word files by editing the RTF files as plain text.

And, yes, I do resort to editing binary files when the need arises - the worst case was having to edit a VAX object file to initialise an additional register.
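The hex-dump check described above (finding out which characters are really in a piece of text, and whether it has been normalised) can also be scripted. What follows is only a minimal sketch in Python, using the standard unicodedata module. The function names dump_code_points and is_normalised, and the sample string, are made up for illustration.

```python
import unicodedata

def dump_code_points(text):
    """Return one line per character: its code point and Unicode name."""
    lines = []
    for ch in text:
        # Control codes and unassigned code points have no name; use a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} {name}")
    return lines

def is_normalised(text, form="NFC"):
    """True if the text is unchanged by the given normalisation form."""
    return unicodedata.normalize(form, text) == text

# Hypothetical sample: 'f', ZERO WIDTH NON-JOINER (a ligature control), 'l',
# then 'e' followed by a combining acute accent (not NFC-normalised).
sample = "f\u200cl e\u0301"
for line in dump_code_points(sample):
    print(line)
print("NFC-normalised:", is_normalised(sample))
```

Running the same check on a file before and after saving it from an editor would show whether the editor has quietly normalised the input.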

Richard> 3. It can be tempting to compact text by using a legacy
Richard> encoding. There are also message boards where characters
Richard> will get misinterpreted - I have had to enter accented
Richard> letters as character entities to avoid them being
Richard> misinterpreted according to a legacy code.

Use an editor that can do that automatically. :)

If you have one to hand. Using a legacy encoding for compacting will also result in one's having a pair of source and derived files, and possible problems if the editor is not clever enough to convert the character encoding declaration.
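When no such editor is to hand, the conversion to numeric character references is easy to script. A minimal sketch in Python, assuming the text is already available as a Unicode string; the function names to_char_refs and from_char_refs are invented for this example.

```python
import html

def to_char_refs(text):
    """Replace every non-ASCII character with a decimal character reference."""
    # The 'xmlcharrefreplace' error handler emits &#NNN; for anything
    # that cannot be encoded as ASCII.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

def from_char_refs(text):
    """Turn character references back into the characters they denote."""
    return html.unescape(text)

print(to_char_refs("\u014b"))        # eng, prints &#331;
print(from_char_refs("&#331;") == "\u014b")
```

The result is pure ASCII, so it survives a message board that would otherwise reinterpret the bytes according to a legacy code.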

Richard> 4. There are a few characters that are best entered in
Richard> HTML text as character entities ('<', '&' and
Richard> multi-character white space immediately spring to mind),
Richard> though there are symbolic names for these.

Again, these ought to be delegated to a decent editor.

So which cheap editor do you suggest for HTML incorporating ECMA-script ('javascript')?
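Failing a suitable editor, the escaping of '<' and '&' is at least mechanical. A rough sketch in Python using the standard html module; the wrapper name escape_markup is an assumption of this example, and it does nothing about multi-character white space or embedded script.

```python
import html

def escape_markup(text):
    """Escape '&', '<' and '>' so the text can be placed in HTML content."""
    # quote=False leaves quotation marks alone; they only need escaping
    # inside attribute values.
    return html.escape(text, quote=False)

print(escape_markup("a < b & c"))  # prints a &lt; b &amp; c
```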

Richard> 5. It's a lot quicker to type '&#331;' for eng than to
Richard> fiddle about with keyboard selections.

What are "keyboard selections"?

Selecting keyboard layouts. For small scripts (or language systems using small subsets), one normally selects a script- or language-specific keyboard layout. Thus, if I want to mix Thai, Lao, Khmer and Latin-1, I would normally switch between four different keyboard layouts. (I'm seriously considering knocking one up for IPA.) However, if a lot of keyboard layouts are enabled, switching keyboards is as tedious as switching fonts.

Richard.
