Re: check for non-english




<todaysmulan@xxxxxxxxxxxxx> wrote in message news:1153098675.133026.198010@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
how does a translation machine check for chinese korean japanese and
other weird languages ?


By looking at the character codes the computer encounters in the document. Chinese text from mainland China usually uses a GB encoding, whilst Chinese text from Taiwan usually uses Big5. Japanese can be found encoded in JIS, Shift-JIS and other encodings, and Korean has its own too. The characters covered by each encoding are slightly different, and in any text the character codes fall into certain ranges, which can be used to guess what encoding, and hence what language, it is in. However, encodings like GB, JIS (in its EUC form) and Korean EUC occupy the same code ranges, so the codes alone often cannot settle it, and a human cannot tell what language the text is in unless he looks at how the characters display. Even that requires some knowledge of the languages concerned. If the displayed text is gibberish, it is very likely that the wrong encoding was selected, and any machine translation of the text based on that encoding would be wrong. This is why online translation software requires you to select the input and target output languages.
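To make the guessing step concrete, here is a rough Python sketch, not how any particular translation service does it. It simply tries to decode the raw bytes under a handful of candidate encodings and reports which ones succeed. The candidate list and the function name `plausible_encodings` are my own choices for illustration; the codec names are the ones Python's standard library uses.

```python
# Candidate encodings to try; this list is an assumption for the example,
# real detectors consider many more and also weigh character frequencies.
CANDIDATES = ["gb2312", "big5", "shift_jis", "euc_jp", "euc_kr"]


def plausible_encodings(data: bytes, candidates=CANDIDATES):
    """Return the candidate encodings under which `data` decodes without error.

    A successful decode only means the bytes are *valid* in that encoding,
    not that the resulting text is meaningful in the matching language.
    """
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # Some byte sequence falls outside this encoding's valid ranges.
            continue
        ok.append(name)
    return ok


if __name__ == "__main__":
    sample = "中文".encode("gb2312")  # bytes as they would appear in a GB document
    print(plausible_encodings(sample))
```

When more than one name comes back, the byte ranges alone have not settled the question, which is the overlap problem described above. You then need either a statistical guess over which characters are common in each language, or a person telling the software what the input language is.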

Dyl.
