Re: getting out of LaTeX



Fri, 18 Aug 2006 19:33:16 +0300: "Jukka K. Korpela"
<jkorpela@xxxxxxxxx>: in sci.lang:

Capek? Hacek?

Would you believe that I intended to write "caron", then realized that it's
just character standard jargon and meant to write "hacek" instead but
produced a mixture?

I believe you, it happens to me a lot too; just google and see how many
of my mistakes are based on such processes.

Anyway, the point is that the hacek is not always easy to distinguish from
the breve. When you see a "v"-like diacritic that has neither a clear angle
nor clear rounding, how can you guess which one it is?

Context. So that's the only thing an OCR program can rely on too. E.g.
if the text looks like Rumanian, it's probably a breve or circumflex,
if some other East-European language, more likely a hacek.

Anyway, the language setting is paramount here.
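
To make the idea concrete, here is a rough Python sketch of how a language
setting could decide between the two marks. It is only an illustration, not
how Finereader or any other OCR program actually works: the function name
resolve_ambiguous_mark and the small LANGUAGE_DIACRITIC table are made up
for this example, and the table is far from complete.

```python
import unicodedata

COMBINING_BREVE = "\u0306"  # as in Rumanian a-breve
COMBINING_CARON = "\u030C"  # the hacek, as in Czech c-caron

# Hypothetical and incomplete: which "v"-like mark each language prefers.
LANGUAGE_DIACRITIC = {
    "ro": COMBINING_BREVE,
    "cs": COMBINING_CARON,
    "sk": COMBINING_CARON,
    "sl": COMBINING_CARON,
    "hr": COMBINING_CARON,
}

def resolve_ambiguous_mark(base, lang):
    """Return base plus the mark preferred by lang, or None if lang is unknown."""
    mark = LANGUAGE_DIACRITIC.get(lang)
    if mark is None:
        return None  # no language hint, so the ambiguity stays unresolved
    # NFC composes base + combining mark into one code point where one exists.
    return unicodedata.normalize("NFC", base + mark)
```

With the setting "ro" the letter "a" under an unclear mark comes out as
a-breve (U+0103); with "cs" the letter "c" comes out as c-caron (U+010D).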

But this means that the program must have big problems with multilingual
texts that contain words and even phrases from different languages.

Probably, yes.
For such cases, maybe there should be a way to mark the language
per text part, like Word can for the spell check. But
somebody would have to set the marks.

Automatic language recognition is a viable option too. AFAIK,
Finereader doesn't have it yet. Word has it in versions after 97
(2000 and up).
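
For what it's worth, a very crude form of automatic recognition can be
sketched in a few lines of Python by counting common function words. This is
an assumption-laden toy, not the method Word uses: the STOPWORDS lists and
the name guess_language are invented here, and the words are written without
diacritics on purpose, since the diacritics are exactly what the OCR is
unsure about.

```python
# Hypothetical, tiny word lists; a real recognizer would need far more data.
STOPWORDS = {
    "ro": {"este", "care", "pentru", "din", "sunt", "acest"},
    "cs": {"je", "jsem", "jako", "nebo", "pro", "kdyz"},
}

def guess_language(text):
    """Return the language whose function words occur most often, or None."""
    words = text.lower().split()
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no known word was seen at all, refuse to guess.
    return best if scores[best] > 0 else None
```

Run per sentence or per paragraph, something like this could set the
language marks automatically, though short foreign phrases inside a text
would still be easy to miss.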