Unicode (Was: Re: subjective feelings about actions?)

jim_breen_at_hotmail.com
Date: 09/10/04


Date: 10 Sep 2004 22:51:52 GMT


?????? <SrinTuar@example.net> dixit:

>jim_breen@hotmail.com wrote:
>> Sean Gilbertson <sean-lists@nospam.bluebeard.org> dixit:
>>>though it
>>>would be wrong to say that UTF-8 is not employed by many developers
>>>instead of UTF-16 and UTF-32 to save space in memory or disk.

>> I have yet to meet a developer who uses it for that reason. It is
>> not that widely used as an internal format. It was developed for
>> use in files, particularly text files, and that's where it is most
>> useful.

>Hrm, I would say it is used as an internal encoding more widely than
>any other unicode encoding. (Mainly because for many ascii applications,
>no changes whatsoever are required to become utf-8 capable, not even
>a recompile.) Furthermore, most linux apps default to it, whereas most
>windows apps default to ascii. (Some of the larger ones use UCS-2 or
>UTF-16.)

It depends on the use of the text. If you are just dealing with
text strings, UTF8 is fine and the way to go. I use it that way
internally too. If you are working at the character level, e.g. in
a text editor, working on raw UTF8 can be a chore. Many apps doing
character by character processing expand every character into 16 or
32-bit whatevers and work at that level, converting it back to UTF8
as it goes to file, display, etc.
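
As a rough illustration of that expand-then-convert-back pattern, here is a minimal Python sketch. The helper names are hypothetical, and a Python list of integers stands in for the "16 or 32-bit whatevers" an editor written in C would use:

```python
def utf8_to_codepoints(data):
    """Expand UTF-8 bytes into a list of integer code points (one per character)."""
    return [ord(ch) for ch in data.decode("utf-8")]


def codepoints_to_utf8(codepoints):
    """Pack a list of code points back into UTF-8 bytes for file or display output."""
    return "".join(chr(cp) for cp in codepoints).encode("utf-8")


raw = "日本語 text".encode("utf-8")  # variable width: 3 bytes per kanji, 1 per ASCII char
cps = utf8_to_codepoints(raw)        # fixed width: one list slot per character
del cps[1]                           # character-level edit: drop the second character
edited = codepoints_to_utf8(cps)     # convert back on the way out
```

Indexing into `cps` addresses whole characters directly, whereas indexing into `raw` addresses bytes that may fall in the middle of a multibyte sequence.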

>I see UTF-16 as the worst choice possible of any unicode encoding,
>yet it is still chosen by a few large projects (foolishly, imo), most
>notably mozilla, javascript interpreters, the programming language
>java, and libiiimf. (I won't go into too many details of why utf-16
>is less preferable or this will become a cross-post from linux-utf8)

"Worst" is a bit strong. The vast majority of Unicode codepoints that
one will ever want to deal with fit into 16 bits. There is probably a good
case for dealing with the others as exceptions.
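
To make "the others as exceptions" concrete: in UTF-16 a code point up to U+FFFF is stored as a single 16-bit unit, and anything above that is split into a surrogate pair. The following is a sketch of that rule with a hypothetical function name, not code from any of the projects mentioned:

```python
def utf16_code_units(cp):
    """Return the 16-bit UTF-16 code units for one Unicode code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point: %#x" % cp)
    if cp < 0x10000:
        return [cp]                      # common case: one unit
    offset = cp - 0x10000                # exceptional case: 20 bits split 10/10
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]


common = utf16_code_units(0x65E5)    # U+65E5 (日) fits in 16 bits
rare = utf16_code_units(0x20000)     # U+20000 (CJK Extension B) needs a surrogate pair
```

Code that treats every 16-bit unit as a character works for the first case and has to special-case the second.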

>Some apps that do lots of internal codepoint transformations may
>choose utf-32, but this is relatively rare. It's just about as optimal
>to case convert directly in a multibyte encoding anyway once you
>consider composing characters etc.

>UTF7 and iso-2022 don't play nice with parsers and text file loaders
>that consider various ascii characters to be syntax significant.

成る程 ("I see").
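
The problem with UTF-7 and ISO-2022 is visible in the raw bytes: both represent non-ASCII text using bytes from the ASCII range, so a parser scanning for characters such as `$` or `.` can find them inside Japanese text. A small sketch using Python's standard codecs (the helper name is hypothetical):

```python
def ascii_bytes_used(text, encoding):
    """Return the printable ASCII characters that appear in the encoded form of text."""
    encoded = text.encode(encoding)
    return sorted({chr(b) for b in encoded if 0x20 <= b < 0x7F})


sample = "成る程"  # pure Japanese text, no ASCII characters in it

for name in ("iso2022_jp", "utf-7", "euc_jp", "utf-8"):
    # Show the encoded bytes and which ASCII characters a naive parser would see.
    print(name, sample.encode(name), ascii_bytes_used(sample, name))
```

For ISO-2022-JP and UTF-7 the list is non-empty; for EUC-JP and UTF-8 every byte of this sample has the high bit set, so the list is empty.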

>EUC-JP and UTF-8 generally don't have that problem.
>Furthermore deleting a single byte from a file in UTF-8 will at most
>corrupt a single character, but could render an entire file invalid in
>those stateful encodings.
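
The difference in damage from a single lost byte can be checked with a short experiment. This is a sketch under stated assumptions: the helper names are hypothetical, Python's `errors="replace"` handler stands in for whatever recovery a real application would do, and the stateful case deletes the escape byte that switches ISO-2022-JP into two-byte mode:

```python
REPLACEMENT = "\ufffd"  # U+FFFD, inserted by the decoder where bytes are invalid


def drop_byte(data, index):
    """Return data with the byte at position index removed."""
    return data[:index] + data[index + 1:]


def chars_lost(text, encoding, index):
    """Delete one encoded byte and count how many original characters fail to survive."""
    damaged = drop_byte(text.encode(encoding), index)
    decoded = damaged.decode(encoding, errors="replace").replace(REPLACEMENT, "")
    limit = min(len(text), len(decoded))
    prefix = 0
    while prefix < limit and text[prefix] == decoded[prefix]:
        prefix += 1                      # characters intact before the damage
    suffix = 0
    while suffix < limit - prefix and text[-1 - suffix] == decoded[-1 - suffix]:
        suffix += 1                      # characters intact after the damage
    return len(text) - prefix - suffix


text = "日本語テキスト"

# UTF-8: try deleting every byte in turn and record the worst case.
worst_utf8 = max(chars_lost(text, "utf-8", i)
                 for i in range(len(text.encode("utf-8"))))

# ISO-2022-JP: delete only the leading ESC byte of the mode-switch sequence.
lost_iso2022 = chars_lost(text, "iso2022_jp", 0)

print(worst_utf8, lost_iso2022)
```

With the escape byte gone, the remaining bytes are read as plain ASCII and no character of the original text comes back, whereas the UTF-8 decoder resynchronizes at the next lead byte.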

Indeed. I'm no fan of stateful encodings. I use EUC internally at present
(I can handle Latin diacritics via JIS X 0212), but at some stage I need
to move things like xjdic and WWWJDIC to internal Unicode working. I'm
dreading it, because either internal UTF8 or internal UTF-16/32 will mean
a major amount of pain. For space/efficiency reasons I'm leaning to UTF8,
but it means a lot of fiddling.

-- 
Jim Breen        http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering,
Monash University, VIC 3800, Australia 
ジム・ブリーン@モナシュ大学

