Unicode (Was: Re: subjective feelings about actions?)
jim_breen_at_hotmail.com
Date: 09/10/04
- In reply to: Srin・Tuar: "Re: subjective feelings about actions?"
Date: 10 Sep 2004 22:51:52 GMT
Srin・Tuar <SrinTuar@example.net> dixit:
>jim_breen@hotmail.com wrote:
>> Sean Gilbertson <sean-lists@nospam.bluebeard.org> dixit:
>>>though it
>>>would be wrong to say that UTF-8 is not employed by many developers
>>>instead of UTF-16 and UTF-32 to save space in memory or disk.
>> I have yet to meet a developer who uses it for that reason. It is
>> not that widely used as an internal format. It was developed for
>> use in files, particularly text files, and that's where it is most
>> useful.
>Hrm, I would say it is used as an internal encoding more widely than
>any other Unicode encoding, mainly because many ASCII applications
>need no changes whatsoever to become UTF-8 capable, not even a
>recompile. Furthermore, most Linux apps default to it, whereas most
>Windows apps default to ASCII (some of the larger ones use UCS-2 or
>UTF-16).
It depends on the use of the text. If you are just dealing with
text strings, UTF-8 is fine and the way to go. I use it that way
internally too. If you are working at the character level, e.g. in
a text editor, working on raw UTF-8 can be a chore. Many apps doing
character-by-character processing expand every character into a 16-
or 32-bit unit and work at that level, converting back to UTF-8
as the text goes to file, display, etc.
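
To make the expand/process/convert-back pattern concrete, here is a
minimal Python sketch. The function name reverse_characters is made up
for illustration, and it treats one code point as one "character", so
combining sequences are not handled.

```python
def reverse_characters(utf8_bytes: bytes) -> bytes:
    """Reverse text per code point rather than per byte."""
    # Expand: one integer per code point, i.e. a fixed-width working form.
    code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]
    # Work at the character level.
    code_points.reverse()
    # Convert back to UTF-8 on the way out (file, display, etc.).
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


if __name__ == "__main__":
    data = "\u65e5\u672c\u8a9e text".encode("utf-8")  # three kanji, then ASCII
    print(reverse_characters(data).decode("utf-8"))
    try:
        # Reversing the raw bytes scrambles the multibyte sequences.
        data[::-1].decode("utf-8")
    except UnicodeDecodeError as exc:
        print("byte-level reversal is not valid UTF-8:", exc.reason)
```

Doing the same job directly on the UTF-8 bytes is possible, but you have
to find every character boundary yourself, which is the chore I mean.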
>I see UTF-16 as the worst possible choice of any Unicode encoding,
>yet it is still chosen by a few large projects (foolishly, imo), most
>notably Mozilla, JavaScript interpreters, the Java programming
>language, and libiiimf. (I won't go into too many details of why UTF-16
>is less preferable, or this will become a cross-post from linux-utf8.)
"Worst" is a bit strong. The vast majority of Unicode codepoints that
one will ever want to deal with fit into 16 bits (the Basic Multilingual
Plane). There is probably a good case for dealing with the others as
exceptions.
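
As a rough illustration of what "exceptions" means in UTF-16 terms, the
sketch below splits a character into 16-bit code units. The helper name
utf16_units is hypothetical, and it assumes its input is a single
non-surrogate code point.

```python
from typing import List


def utf16_units(ch: str) -> List[int]:
    """Return the UTF-16 code units for one code point (assumed non-surrogate)."""
    cp = ord(ch)
    if cp < 0x10000:
        # Fits in 16 bits: a single code unit, the common case.
        return [cp]
    # The exception: split into a high and a low surrogate.
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


if __name__ == "__main__":
    # U+8A9E is an ordinary kanji; U+20000 lies outside the 16-bit range.
    for ch in ("A", "\u8a9e", "\U00020000"):
        print("U+%04X ->" % ord(ch), [hex(u) for u in utf16_units(ch)])
```

Code that indexes by 16-bit unit stays correct for everything in the
first branch and needs care only when a surrogate turns up.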
>Some apps that do lots of internal code point transformations may
>choose UTF-32, but this is relatively rare. It's just about as optimal
>to case convert directly in a multibyte encoding anyway once you
>consider composing characters etc.
>UTF-7 and ISO-2022 don't play nice with parsers and text file loaders
>that consider various ASCII characters to be syntax significant.
成る程 (I see).
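
A small Python sketch of the parser problem, using the standard
iso2022_jp, euc_jp and utf-8 codecs. The helper ascii_range_bytes is a
made-up name; it just collects the bytes that an ASCII-oriented parser
would take at face value.

```python
def ascii_range_bytes(data: bytes) -> bytes:
    """Return the bytes below 0x80, i.e. what a naive parser reads as ASCII."""
    return bytes(b for b in data if b < 0x80)


WORD = "\u6210\u308b\u7a0b"  # a three-character Japanese word, no ASCII in it

if __name__ == "__main__":
    for codec in ("iso2022_jp", "euc_jp", "utf-8"):
        encoded = WORD.encode(codec)
        print(codec, encoded, "ASCII-range bytes:", ascii_range_bytes(encoded))
```

In ISO-2022-JP every byte is in the ASCII range, so characters such as
"$" or "(" can turn up inside Japanese text. In EUC-JP and UTF-8 the
same word contains no ASCII-range bytes at all.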
>EUC-JP and UTF-8 generally don't have that problem.
>Furthermore, deleting a single byte from a file in UTF-8 will at most
>corrupt a single character, but could render an entire file invalid in
>those stateful encodings.
Indeed. I'm no fan of stateful encodings. I use EUC internally at present
(I can handle Latin diacritics via JIS X 0212), but at some stage I need
to move things like xjdic and WWWJDIC to working in Unicode internally.
I'm dreading it, because either internal UTF-8 or internal UCS-2/UCS-4
will mean a major amount of pain. For space/efficiency reasons I'm
leaning to UTF-8, but it means a lot of fiddling.
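
On the single-byte deletion point, a quick Python sketch. The helper
delete_byte and the sample text are illustrative only, and the
ISO-2022-JP case deliberately drops the leading escape byte, which is
about the worst place to lose one.

```python
def delete_byte(data: bytes, index: int) -> bytes:
    """Return data with the byte at the given index removed."""
    return data[:index] + data[index + 1:]


TEXT = "\u6f22\u5b57\u304b\u306a"  # two kanji followed by two hiragana

if __name__ == "__main__":
    # UTF-8: lose one byte from the first character.
    utf8_damaged = delete_byte(TEXT.encode("utf-8"), 1)
    print(utf8_damaged.decode("utf-8", errors="replace"))

    # ISO-2022-JP: lose the ESC that switches into the two-byte character set.
    jis_damaged = delete_byte(TEXT.encode("iso2022_jp"), 0)
    print(jis_damaged.decode("iso2022_jp", errors="replace"))
```

In the UTF-8 case only the damaged character is lost and the decoder
resynchronises at the next lead byte. In the ISO-2022-JP case the
decoder never enters the two-byte state, so the rest of the run is read
as ASCII and none of the original characters come back.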
-- 
Jim Breen  http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering,
Monash University, VIC 3800, Australia
ジム・ブリーン@モナシュ大学