[reportlab-users] pyRXP vs ONIX DTD

Marius Gedminas reportlab-users@reportlab.com
Fri, 6 Dec 2002 10:08:24 +0200


On Thu, Dec 05, 2002 at 06:43:03PM +0000, Robin Becker wrote:
> >> pyRXP is open source so anybody could try and switch it to 16 bit.
> >
> >(I'm not sure that's a good idea; Unicode is 20.1 bits wide, and UTF-16
> >combines all the disadvantages of both UTF-8 and UTF-32.)
> >
> >Marius Gedminas
> I'm not sure I understand this really as I'm not an expert on encodings.
> And certainly not the author of RXP.
> 
> Are you saying there's a unique utf-8 version of these 16bit things?

UTF-8 can technically express codes from 0 to 2^31 - 1, although
currently both Unicode and ISO 10646 limit their codespaces to
0..0x10ffff.
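
A rough sketch of those limits, assuming a modern Python 3 interpreter
(not the Python of this thread's era); the names `samples`,
`utf8_lengths` and `is_valid_code_point` are made up for illustration:

```python
# Number of bytes UTF-8 needs for a few sample code points.
samples = [0x41, 0xE9, 0x20AC, 0x10000, 0x10FFFF]
utf8_lengths = {cp: len(chr(cp).encode("utf-8")) for cp in samples}

def is_valid_code_point(cp):
    """Return True if Python accepts cp as a Unicode code point."""
    try:
        chr(cp)
        return True
    except (ValueError, OverflowError):
        # chr() rejects anything outside 0..0x10ffff.
        return False

for cp, n in utf8_lengths.items():
    print(f"U+{cp:04X}: {n} byte(s) in UTF-8")
print(is_valid_code_point(0x10FFFF), is_valid_code_point(0x110000))
```

Within the current codespace UTF-8 never needs more than four bytes per
character; the longer five- and six-byte forms of the original design
only matter for codes above 0x10ffff.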

> I
> had thought there were some problems. Is the byte sequence always
> defined bigendian/littleendian?

UTF-8 is endian-independent; that's one of its advantages.  (UTF-16 and
UTF-32, and also UCS-2 and UCS-4, come in two flavours -- big-endian and
little-endian.  BTW, UCS-4 is for practical purposes identical to UTF-32,
and UCS-2 differs from UTF-16 in that UTF-16 can express characters above
0xffff using surrogate pairs, while UCS-2 is limited to the Basic
Multilingual Plane (characters 0-0xffff).)
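
To make the byte-order and surrogate-pair points concrete, here is a
small sketch, again assuming Python 3 and its standard codecs; the
helper `surrogate_pair` is a hypothetical name used only for this
illustration:

```python
# A character outside the Basic Multilingual Plane:
# U+1D11E MUSICAL SYMBOL G CLEF.
ch = "\U0001D11E"

utf8 = ch.encode("utf-8")          # one byte sequence, no byte-order variants
utf16_be = ch.encode("utf-16-be")  # surrogate pair, big-endian
utf16_le = ch.encode("utf-16-le")  # same pair, bytes swapped in each unit
utf32_be = ch.encode("utf-32-be")  # one 32-bit unit, big-endian
utf32_le = ch.encode("utf-32-le")  # same unit, little-endian

def surrogate_pair(cp):
    """Split a code point above 0xffff into UTF-16 high and low surrogates."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

for name, data in [("utf-8", utf8), ("utf-16-be", utf16_be),
                   ("utf-16-le", utf16_le), ("utf-32-be", utf32_be),
                   ("utf-32-le", utf32_le)]:
    print(f"{name:10} {data.hex(' ')}")
print([hex(u) for u in surrogate_pair(ord(ch))])
```

UCS-2 has no way to write this character at all, since it has no
surrogate mechanism.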

Aren't encodings fun?  It's almost as bad as understanding time
measurements (leap seconds, oh my...).

Marius Gedminas
-- 
Those parts of the system that you can hit with a hammer (not advised)
are called hardware; those program instructions that you can only curse
at are called software.
                -- Levitating Trains and Kamikaze Genes: Technological
                   Literacy for the 1990's.