[reportlab-users] pyRXP vs ONIX DTD

Fri, 6 Dec 2002 09:13:13 +0000

In article <20021206080823.GA612@codeworks.lt>, Marius Gedminas <marius@codeworks.lt> writes
......
>> 
>> Are you saying there's a unique utf-8 version of these 16bit things?
>
>UTF-8 can technically express codes from 0 to 2^31 - 1, although
>currently both Unicode and ISO 10646 limit their codespaces to
>0..0x10ffff.
>
Yes, I knew that in theory, but thought there were some ambiguity problems in conversion back and
forth. I saw another table y'day saying that shortest forms aren't required.

http://www.unicode.org/versions/corrigendum1.html

Anyhow I guess the question arises as to whether there's a convenient description of an
acceptable algorithm that maps the decimal

0 <= dddddd <= 2^31-1

to its UTF-8 byte sequence.

I assume we need two flavours, one for a bigendian and one for a little endian, but again
perhaps I'm overlooking something obvious. The nice table from the above

Scalar Value                UTF-16                               1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 0xxxxxxx           00000000 0xxxxxxx                    0xxxxxxx
00000yyy yyxxxxxx           00000yyy yyxxxxxx                    110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx           zzzzyyyy yyxxxxxx                    1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx  110110ww wwzzzzyy 110111yy yyxxxxxx  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx

*  Where uuuuu = wwww + 1 (to account for addition of 0x10000 as in Section 3.7, Surrogates).

seems only to go to 2^16-1, but am I always to assume the bigendian ordering for the Scalar
Value?
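
For what it's worth, here is a minimal sketch of how the UTF-8 columns of that table might be
coded in Python. It treats the scalar value as a plain integer, so no byte order enters into it,
and the four-byte row takes it up to 0x10ffff. The name utf8_encode is made up for illustration;
it is not part of pyRXP or any library, and the range check is my assumption about what should
be rejected.

```python
def utf8_encode(cp):
    """Encode one Unicode scalar value (an int) as UTF-8, row by row from the table."""
    # Assumption: reject values outside 0..0x10FFFF and the surrogate range.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Because each range has exactly one branch, this always emits the shortest form, so a decimal
reference such as &#8364; would go through int("8364") and then utf8_encode.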

......
>UTF-8 is endian-independent, that's one of its advantages (UTF-16 and
>UTF-32, and also UCS-2 and UCS-4 come in two flavours -- big-endian and
>little-endian.  BTW, UCS-4 is identical to UTF-32, and UCS-2 differs
>from UTF-16 in that UTF-16 can express characters above 0xffff using
>surrogate pairs, while UCS-2 is limited to Basic Multilingual Plane
>(characters 0-0xffff)).
>
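To convince myself of the UTF-16 side, a rough sketch of the surrogate arithmetic and the two
byte orders, under the assumption that the input is already a valid scalar value (no checking is
done). The names utf16_units and utf16_bytes are hypothetical helpers, not library functions.

```python
import struct

def utf16_units(cp):
    """Return the UTF-16 code units (16-bit ints) for one scalar value."""
    if cp < 0x10000:
        return [cp]                      # BMP: one unit, same as UCS-2
    cp -= 0x10000                        # the "wwww = uuuuu - 1" step in the table
    return [0xD800 | (cp >> 10),         # 110110ww wwzzzzyy (high surrogate)
            0xDC00 | (cp & 0x3FF)]       # 110111yy yyxxxxxx (low surrogate)

def utf16_bytes(cp, big_endian=True):
    """Serialise the code units; only here does byte order matter."""
    fmt = ">H" if big_endian else "<H"
    return b"".join(struct.pack(fmt, unit) for unit in utf16_units(cp))
```

The code units are the same either way; only the serialisation of each 16-bit unit flips, which
is why UTF-16 has two flavours and UTF-8, being a byte sequence, has one.
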
>Aren't encodings fun?  It's almost as bad as understanding time
>measurements (leap seconds, oh my...).
..... oh yes indeed
-- 
Robin Becker