[reportlab-users] pyRXP vs ONIX DTD
Robin Becker
reportlab-users@reportlab.com
Fri, 6 Dec 2002 09:13:13 +0000
In article <20021206080823.GA612@codeworks.lt>, Marius Gedminas <marius@codeworks.lt> writes
......
>>
>> Are you saying there's a unique utf-8 version of these 16bit things?
>
>UTF-8 can technically express codes from 0 to 2^31 - 1, although
>currently both Unicode and ISO 10646 limit their codespaces to
>0..0x10ffff.
>
Yes, I knew that in theory, but thought there were some ambiguity problems in converting back and
forth. I saw another table yesterday saying that shortest forms aren't required.
http://www.unicode.org/versions/corrigendum1.html
Anyhow, I guess the question is whether there's a convenient description of an
acceptable algorithm that maps the decimal
0 <= dddddd <= 2^31-1
to UTF-8.
I assume we need two flavours, one for big-endian and one for little-endian, but again
perhaps I'm overlooking something obvious. The nice table from the above link
Scalar Value                UTF-16                               1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 0xxxxxxx           00000000 0xxxxxxx                    0xxxxxxx
00000yyy yyxxxxxx           00000yyy yyxxxxxx                    110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx           zzzzyyyy yyxxxxxx                    1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx  110110ww wwzzzzyy 110111yy yyxxxxxx  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
* Where uuuuu = wwww + 1 (to account for addition of 0x10000 as in Section 3.7, Surrogates).
seems only to go to 2^16-1, but am I always to assume big-endian ordering for the Scalar
Value?
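
The bit patterns in the table translate fairly directly into code. What follows is only a
minimal sketch in Python 3, not anything taken from pyRXP; the name utf8_encode is made up
for illustration. It assumes the scalar value arrives as a plain integer (as it would from a
decimal character reference), so no byte order is involved on input, and it assumes the
0..0x10FFFF range rather than the theoretical 2^31-1. It always emits the shortest form and
rejects surrogates.

```python
def utf8_encode(cp):
    """Encode one Unicode scalar value (an int) as UTF-8 bytes.

    Hypothetical helper for illustration only. It follows the bit
    distribution table and always produces the shortest form.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("scalar value out of range: %r" % (cp,))
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Because the input is a number and the output is a sequence of single bytes, one function
covers both big-endian and little-endian machines.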
......
>UTF-8 is endian-independent, that's one of its advantages (UTF-16 and
>UTF-32, and also UCS-2 and UCS-4 come in two flavours -- big-endian and
>little-endian. BTW, UCS-4 is identical to UTF-32, and UCS-2 differs
>from UTF-16 in that UTF-16 can express characters above 0xffff using
>surrogate pairs, while UCS-2 is limited to Basic Multilingual Plane
>(characters 0-0xffff)).
>
>Aren't encodings fun? It's almost as bad as understanding time
>measurements (leap seconds, oh my...).
..... oh yes indeed
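
To make the endianness point concrete, here is a small Python 3 sketch under the same
caveats as before: the helper names utf16_units and utf16_bytes are invented for
illustration, and the input is assumed to be a valid scalar value (no range or surrogate
checks). It builds the UTF-16 code units, using the surrogate arithmetic from the table
footnote, and then serialises them in either byte order. The UTF-8 form of the same
character has no such choice.

```python
import struct

def utf16_units(cp):
    """Return the UTF-16 code units (as ints) for one scalar value."""
    if cp < 0x10000:
        return [cp]                     # BMP: a single 16-bit unit
    v = cp - 0x10000                    # the "wwww = uuuuu - 1" adjustment
    return [0xD800 | (v >> 10),         # high surrogate: 110110ww wwzzzzyy
            0xDC00 | (v & 0x3FF)]       # low surrogate:  110111yy yyxxxxxx

def utf16_bytes(cp, big_endian=True):
    """Serialise the code units in the requested byte order."""
    units = utf16_units(cp)
    fmt = (">" if big_endian else "<") + "%dH" % len(units)
    return struct.pack(fmt, *units)

ch = 0x1D11E                            # MUSICAL SYMBOL G CLEF, outside the BMP
print(utf16_bytes(ch, True).hex())      # UTF-16, big-endian
print(utf16_bytes(ch, False).hex())     # UTF-16, little-endian
print(chr(ch).encode("utf-8").hex())    # UTF-8: only one possible byte sequence
```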
--
Robin Becker