[reportlab-users] pyRXP vs ONIX DTD
Marius Gedminas
reportlab-users@reportlab.com
Mon, 9 Dec 2002 10:13:39 +0200
On Fri, Dec 06, 2002 at 09:13:13AM +0000, Robin Becker wrote:
> In article <20021206080823.GA612@codeworks.lt>, Marius Gedminas <marius@codeworks.lt> writes
> ......
> >>
> >> Are you saying there's a unique utf-8 version of these 16bit things?
> >
> >UTF-8 can technically express codes from 0 to 2^31 - 1, although
> >currently both Unicode and ISO 10646 limit their codespaces to
> >0..0x10ffff
> >
> yes I knew that in theory, but thought there were some ambiguity problems in conversion back and
> forth. I saw another table y'day saying that shortest aren't required.
I seem to remember that this was so in earlier versions of Unicode, but
later versions amended the requirement to "Thou Shalt Not Accept
Overlong UTF-8 Sequences".
> http://www.unicode.org/versions/corrigendum1.html
Hey, that's what the page above says: "Unicode Technical Committee has
modified the definition of UTF-8 to forbid conformant implementations
from interpreting non-shortest forms".
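
To illustrate what "forbid non-shortest forms" means in practice, here is a
minimal sketch. It assumes Python 3's built-in strict utf-8 codec (nothing
to do with pyRXP itself), and is_valid_utf8 is just a hypothetical helper
name. 0xC0 0xAF is an overlong two-byte spelling of U+002F ('/'), whose
shortest form is the single byte 0x2F.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts these bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

shortest = b"\x2f"      # U+002F '/' in its shortest form
overlong = b"\xc0\xaf"  # the same scalar value padded out to two bytes

print(is_valid_utf8(shortest))  # True
print(is_valid_utf8(overlong))  # False
```
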
> Anyhow I guess the question arises as to whether there's a convenient description of an
> acceptable algorithm that maps the decimal
>
> 0 <= dddddd <= 2^31-1
Yes.
> I assume we need two flavours, one for a bigendian and one for a little endian, but again
> perhaps I'm overlooking something obvious. The nice table from the above
>
> Scalar Value                UTF-16                               1st Byte  2nd Byte  3rd Byte  4th Byte
> 00000000 0xxxxxxx           00000000 0xxxxxxx                    0xxxxxxx
> 00000yyy yyxxxxxx           00000yyy yyxxxxxx                    110yyyyy  10xxxxxx
> zzzzyyyy yyxxxxxx           zzzzyyyy yyxxxxxx                    1110zzzz  10yyyyyy  10xxxxxx
> 000uuuuu zzzzyyyy yyxxxxxx  110110ww wwzzzzyy 110111yy yyxxxxxx  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
>
> * Where uuuuu = wwww + 1 (to account for addition of 0x10000 as in Section 3.7, Surrogates).
>
>
> seems only to go to 2^16-1,
(I assume you meant 2^21)
> but am I always to assume the bigendian ordering for the Scalar
> Value?
Just treat the scalar value as a 32-bit integer (endianness is
irrelevant when you do not want to express it as a sequence of bytes).
Then, e.g. 0000000000000000zzzzyyyyyyxxxxxx maps to 1110zzzz 10yyyyyy
10xxxxxx in UTF-8 and to zzzzyyyyyyxxxxxx in UTF-16 (which is zzzzyyyy
yyxxxxxx in UTF-16BE and yyxxxxxx zzzzyyyy in UTF-16LE, when you
serialize it).
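
The mapping above can be sketched in a few lines. This is only an
illustration in Python 3, not code from pyRXP or ReportLab; the function
names utf8_encode, utf16_units and utf16_serialize are made up for the
example. The scalar value is handled as a plain integer, and byte order
only shows up in the final UTF-16 serialization step.

```python
def _check_scalar(cp: int) -> None:
    # Unicode scalar values are 0..0x10FFFF, excluding the surrogate range.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))

def utf8_encode(cp: int) -> bytes:
    """Encode one scalar value as UTF-8, always using the shortest form."""
    _check_scalar(cp)
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def utf16_units(cp: int) -> list:
    """Return the 16-bit code units for one scalar value (no byte order yet)."""
    _check_scalar(cp)
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000     # this subtraction is the "uuuuu = wwww + 1" rule
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]

def utf16_serialize(cp: int, big_endian: bool = True) -> bytes:
    """Serialize the code units as UTF-16BE or UTF-16LE bytes."""
    order = "big" if big_endian else "little"
    return b"".join(u.to_bytes(2, order) for u in utf16_units(cp))

euro = 0x20AC  # falls in the zzzzyyyy yyxxxxxx row
print(utf8_encode(euro).hex())             # e282ac
print(utf16_serialize(euro, True).hex())   # 20ac
print(utf16_serialize(euro, False).hex())  # ac20
```
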
Marius Gedminas
--
If vegetarians eat vegetables, what do humanitarians eat?