[reportlab-users] pyRXP vs ONIX DTD

Mon, 9 Dec 2002 10:13:39 +0200

On Fri, Dec 06, 2002 at 09:13:13AM +0000, Robin Becker wrote:
> In article <20021206080823.GA612@codeworks.lt>, Marius Gedminas <marius@codeworks.lt> writes
> ......
> >> 
> >> Are you saying there's a unique utf-8 version of these 16bit things?
> >
> >UTF-8 can technically express codes from 0 to 2^31 - 1, although
> >currently both Unicode and ISO 10646 limit their codespaces to
> >0..0x10ffff
> >
> yes I knew that in theory, but thought there were some ambiguity problems in conversion back and
> forth. I saw another table y'day saying that shortest aren't required.

I seem to remember that this was so in earlier versions of Unicode, but
later versions amended the requirement to "Thou Shalt Not Accept
Overlong UTF-8 Sequences".

> http://www.unicode.org/versions/corrigendum1.html

Hey, that's what the page above says: "Unicode Technical Committee has
modified the definition of UTF-8 to forbid conformant implementations
from interpreting non-shortest forms".
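
To illustrate the rule, here is a minimal sketch, assuming a current
Python 3 interpreter whose built-in "utf-8" codec enforces the
shortest-form requirement. The helper name decodes_as_utf8 is made up
for this example.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the built-in strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


shortest = b"\x2f"           # U+002F '/' in its one-byte (shortest) form
overlong_2 = b"\xc0\xaf"     # the same code point padded into two bytes
overlong_3 = b"\xe0\x80\xaf" # the same code point padded into three bytes

for candidate in (shortest, overlong_2, overlong_3):
    print(candidate.hex(), decodes_as_utf8(candidate))
```

Only the first sequence should be accepted; the two padded forms carry
the same payload bits but are not the shortest encoding.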

> Anyhow I guess the question arises as to whether there's a convenient description of an
> acceptable algorithm that maps the decimal
> 
> 0 <= dddddd <= 2^31-1

Yes.
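
A minimal sketch of one such mapping in Python, assuming the original
31-bit definition of UTF-8 (sequences of up to six bytes). The name
utf8_encode is hypothetical, not a library function. It always picks
the shortest sequence, so it never produces overlong forms; it does
not reject surrogates or values above 0x10ffff, which a strict modern
encoder would.

```python
def utf8_encode(scalar: int) -> bytes:
    """Encode an integer 0 <= scalar <= 2^31 - 1 as a shortest-form byte sequence."""
    if not 0 <= scalar < 2**31:
        raise ValueError("scalar value out of range")
    if scalar < 0x80:
        return bytes([scalar])  # 0xxxxxxx: plain ASCII

    # (exclusive upper bound, lead-byte marker, number of continuation bytes)
    table = (
        (0x800, 0xC0, 1),       # 110yyyyy 10xxxxxx
        (0x10000, 0xE0, 2),     # 1110zzzz 10yyyyyy 10xxxxxx
        (0x200000, 0xF0, 3),    # 11110uuu + 3 continuation bytes
        (0x4000000, 0xF8, 4),   # 5-byte form (original definition only)
        (0x80000000, 0xFC, 5),  # 6-byte form (original definition only)
    )
    for limit, lead, ncont in table:
        if scalar < limit:
            break

    out = []
    for _ in range(ncont):
        out.append(0x80 | (scalar & 0x3F))  # low 6 bits -> 10xxxxxx
        scalar >>= 6
    out.append(lead | scalar)               # remaining high bits go in the lead byte
    return bytes(reversed(out))


# U+20AC (euro sign) has the bit pattern zzzzyyyy yyxxxxxx, so it takes 3 bytes.
assert utf8_encode(0x20AC) == b"\xe2\x82\xac"
```

Note that the function works on the integer value only, so byte order
never enters into it.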

> I assume we need two flavours, one for a bigendian and one for a little endian, but again
> perhaps I'm overlooking something obvious. The nice table from the above
>
> Scalar Value                UTF-16                               1st Byte  2nd Byte  3rd Byte  4th Byte
> 00000000 0xxxxxxx           00000000 0xxxxxxx                    0xxxxxxx
> 00000yyy yyxxxxxx           00000yyy yyxxxxxx                    110yyyyy  10xxxxxx
> zzzzyyyy yyxxxxxx           zzzzyyyy yyxxxxxx                    1110zzzz  10yyyyyy  10xxxxxx
> 000uuuuu zzzzyyyy yyxxxxxx  110110ww wwzzzzyy 110111yy yyxxxxxx  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
>
> *  Where uuuuu = wwww + 1 (to account for addition of 0x10000 as in Section 3.7, Surrogates).
> 
> 
> seems only to go to 2^16-1,

(I assume you meant 2^21 - 1: the last row has 21 payload bits.)

> but am I always to assume the bigendian ordering for the Scalar
> Value?

Just treat the scalar value as a 32-bit integer (endianness is
irrelevant when you do not want to express it as a sequence of bytes).
Then, e.g. 0000000000000000zzzzyyyyyyxxxxxx maps to 1110zzzz 10yyyyyy
10xxxxxx in UTF-8 and to zzzzyyyyyyxxxxxx in UTF-16 (which is zzzzyyyy
yyxxxxxx in UTF-16BE and yyxxxxxx zzzzyyyy in UTF-16LE, when you
serialize it).
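
A minimal Python sketch of the same point: the 16-bit code units are
computed from the integer alone, and byte order only appears in the
final serialization step. The names utf16_units and serialize are made
up for this example, and the code does not check that the input is not
itself a surrogate value.

```python
def utf16_units(scalar: int) -> list:
    """Return the 16-bit code unit(s) for a scalar value, as plain integers."""
    if not 0 <= scalar <= 0x10FFFF:
        raise ValueError("scalar value out of UTF-16 range")
    if scalar < 0x10000:
        return [scalar]                 # one unit: zzzzyyyyyyxxxxxx
    v = scalar - 0x10000                # this subtraction is why uuuuu = wwww + 1
    return [0xD800 | (v >> 10),         # 110110ww wwzzzzyy
            0xDC00 | (v & 0x3FF)]       # 110111yy yyxxxxxx


def serialize(units, byteorder: str) -> bytes:
    """Write each 16-bit unit as two bytes; byteorder is 'big' or 'little'."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in units)


units = utf16_units(0x20AC)             # U+20AC, euro sign
assert serialize(units, "big") == b"\x20\xac"     # UTF-16BE: zzzzyyyy yyxxxxxx
assert serialize(units, "little") == b"\xac\x20"  # UTF-16LE: yyxxxxxx zzzzyyyy
```

The code units are identical in both cases; only the order in which
each unit's two bytes are written differs.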

Marius Gedminas
-- 
If vegetarians eat vegetables, what do humanitarians eat?