[reportlab-users] new pyRXP

Wed, 12 Mar 2003 13:10:37 +1100

On Tuesday, March 11, 2003, at 09:42  PM, Robin Becker wrote:

>>> I have one change in mind to allow >8bit numerical entities in utf-8
>>> documents, but have encountered a problem in rxp which I don't really
>>> understand. Basically when rxp sees a utf-8 declaration it switches
>>> that to CE_unspecified_ascii_superset. Under what circumstances can
>>> we allow >8bit numerical entities

> yes I realized that, what I didn't expect is that utf-8 encoding gets
> mapped to CE_unspecified_ascii_superset; then deep in the guts where
> normally we get an error I want to check whether utf-8 is in use I
> can't.

I think I see. Arbitrary non-ASCII characters can be used anywhere
arbitrary ASCII characters can, and if RXP is running in 8 bit mode
it just doesn't care if the encoding is Latin-1 or Cyrillic or ASCII.
Character entities like &#xa3; are expanded to '\xa3'. Character
entities like &#x152; generate an error because they are multibyte.
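
A rough sketch of that 8 bit behaviour in Python, in case it helps.
The helper name expand_char_refs_8bit is made up for illustration and
is not part of pyRXP or RXP; it only assumes that anything above 0xFF
has to be rejected because it cannot be held in one byte.

```python
import re

# Matches numeric character references such as &#xa3; or &#163;
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def expand_char_refs_8bit(text):
    """Hypothetical helper (not pyRXP's API): expand numeric character
    references the way an 8 bit parser would, rejecting anything that
    does not fit in a single byte."""

    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code > 0xFF:
            raise ValueError(
                "character reference %s does not fit in 8 bits" % match.group(0)
            )
        # Code points 0-255 line up with the Latin-1 byte values.
        return chr(code)

    return _CHAR_REF.sub(replace, text)
```

Calling expand_char_refs_8bit("&#xa3;5") gives back the pound sign
followed by "5", while expand_char_refs_8bit("&#x152;") raises
ValueError.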

>> btw, are you just trying to make character references work better,
>> or are you trying to go the whole hog and have pyRXP recognize
>> <=01>Unicode</=01> as well formed and validatable XML (in which case
>> I guess a Unicode version of pyRXP would become trivial, simply
>> running pyRXP's results through the utf-8 decoder)?
>>
> I just want to allow for multi-byte entities inside 8 bit utf. Since we
> have a decent translation I think that we should be able to do it.
>
> I hadn't considered putting in an extra layer to map utf-16/32 into
> utf-8. Does that sound feasible? It would mean altering the io etc. I'm

I wouldn't bother. To do Unicode properly with RXP, you have no choice
but to output Unicode strings and take the performance hit (which seems
to be the UTF-16 decoding step, not RXP in 16 bit mode). And if you have
an XML file encoded in UTF-16 or UTF-32, I doubt very much that it only
contains 8 bit characters :-)

> really quite vague on all these mappings. From my rather poor xml
> background I see the problem in byte terms. In a utf-8 entity am I
> allowed to switch to one of the latins etc?  Also just because we use
> an 8-bit encoding does that imply we're encoding 8-bit data ie can we do
> translations of 16 bits (that would seem to depend on endianness etc)?

I think you need to decide what output you are supporting, as I think
this will clarify what needs to happen inside the parser. As you probably
want to return Python strings, you need to determine what 8 bit encodings
are supported.

If you just support Latin-1, I think pyRXP currently does everything
you want except that it dies when a multibyte character reference is
used in an entity definition. So the only thing that needs to change is
for this to somehow become valid, only raising an exception if the XML
actually tries to *use* that entity reference (so the DTD parser would
flag &oelig; as illegal when it attempts to parse it, and RXP would have
to detect the use of illegal entity references?)
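
To make the "only fail on use" idea concrete, here is a minimal sketch
in Python. DeferredEntityTable and its methods are invented names for
illustration, not anything in pyRXP or RXP, and the sketch assumes the
output encoding is Latin-1.

```python
class DeferredEntityTable:
    """Hypothetical entity table, not pyRXP's API.

    Definitions are always accepted; an entity that cannot be
    represented in the 8 bit output encoding only fails when used.
    """

    def __init__(self, encoding="latin-1"):
        self.encoding = encoding
        self._definitions = {}

    def define(self, name, replacement):
        # Reading the DTD never raises, even for &oelig; (U+0153).
        self._definitions[name] = replacement

    def use(self, name):
        replacement = self._definitions[name]
        try:
            return replacement.encode(self.encoding)
        except UnicodeEncodeError:
            raise ValueError(
                "entity &%s; is not representable in %s" % (name, self.encoding)
            ) from None


table = DeferredEntityTable()
table.define("pound", "\u00a3")
table.define("oelig", "\u0153")  # accepted at definition time
```

With this, table.use("pound") returns the single Latin-1 byte for the
pound sign, and only table.use("oelig") raises. A document that defines
&oelig; but never references it would parse cleanly.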

If you want to support other 8 bit character sets, it gets more complex,
as &pound; does not necessarily map to '\xa3' in that 8-bit encoding and
may not be representable at all.
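
A quick way to see this from Python, using only the standard codecs.
The helper encode_or_none is a made-up name for illustration; cp437 and
koi8-r are just two example 8 bit encodings, not ones pyRXP is known to
support.

```python
def encode_or_none(char, encoding):
    """Hypothetical helper: return the bytes for char in the given
    encoding, or None if the encoding has no such character."""
    try:
        return char.encode(encoding)
    except UnicodeEncodeError:
        return None


pound = "\u00a3"  # the character &pound; stands for

latin1_pound = encode_or_none(pound, "latin-1")  # the familiar '\xa3'
cp437_pound = encode_or_none(pound, "cp437")     # a different byte value
koi8r_pound = encode_or_none(pound, "koi8-r")    # None: no pound sign
```

So a parser returning 8 bit strings would need a per-encoding table for
named entities, plus a policy for the ones that have no byte at all.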

It is pointless deciding that UTF-8 encoded output is given, as you
would always have to decode it into a Unicode string for it to be
usable.

If you are always outputting Unicode strings, then you always want to
allow multibyte characters everywhere. It would be sensible to restrict
character sets of the input documents to ASCII, Latin-1 or UTF-8 to
avoid having to do a slow translation before handing off to RXP.

If you are only outputting Unicode strings if the source document
is utf-8 I think you are going to have to detect this.

Any of this rambling helping?

-- 
Stuart Bishop <zen@shangri-la.dropbear.id.au>
http://shangri-la.dropbear.id.au/