[reportlab-users] new pyRXP

Robin Becker reportlab-users@reportlab.com
Tue, 11 Mar 2003 10:42:33 +0000


In article <FF6DC341-5393-11D7-BC8B-000393B63DDC@shangri-la.dropbear.id.au>,
Stuart Bishop <zen@shangri-la.dropbear.id.au> writes
>
>On Monday, March 10, 2003, at 11:13  PM, Robin Becker wrote:
>
>> I have checked in a bunch of changes to pyRXP to allow for comments and
>> processing instructions to be made into special nodes. Things seem to
>> work OK, but it will affect those who currently rely on them being
>> inline.
>>
>> I have one change in mind to allow >8bit numerical entities in utf-8
>> documents, but have encountered a problem in rxp which I don't really
>> understand. Basically when rxp sees a utf-8 declaration it switches that
>> to CE_unspecified_ascii_superset. Under what circumstances can we
>> allow >8bit numerical entities?
>

>In case you didn't already think of it, this approach will break if
>a document is fed in that isn't encoded in ASCII or UTF-8 (which is a
>perfectly valid restriction).
>

Yes, I realized that. What I didn't expect is that the utf-8 encoding
gets mapped to CE_unspecified_ascii_superset; then deep in the guts,
where normally we get an error and I want to check whether utf-8 is in
use, I can't.

>Character entities are expanded in PCDATA content & attribute values.
>They are not expanded in comments and CDATA sections. They are illegal
>in element names, attribute names and the bit before whitespace in
>a processing instruction. I don't know about the
>bit-after-any-whitespace of a processing instruction, but I suspect
>that they are not expanded.
>
>Does that help, or were you asking a different question?
>http://www.w3.org/TR/REC-xml#sec-entexpand might help.
>
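
As an illustration of the expansion rules quoted above, here is a small
sketch using Python's standard xml.dom.minidom (not pyRXP). The sample
document and the helper name classify_children are made up for this
example; it only shows where a conforming parser expands the character
reference &#8364; and where it leaves it alone.

```python
from xml.dom import Node, minidom

# Hypothetical sample: the same character reference in an attribute value,
# in PCDATA, in a comment and in a CDATA section.
SAMPLE = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<root attr="&#8364;">&#8364;<!--&#8364;--><![CDATA[&#8364;]]></root>'
)


def classify_children(xml_bytes):
    """Return the attribute value and the data of each kind of child node."""
    root = minidom.parseString(xml_bytes).documentElement
    found = {"attr": root.getAttribute("attr")}
    for child in root.childNodes:
        if child.nodeType == Node.CDATA_SECTION_NODE:
            found["cdata"] = child.data
        elif child.nodeType == Node.COMMENT_NODE:
            found["comment"] = child.data
        elif child.nodeType == Node.TEXT_NODE:
            found["text"] = child.data
    return found


result = classify_children(SAMPLE)
```

With this input the attribute and text values come back as the single
character U+20AC, while the comment and CDATA data keep the literal
seven characters of the reference.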

>btw, are you just trying to make character references work better,
>or are you trying to go the whole hog and have pyRXP recognize
><>Unicode</> as well formed and validatable XML (in which case
>I guess a Unicode version of pyRXP would become trivial, simply
>running pyRXP's results through the utf-8 decoder)?
>
I just want to allow multi-byte character entities inside utf-8
documents. Since we have a decent translation, I think we should be
able to do it.
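
To make the idea concrete, here is a rough Python sketch of expanding
numeric character references into utf-8 bytes. It is not how rxp or
pyRXP does it; the function name expand_char_refs_utf8 and the regular
expression are assumptions for illustration only, and the sketch ignores
context (it would also rewrite references inside comments or CDATA).

```python
import re

# Matches decimal (&#8364;) and hexadecimal (&#x20AC;) character references.
_CHAR_REF = re.compile(rb"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def expand_char_refs_utf8(data: bytes) -> bytes:
    """Replace each numeric character reference with its utf-8 byte sequence."""

    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # chr() raises ValueError above U+10FFFF, and encoding a surrogate
        # raises UnicodeEncodeError; a real parser would report an error.
        return chr(code_point).encode("utf-8")

    return _CHAR_REF.sub(_replace, data)
```

A reference above 255 simply becomes more than one byte in the output,
for example three bytes for &#8364;, which is why the result is only
meaningful when the document encoding really is utf-8.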

I hadn't considered putting in an extra layer to map utf-16/32 into
utf-8. Does that sound feasible? It would mean altering the io etc. I'm
really quite vague on all these mappings. From my rather poor xml
background I see the problem in byte terms. In a utf-8 entity, am I
allowed to switch to one of the latins etc.? Also, just because we use an
8-bit encoding, does that imply we're encoding 8-bit data, i.e. can we do
translations of 16 bits (that would seem to depend on endianness etc.)?
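
For what such an extra layer might look like, here is a minimal Python
sketch, assuming the input carries a byte order mark. The function name
to_utf8 is hypothetical and this is not part of pyRXP. The BOM settles
the endianness question, and the utf-32 marks are tested first because
the little-endian utf-32 BOM begins with the utf-16 one.

```python
import codecs

# (BOM, codec) pairs; the "utf-16"/"utf-32" codecs consume the BOM and use
# it to pick the byte order, "utf-8-sig" just strips the utf-8 BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF8, "utf-8-sig"),
]


def to_utf8(raw: bytes) -> bytes:
    """Transcode BOM-marked utf-16/32 input to utf-8; pass other input through."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return raw.decode(codec).encode("utf-8")
    # No BOM: assume the bytes are already utf-8 or an ASCII superset.
    return raw
```

Note that this only converts the bytes. Any encoding name in the XML
declaration is left as it was, so a real layer would also have to deal
with a declaration that still says utf-16.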
-- 
Robin Becker