[reportlab-users] new pyRXP

Stuart Bishop reportlab-users@reportlab.com
Tue, 11 Mar 2003 18:35:14 +1100


On Monday, March 10, 2003, at 11:13  PM, Robin Becker wrote:

> I have checked in a bunch of changes to pyRXP to allow for comments and
> processing instructions to be made into special nodes. Things seem to
> work OK, but it will affect those who currently rely on them being
> inline.
>
> I have one change in mind to allow >8bit numerical entities in utf-8
> documents, but have encountered a problem in rxp which I don't really
> understand. Basically when rxp sees a utf-8 declaration it switches that
> to CE_unspecified_ascii_superset. Under what circumstances can we allow
> >8bit numerical entities?

In case you didn't already think of it, this approach will break if
a document is fed in that isn't encoded in ASCII or UTF-8 (which is a
perfectly valid restriction).
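
To make the restriction concrete, here is a minimal Python sketch. The
helper name expand_char_ref is hypothetical (it is not part of pyRXP or
rxp), and it assumes the parser simply writes the expanded character out
in the document's own encoding.

```python
import re


def expand_char_ref(ref, encoding):
    """Expand '&#10084;' or '&#x2764;' to bytes in the given encoding.

    Hypothetical helper for illustration only; not a pyRXP or rxp API.
    """
    m = re.fullmatch(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));", ref)
    if m is None:
        raise ValueError("not a numeric character reference: %r" % ref)
    # Group 1 holds hex digits, group 2 holds decimal digits.
    code_point = int(m.group(1), 16) if m.group(1) else int(m.group(2))
    # Raises UnicodeEncodeError if the encoding cannot represent the character.
    return chr(code_point).encode(encoding)
```

With "utf-8" as the encoding any reference can be written out, whereas an
8-bit encoding such as "latin-1" raises UnicodeEncodeError for anything
above U+00FF.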

Character references are expanded in PCDATA content and attribute values.
They are not expanded in comments and CDATA sections. They are illegal
in element names, attribute names and the bit before whitespace in
a processing instruction. I don't know about the bit-after-any-whitespace
of a processing instruction, but I suspect that they are not expanded.
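
A quick way to check these cases is to feed one small document through
a parser and look at what comes back in each context. The sketch below
uses Python's standard xml.dom.minidom (expat underneath) rather than
pyRXP, so it only shows what another parser does; the sample document
and the function name reference_handling are made up for illustration.

```python
from xml.dom import minidom

# The same reference, &#65; ("A"), placed in five different contexts.
SAMPLE = (
    '<r a="&#65;">'
    '<!--&#65;-->'
    '<![CDATA[&#65;]]>'
    '<?pi &#65;?>'
    '<t>&#65;</t>'
    '</r>'
)


def reference_handling(xml_text):
    """Return the text the parser reports for each context."""
    root = minidom.parseString(xml_text).documentElement
    seen = {"attribute": root.getAttribute("a")}
    for node in root.childNodes:
        if node.nodeType == node.COMMENT_NODE:
            seen["comment"] = node.data
        elif node.nodeType == node.CDATA_SECTION_NODE:
            seen["cdata"] = node.data
        elif node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            seen["pi"] = node.data
        elif node.nodeType == node.ELEMENT_NODE:
            seen["pcdata"] = node.firstChild.data
    return seen


if __name__ == "__main__":
    for context, text in sorted(reference_handling(SAMPLE).items()):
        print("%-10s %r" % (context, text))
```

The attribute and PCDATA entries should come back expanded, and the
comment, CDATA and processing-instruction entries should still hold the
literal reference text.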

Does that help, or were you asking a different question?
http://www.w3.org/TR/REC-xml#sec-entexpand might help.

btw, are you just trying to make character references work better,
or are you trying to go the whole hog and have pyRXP recognize
<❤>Unicode</❤> as well-formed and validatable XML (in which case
I guess a Unicode version of pyRXP would become trivial, simply
running pyRXP's results through the utf-8 decoder)?
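
Roughly what I mean by running the results through the decoder, as a
sketch only. It assumes the parser hands back pyRXP-style 4-tuples of
(name, attributes, children, spare) holding UTF-8 byte strings; the
function names are hypothetical and the tree in the usage note is built
by hand rather than produced by pyRXP.

```python
def _text(value, encoding):
    """Decode byte strings; pass anything else (str, None) through."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def decode_tree(node, encoding="utf-8"):
    """Recursively decode every byte string in a tuple tree.

    Assumes each element is a 4-tuple (name, attrs, children, spare),
    where attrs is a dict or None and children is a list or None.
    """
    if not isinstance(node, tuple):
        return _text(node, encoding)  # a text child
    name, attrs, children, spare = node
    if attrs is not None:
        attrs = {_text(k, encoding): _text(v, encoding)
                 for k, v in attrs.items()}
    if children is not None:
        children = [decode_tree(child, encoding) for child in children]
    return (_text(name, encoding), attrs, children, spare)
```

For example, a hand-built tree such as
(b"doc", None, [b"I \xe2\x9d\xa4 Unicode"], None) would come back with
str values throughout, with the three bytes E2 9D A4 turned into the
single character U+2764.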

-- 
Stuart Bishop <zen@shangri-la.dropbear.id.au>
http://shangri-la.dropbear.id.au/