[reportlab-users] Re: pyRXP vs ONIX DTD (and XHTML)
Stuart Bishop
reportlab-users@reportlab.com
Sat, 7 Dec 2002 12:40:23 +1100
On Thursday, December 5, 2002, at 09:32 PM, Andy Robinson wrote:
>> It's been some time since I last read XML specs, so I could be very
>> wrong, but I seem to remember that XML is basically inseparable from
>> Unicode.
>
> There were 8-bit and 16-bit options when building pyRXP.
> AFAIR it passed 100% of the unicode test suite in 16 bit mode
> and only passed the non-Unicode-related ones in 8 bit.
> If someone wants to play around with 16 bit builds and
> learn more about it, that's great.
>
> Personally I still believe that Python has its own Unicode
> library which should do conversions explicitly. Most Python
> apps are not yet using Unicode strings. But if someone can
> suggest a compatibility route I would welcome it.
In my original problem, the issue was not the XML code I am
generating, but the DTD I was validating against. I can't see
how Python's Unicode libraries would help in this situation -
the RXP library is trying to verify that a sequence of 6 or 7
8-bit characters represents a valid 16-bit character, and this
is failing.
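
To make the failure concrete, here is a small sketch in plain Python
(no pyRXP involved) of what such a character reference looks like. The
reference &#8212; is only an illustrative choice on my part, not
necessarily one taken from the ONIX DTD.

```python
# A numeric character reference is plain 8-bit text in the DTD...
ref = "&#8212;"                 # illustrative reference: 7 eight-bit characters
code_point = int(ref[2:-1])     # strip the leading "&#" and trailing ";"

# ...but the character it names does not fit in a single byte,
# which is where an 8-bit build has nowhere to put it.
fits_in_8_bits = code_point <= 0xFF
char = chr(code_point)

print(len(ref), code_point, fits_in_8_bits, repr(char))
```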
Solutions?
- Build RXP with 16-bit support. This would involve
the glue code returning Unicode strings. I think the only
cons would be memory consumption and breaking some
existing code that can't cope with getting a Unicode string
instead of an 8-bit string. A few quick tests show that this
is a bit more involved than simply changing CHAR_SIZE
in setup.py. I guess the glue needs to be modified to return
Unicode strings and use codecs.utf_16_decode at
the relevant times. I swore off C years ago so am a bit
slow here :-)
- Hack RXP to work internally in UTF-8. When it hits an
'invalid 8-bit entity', have RXP expand it into the multi-byte
UTF-8 representation. Solves the memory issue, but are
there other side effects? No idea how easy this one is to
implement.
- Run RXP in 'ExpandCharacterEntities=0' mode, and have
the glue validate and expand the character entities itself.
Probably a performance nightmare.
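
As a rough illustration of what the second and third options would have
to do, here is a sketch in pure Python. The helper name
expand_char_refs is hypothetical and is not part of pyRXP or RXP; it
only handles numeric references and does not check that the result is
a legal XML character, which real glue code would have to do.

```python
import re

# Matches decimal (&#8212;) and hexadecimal (&#x2014;) character references.
_CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")

def expand_char_refs(text):
    """Hypothetical helper: replace numeric character references with
    the characters they name. No validity checking is done here."""
    def _sub(match):
        body = match.group(1)
        code_point = int(body[1:], 16) if body.startswith("x") else int(body)
        return chr(code_point)
    return _CHAR_REF.sub(_sub, text)

expanded = expand_char_refs("caf&#233; &#8212; &#x201C;ok&#x201D;")

# Encoding to UTF-8 shows the size cost: characters below U+0800 take
# two bytes, the rest of the 16-bit range takes three.
two_bytes = len(expand_char_refs("&#233;").encode("utf-8"))
three_bytes = len(expand_char_refs("&#8212;").encode("utf-8"))

print(repr(expanded), two_bytes, three_bytes)
```

Doing this in the glue means scanning every string a second time, which
is the performance worry with the third option.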
Whatever mode is chosen, pyRXP.Parser().parse() will need
to return Unicode strings. We cannot leave expansion up to the
application as it has no idea if this should be done for a particular
block of data (we don't want to do entity expansions on data
pulled from a CDATA section or risk corrupting them).
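
The CDATA problem can be shown with a small sketch. It uses the
standard library's xml.etree.ElementTree purely as a stand-in for a
parser that expands references itself; it is not pyRXP, and the
"unexpanded" string below is my assumption about what an application
would be handed if the parser left references alone and flattened the
text.

```python
import re
import xml.etree.ElementTree as ET

# One real character reference, then the same seven characters as
# literal text inside a CDATA section.
doc = "<a>&#8212;<![CDATA[&#8212;]]></a>"

# A parser that expands references itself knows which is which:
# the first becomes a character, the CDATA content is left untouched.
parsed_text = ET.fromstring(doc).text

# Assumed view of the same content with references left unexpanded:
# the application can no longer tell the two apart.
unexpanded_text = "&#8212;" + "&#8212;"

# Naive application-level expansion corrupts the CDATA content.
naive_text = re.sub(r"&#([0-9]+);",
                    lambda m: chr(int(m.group(1))),
                    unexpanded_text)

print(repr(parsed_text))
print(repr(naive_text))
```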
This is a pretty major problem with the parser - according to
http://www.w3.org/TR/REC-xml#charsets pyRXP is failing to do
something it *must* be able to do. Also, the same issue I was having
with ONIX also affects XHTML 1.0:
import pyRXP

XHTML=u'''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><body></body></html>'''
# Fails for me in the same way as the ONIX case
pyRXP.Parser().parse(XHTML)
--
Stuart Bishop <zen@shangri-la.dropbear.id.au>
http://shangri-la.dropbear.id.au/