[reportlab-users] Re: pyRXP vs ONIX DTD (and XHTML)

Robin Becker reportlab-users@reportlab.com
Sat, 7 Dec 2002 10:45:39 +0000


In article <DA4D5674-0984-11D7-86D5-000393B63DDC@shangri-la.dropbear.id.au>,
Stuart Bishop <zen@shangri-la.dropbear.id.au> writes
....... agreed I understood the problem was in the imported DTD
>
>Solutions?
>        - Build RXP with 16-bit support. This would involve
>          the glue code returning Unicode strings. I think the only
>          con would be memory consumption, and breaking some
>          existing code that can't cope with getting a Unicode string
>          instead of an 8-bit string? A few quick tests show that this
>          is a bit more involved than simply changing CHAR_SIZE
>          in setup.py. I guess the glue needs to be modified to return
>          Unicode strings and use codecs.utf_16_decode at
>          the relevant times. I swore off C years ago so am a bit
>          slow here :-)
>
I think this is the only real solution
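
For what it's worth, the Python side of the decoding step mentioned above
might look roughly like this. It is only a sketch: the byte string stands in
for whatever a 16-bit RXP build would hand back, and decode_rxp_utf16 is a
made-up helper name, not part of pyRXP.

```python
import codecs


def decode_rxp_utf16(raw: bytes) -> str:
    """Decode a UTF-16 buffer (hypothetical 16-bit RXP output) to a string."""
    # codecs.utf_16_decode returns a (text, bytes_consumed) pair. It honours
    # a leading byte-order mark and otherwise assumes native byte order.
    text, _consumed = codecs.utf_16_decode(raw)
    return text


# Simulated parser output; encoding with "utf-16" prepends a BOM.
sample = "caf\u00e9 \u2122".encode("utf-16")
print(decode_rxp_utf16(sample))
```

The real work would be in the C glue, which would have to build the Unicode
objects itself; the snippet only shows what the caller should end up seeing.
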
>        - Hack RXP to work internally in UTF-8. When it hits an
>          'invalid 8-bit entity', have RXP expand it into the two-char
>          UTF-8 representation. Solves the memory issue, but are
>          there other side effects? No idea how easy this one is to
>          implement.
>
the problem here is that I assume we have to know the document is
supposed to be in UTF-8. I haven't looked internally at RXP to see if
this is easily known to the parser. It seems odd to me that the DTD can
be used in any document. Can a UTF-8 doc define itself using a Unicode
DTD? I think this is the one that will appeal to Andy (in his
pointy-haired manager incarnation), but I would like to get some
agreement on when this hack is allowed.
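
To make the UTF-8 option concrete: once the code point is known, expanding a
numeric character reference into UTF-8 bytes is mechanical. The sketch below
is Python rather than RXP's C, and char_ref_to_utf8 is an illustrative name
only. Note that the result is two bytes only for code points U+0080 to
U+07FF; higher code points need three or four.

```python
def char_ref_to_utf8(ref: str) -> bytes:
    """Turn a reference such as '&#233;' or '&#x2122;' into UTF-8 bytes."""
    if not (ref.startswith("&#") and ref.endswith(";")):
        raise ValueError("not a numeric character reference: %r" % ref)
    body = ref[2:-1]
    # XML writes hexadecimal references with a lowercase 'x' prefix.
    codepoint = int(body[1:], 16) if body.startswith("x") else int(body, 10)
    return chr(codepoint).encode("utf-8")


print(char_ref_to_utf8("&#233;"))    # two bytes
print(char_ref_to_utf8("&#x2122;"))  # three bytes
```
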
>        - Run RXP in 'ExpandCharacterEntities=0' mode, and have
>          the glue validate and expand the character entities itself.
>          Probably a performance nightmare.
>
this is what we have been doing, but we are passing the XML fragments
around into other bits of XML.
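
A rough sketch of what expanding the references in the glue could look like,
assuming the unexpanded references arrive as literal "&#...;" text in the
strings the parser returns. It is pure Python, does not call pyRXP, and
expand_char_refs is a hypothetical helper. It would have to be applied only
to ordinary character data, never to text that came from a CDATA section.

```python
import re

# Decimal (&#233;) and hexadecimal (&#xE9;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def expand_char_refs(text: str) -> str:
    """Replace numeric character references with the characters they name."""
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits, 10)
        # chr() raises ValueError for values beyond U+10FFFF.
        return chr(codepoint)
    return _CHAR_REF.sub(_replace, text)


# Named entities such as &amp; are deliberately left alone.
print(expand_char_refs("caf&#233; &#x2122; &amp;"))
```

This does no checking against the XML rules for legal characters, so it
covers the "expand" half of the job and not the "validate" half.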


>Whatever mode is chosen, pyRXP.Parser().parse() will need
>to return Unicode strings.

I disagree here. If the parser is using UTF-8 then it may return that.
You can argue that there should be an alternative return mechanism, i.e.
do the conversion to Unicode, but that would also be only one statement
away in modern Python.
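
For a single string the statement in question is just raw.decode("utf-8").
For a whole result the caller needs a small walk over the tree. The sketch
below assumes the (tagName, attributes, children, spare) tuple layout and
UTF-8 byte strings throughout; decode_tree is an illustrative helper, not
part of pyRXP, and the sample tree is built by hand rather than parsed.

```python
def decode_tree(node, encoding="utf-8"):
    """Return a copy of a tuple tree with every byte string decoded."""
    if isinstance(node, bytes):
        return node.decode(encoding)
    name, attrs, children, spare = node
    if attrs is not None:
        attrs = {k.decode(encoding): v.decode(encoding)
                 for k, v in attrs.items()}
    if children is not None:
        children = [decode_tree(child, encoding) for child in children]
    # The spare slot is passed through untouched.
    return (name.decode(encoding), attrs, children, spare)


# Hand-built stand-in for parser output.
tree = (b"p", {b"lang": b"fr"},
        [b"caf\xc3\xa9", (b"br", None, None, None)], None)
print(decode_tree(tree))
```
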


>We cannot leave expansion up to the
>application as it has no idea if this should be done for a particular
>block of data (we don't want to do entity expansions on data
>pulled from a CDATA section or risk corrupting them).
>
>This is a pretty major problem with the parser - according to
>http://www.w3.org/TR/REC-xml#charsets pyRXP is failing to do
>something it *must* be able to do. Also, same issue I was having
>with ONIX is also true of XHTML 1.0:
>
>XHTML=u'''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
>    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
><html xmlns="http://www.w3.org/1999/xhtml"><body></body></html>'''
>pyRXP.Parser().parse(XHTML)
>
>

-- 
Robin Becker