[reportlab-users] pyRXP vs ONIX DTD

Stuart Bishop reportlab-users@reportlab.com
Thu, 5 Dec 2002 14:20:26 +1100


I'm trying to use pyRXP to validate ONIX documents that I am
generating. However, I get lots of 'not a valid 8-bit XML character'
errors unless I set the IgnoreEntities flag to true. The ONIX DTD
looks fine to me, although I'm no expert. The first character that is
picked up is "Œ" (0x152), which seems valid from my cursory reading
of the XML 1.0 spec.
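
For what it's worth, this is roughly how I checked that reading. It is
only a sketch: the ranges are copied from the Char production in
section 2.2 of the XML 1.0 spec, and the helper names (is_xml_char,
fits_in_8_bits) are my own, not part of pyRXP or RXP. It suggests the
character is legal XML but simply does not fit in a single byte, which
would match the wording of the error.

```python
# Char production from XML 1.0, section 2.2:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_CHAR_RANGES = [
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_xml_char(codepoint):
    """Return True if codepoint matches the XML 1.0 Char production."""
    for low, high in XML_CHAR_RANGES:
        if low <= codepoint <= high:
            return True
    return False


def fits_in_8_bits(codepoint):
    """Return True if codepoint can be held in one 8-bit character."""
    return 0 <= codepoint <= 0xFF


# 0x152 is the code point reported in the pyRXP error message.
codepoint = 0x152
print(is_xml_char(codepoint), fits_in_8_bits(codepoint))
```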

Can anyone confirm whether this is a problem with the ONIX DTD, or a
bug or limitation of the RXP engine used by pyRXP? Similar issues
appear to have been raised in the past with regard to DocBook, with
the solution being to build RXP with Unicode support.

I'd guess that the DTD is being retrieved by the C engine, so Python's
own Unicode support would have no bearing on it. I'd really like to be
able to validate with maximum paranoia, as I'm generating many ONIX
records from untrusted source data.

Minimal example:

import pyRXP
ONIX = u'''<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE ONIXMessage SYSTEM
     "http://www.editeur.org/onix/2.0/reference/onix-international.dtd">
<ONIXMessage></ONIXMessage>
'''
pyRXP.Parser().parse(ONIX)


And the output:

Traceback (most recent call last):
   File "bug.py", line 7, in ?
     pyRXP.Parser().parse(ONIX)
pyRXP.Error: Error: 0x152 is not a valid 8-bit XML character
  in entity "xhtml-special" at line 33 char 25 of 
http://www.editeur.org/onix/2.0/reference/xhtml-special.ent
  in entity "MainModule" at line 2059 char 16 of 
http://www.editeur.org/onix/2.0/reference/onix-international.elt
  in unnamed entity at line 625 char 13 of 
http://www.editeur.org/onix/2.0/reference/onix-international.dtd
error return=1
0x152 is not a valid 8-bit XML character
Parse Failed!

-- 
Stuart Bishop <zen@shangri-la.dropbear.id.au>
http://shangri-la.dropbear.id.au/