[reportlab-users] pyRXP vs ONIX DTD

Thu, 5 Dec 2002 09:25:03 +0000

In article <7F7B96AB-0800-11D7-B573-000393B63DDC@shangri-la.dropbear.id.au>, Stuart Bishop
<zen@shangri-la.dropbear.id.au> writes
>I'm trying to use pyRXP to validate ONIX documents that I am 
>generating. However, I am getting lots of 'not a valid 8-bit XML 
>character' warnings unless I set the IgnoreEntities flag to true. The 
>ONIX DTD looks fine to me, although I'm no expert. The first character 
>that is picked up is "&#338;" , which seems valid to my cursory reading 
>of the XML 1.0 spec.
>
>Can anyone confirm if this is a problem with the ONIX DTD, or a bug or 
>limitation of the RXP engine being used by pyRXP? Similar issues appear 
>to have been raised in the past with regard to Docbook, with the 
>solution being to build RXP with unicode support.
>I'd guess that the DTD is being retrieved by the C engine, so would 
>have no bearing on Python's Unicode support. I'd really like to be able 
>to validate with maximum paranoia, as I'm generating many ONIX records 
>from untrusted source data.
>Minimal example:
...... Well, I think the error message says it all. The document says it's utf-8 and then tries
to expand a non-8-bit char. I suppose we have to say that's impossible.

>>> pyRXP.Parser()("<a>&#255;</a>")
('a', None, ['\xff'], None)
>>> pyRXP.Parser()("<a>&#256;</a>")
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
Error: Error: 0x100 is not a valid 8-bit XML character 
in unnamed entity at line 1 char 10 of [unknown]
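
To make the boundary in the session above concrete, here is a small sketch in
plain Python 3 using only the standard library. It does not call pyRXP, and the
helper names (char_ref_code_point, fits_in_8_bits) are made up for this
illustration. It only shows which numeric character references have a code
point small enough to fit in a single 8-bit character.

```python
import re

# Hypothetical helpers for illustration only; these are not part of pyRXP.
# Matches a numeric character reference such as "&#338;" or "&#x152;".
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def char_ref_code_point(ref):
    """Return the integer code point named by a numeric character reference."""
    m = _CHAR_REF.fullmatch(ref)
    if m is None:
        raise ValueError("not a numeric character reference: %r" % ref)
    hex_digits, dec_digits = m.groups()
    return int(hex_digits, 16) if hex_digits else int(dec_digits)

def fits_in_8_bits(ref):
    """True if the referenced code point is at most 0xFF."""
    return char_ref_code_point(ref) <= 0xFF

# The two references from the session above, plus the two from the DTD.
for ref in ("&#255;", "&#256;", "&#338;", "&#8222;"):
    print(ref, hex(char_ref_code_point(ref)), fits_in_8_bits(ref))
```

Of these, only &#255; passes the check; &#256;, &#338; and &#8222; all name
code points above 0xFF, which is the same limit the error message reports.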

I'm not an expert on XML; that's too big and vague a subject for a simple
person like myself. The poor parser is 8-bit only at present, so it
will not handle entity declarations like

<!ENTITY bdquo "&#8222;"> <!-- double low-9 quotation mark, U+201E NEW -->

I assume there is some universal encoding in which this is understandable.
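
For what it's worth, UTF-8 is one encoding that can represent U+201E. The
following is a minimal Python 3 sketch, independent of pyRXP and of how RXP
itself is built, showing that the character encodes to three bytes in UTF-8
but has no representation in an 8-bit encoding such as Latin-1. The variable
names are only for this example.

```python
# U+201E is the character named by &#8222; in the entity declaration above.
ch = "\u201e"

# UTF-8 needs three bytes for this code point.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes)

# Latin-1 maps each character to a single byte, so anything above U+00FF
# cannot be encoded and Python raises UnicodeEncodeError.
try:
    ch.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```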
-- 
Robin Becker