[reportlab-users] pyRXP vs ONIX DTD
Robin Becker
reportlab-users@reportlab.com
Thu, 5 Dec 2002 09:25:03 +0000
In article <7F7B96AB-0800-11D7-B573-000393B63DDC@shangri-la.dropbear.id.au>, Stuart Bishop
<zen@shangri-la.dropbear.id.au> writes
>I'm trying to use pyRXP to validate ONIX documents that I am
>generating. However, I am getting lots of 'not a valid 8-bit XML
>character' warnings unless I set the IgnoreEntities flag to true. The
>ONIX DTD looks fine to me, although I'm no expert. The first character
>that is picked up is "&#338;", which seems valid to my cursory reading
>of the XML 1.0 spec.
>
>Can anyone confirm if this is a problem with the ONIX DTD, or a bug or
>limitation of the RXP engine being used by pyRXP? Similar issues appear
>to have been raised in the past with regard to Docbook, with the
>solution being to build RXP with unicode support.
>I'd guess that the DTD is being retrieved by the C engine, so would
>have no bearing on Python's Unicode support. I'd really like to be able
>to validate with maximum paranoia, as I'm generating many ONIX records
>from untrusted source data.
>Minimal example:
...... Well, I think the error message says it all. The document says it's UTF-8 and then tries
to expand a character reference that does not fit in 8 bits. I suppose we have to say that's impossible.
>>> pyRXP.Parser()("<a>&#255;</a>")
('a', None, ['\xff'], None)
>>> pyRXP.Parser()("<a>&#256;</a>")
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
Error: Error: 0x100 is not a valid 8-bit XML character
 in unnamed entity at line 1 char 10 of [unknown]
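As a rough way to see in advance which references an 8-bit build will reject, one could scan the text for numeric character references above 0xFF before handing it to the parser. This is only a sketch in plain Python 3 using the standard re module; the helper name find_wide_char_refs is made up for illustration and is not part of pyRXP or RXP, and it looks only at the text it is given, not at entities pulled in from an external DTD.

```python
import re

# Matches decimal (&#338;) and hexadecimal (&#x152;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def find_wide_char_refs(text, limit=0xFF):
    """Return (offset, code_point) for each character reference above limit.

    Hypothetical helper: it approximates the 8-bit check by inspecting the
    references themselves; it does not parse or validate the XML.
    """
    found = []
    for match in _CHAR_REF.finditer(text):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point > limit:
            found.append((match.start(), code_point))
    return found


print(find_wide_char_refs("<a>&#255;</a>"))  # []
print(find_wide_char_refs("<a>&#256;</a>"))  # [(3, 256)]
```

The first document passes because 255 still fits in one byte; the second is flagged at the reference to 0x100, the same character the parser complains about.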
I'm not an expert on XML; that's too big and vague a subject for a simple
person like myself. The poor parser is 8-bit only at present, so it
will not handle entity declarations like
<!ENTITY bdquo "&#8222;"> <!-- double low-9 quotation mark, U+201E NEW -->
I assume there is some universal encoding in which this is understandable.
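UTF-8 is one such encoding. A small sketch, using only Python's built-in string encoding (Python 3 syntax, nothing to do with pyRXP itself), of why U+201E cannot live in a single 8-bit character but is perfectly representable as a multi-byte sequence:

```python
# U+201E (decimal 8222) is the character the bdquo entity expands to.
bdquo = chr(8222)

# Latin-1 is a one-byte-per-character encoding, so anything above 0xFF fails.
try:
    bdquo.encode("latin-1")
    fits_in_8_bits = True
except UnicodeEncodeError:
    fits_in_8_bits = False

# UTF-8 spreads the same code point over three bytes.
utf8_bytes = bdquo.encode("utf-8")

print(fits_in_8_bits)  # False
print(utf8_bytes)      # b'\xe2\x80\x9e'
```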
--
Robin Becker