[reportlab-users] pyRXP vs ONIX DTD

Marius Gedminas reportlab-users@reportlab.com
Thu, 5 Dec 2002 11:59:51 +0200


On Thu, Dec 05, 2002 at 09:25:03AM +0000, Robin Becker wrote:
> ...... Well I think the error message says it all. The document says
> it's utf-8 and then tries to expand a non-8 bit char. I suppose we
> have to say that's impossible.
> 
> >>> pyRXP.Parser()("<a>&#255;</a>")
> ('a', None, ['\xff'], None)
> >>> pyRXP.Parser()("<a>&#256;</a>")
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
> Error: Error: 0x100 is not a valid 8-bit XML character 
> in unnamed entity at line 1 char 10 of [unknown]

Um, as far as I remember the XML spec, numeric character references
(&#NNN;) are independent of the document's encoding and are always
interpreted as Unicode code points.
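
To illustrate, here is a minimal sketch using the expat-based
xml.etree.ElementTree from the Python standard library (not pyRXP, and
assuming a modern Python). The helper name text_of is made up for this
example. The same reference &#256; is parsed under two different declared
encodings and should come out as the same code point both times:

```python
import xml.etree.ElementTree as ET


def text_of(doc_bytes):
    """Parse a small XML document and return the root element's text."""
    return ET.fromstring(doc_bytes).text


# The same numeric character reference under two declared encodings.
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><a>&#256;</a>'
utf8_doc = b'<?xml version="1.0" encoding="utf-8"?><a>&#256;</a>'

latin1_text = text_of(latin1_doc)
utf8_text = text_of(utf8_doc)

# Both should be the single character U+0100, whatever the encoding says.
print(hex(ord(latin1_text)), hex(ord(utf8_text)))
```

The declared encoding only governs how the bytes of the file are decoded;
the number inside &#...; names a Unicode code point directly.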

> I'm not an expert on XML; that's too big and vague a subject for a simple
> person like myself. The poor parser is 8-bit only at present, so it
> will not handle entity declarations like
> 
> <!ENTITY bdquo "&#8222;"> <!-- double low-9 quotation mark, U+201E NEW -->

Then it's not an XML parser, is it?
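
For comparison, a small sketch of how a Unicode-aware parser treats the
declaration quoted above. It again uses xml.etree.ElementTree from the
Python standard library rather than pyRXP, and the one-element document
wrapped around the declaration is made up for the example:

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring the entity from the quoted ONIX line,
# followed by a document that uses it once.
doc = '<!DOCTYPE a [<!ENTITY bdquo "&#8222;">]><a>&bdquo;</a>'

bdquo_text = ET.fromstring(doc).text

# 8222 decimal is 0x201E, DOUBLE LOW-9 QUOTATION MARK.
print(hex(ord(bdquo_text)))
```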

> I assume there is some universal encoding in which this is understandable.

Unicode (the encoding used to represent it doesn't matter -- UTF-8,
UTF-16, UTF-32...).
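
A quick sketch of that point in plain Python (variable names are just for
illustration): the code point U+201E has different byte sequences in each
encoding, yet decodes back to the same character from all of them.

```python
CODE_POINT = 0x201E  # DOUBLE LOW-9 QUOTATION MARK
char = chr(CODE_POINT)

# Encode the one character three ways (big-endian, no byte order mark).
encoded = {
    name: char.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

# Decode each byte sequence back with its own encoding.
decoded = {name: data.decode(name) for name, data in encoded.items()}

for name, data in encoded.items():
    print(name, data.hex(), hex(ord(decoded[name])))
```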

It's been some time since I last read XML specs, so I could be very
wrong, but I seem to remember that XML is basically inseparable from
Unicode.

HTH,
Marius Gedminas
-- 
Hanlon's Razor:
        Never attribute to malice that which is adequately explained
        by stupidity.