[reportlab-users] pyRXP vs ONIX DTD

Marius Gedminas reportlab-users@reportlab.com
Thu, 5 Dec 2002 14:17:22 +0200

On Thu, Dec 05, 2002 at 10:25:59AM +0000, Robin Becker wrote:
> In article <20021205095951.GA26876@codeworks.lt>, Marius Gedminas
> <marius@codeworks.lt> writes
> ......
> >It's been some time since I last read XML specs, so I could be very
> >wrong, but I seem to remember that XML is basically inseparable from
> >Unicode.
> >
> >HTH,
> >Marius Gedminas
> You're probably right, but that's why we have UTF-8, i.e. an 8-bit
> encoding. The right thing to do is to switch this to 16 bit or perhaps
> 32 bit or 64 bit, handle all known BOMs and then watch paint dry. Eight
> bit encodings were always sufficient, but modernists want to use up all
> their new computing power :).

The problem is numeric character references like '&#1234;', right?  So why
not just translate them to UTF-8 strings instead of throwing exceptions?
That assumes pyRXP works with UTF-8 internally; I'm not familiar with it,
and probably should abstain from discussing things I know very little
about.  I only wanted to clarify that &#9999; is OK even when the
document is declared as <?xml version="1.0" encoding="ISO-8859-1"?>:
a character reference always names a Unicode (ISO/IEC 10646) code point,
whatever encoding the document itself is stored in.

See http://www.w3.org/TR/REC-xml#dt-charref
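
A minimal sketch of that translation, in Python with only the standard
library.  It says nothing about how pyRXP works internally: the helper
name expand_charrefs is made up for illustration, and it does not check
the result against the XML Char production.  The second half only shows
that a stock parser (xml.etree.ElementTree here, not pyRXP) accepts
&#9999; in a document declared as ISO-8859-1.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#1234;) and hexadecimal (&#x4D2;) character references.
_CHARREF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def expand_charrefs(text: str) -> str:
    """Replace numeric character references with the characters they name.

    Hypothetical helper for illustration only.  chr() raises ValueError
    for values above 0x10FFFF; no other validation is attempted.
    """
    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code_point)

    return _CHARREF.sub(_replace, text)


# The references become ordinary characters, which encode to UTF-8 bytes.
utf8 = expand_charrefs("price &#9999; &#x4D2;").encode("utf-8")

# A Latin-1 document (note the raw 0xE9 byte for e-acute) that also
# contains a reference to a code point far outside Latin-1.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9 &#9999;</p>'
root = ET.fromstring(doc)
print(repr(root.text))
```

The parser hands back a Unicode string in both cases; what byte encoding
to use for output is then a separate decision.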

> pyRXP is open source so anybody could try and switch it to 16 bit.

(I'm not sure that's a good idea; Unicode is 20.1 bits wide, and UTF-16
combines all the disadvantages of both UTF-8 and UTF-32.)
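
The arithmetic behind that remark can be checked with a short Python
sketch.  The sample characters are my own arbitrary picks; the only
assumption is that the code space ends at U+10FFFF.

```python
import math

# Code points run from U+0000 to U+10FFFF inclusive.
code_points = 0x10FFFF + 1
bits = math.log2(code_points)
print(f"{bits:.3f} bits per code point")

# Arbitrary sample characters: ASCII, Latin-1, BMP, and beyond the BMP.
samples = {
    "A": "A",
    "e-acute": "\u00e9",
    "pencil": "\u270f",
    "g-clef": "\U0001d11e",
}
encodings = ("utf-8", "utf-16-be", "utf-32-be")


def widths(ch: str) -> dict:
    """Return the encoded length in bytes of one character per encoding."""
    return {enc: len(ch.encode(enc)) for enc in encodings}


table = {name: widths(ch) for name, ch in samples.items()}
for name, row in table.items():
    print(name, row)
```

UTF-16 spends two bytes even on ASCII, as UTF-32 spends four, yet it
still needs a surrogate pair (four bytes) for anything beyond U+FFFF, so
it is variable-width just like UTF-8.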

Marius Gedminas
If you can't understand it, it is intuitively obvious.