[reportlab-users] Re: pyRXP vs ONIX DTD (and XHTML)
Marius Gedminas
reportlab-users@reportlab.com
Mon, 9 Dec 2002 12:17:31 +0200
On Mon, Dec 09, 2002 at 12:51:46PM +1100, Stuart Bishop wrote:
> On Saturday, December 7, 2002, at 09:45 PM, Robin Becker wrote:
>
> >the problem here is that I assume we have to know the document is
> >supposed to be in UTF-8. I haven't looked internally at RXP to see if
> >this is easily known to the parser. It seems odd to me that the DTD can
> >be used in any document. Can a utf-8 doc define itself using a unicode
> >DTD? I think this is the one that will appeal to Andy (in his pointy
> >haired manager incarnation), but I would like to get some agreement on
> >when this hack is allowed.
>
> From http://www.w3.org/TR/REC-xml#charencoding, it seems that the
> document and all external documents can each use a different
> encoding, and each is assumed to be UTF-8 or UTF-16 unless otherwise
> stated (it's the parser's job to detect UTF-8 or UTF-16). So you can
> declare an XHTML 1.0 document using us-ascii encoding. I assume this
> also means a validation warning would be raised if you put an œ in
> this us-ascii document (?).
I think not. validator.w3.org accepts the following as valid XHTML:
<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Foo</title>
</head>
<body>
<p>œ</p>
</body>
</html>
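A rough way to try something similar locally is sketched below. It rests on assumptions that are not in the message above: it uses Python's xml.etree.ElementTree, which only checks well-formedness and does not validate against the DTD, and it writes the character as the numeric character reference &#339; (U+0153) so that the bytes of the document really are pure ASCII.

```python
import xml.etree.ElementTree as ET

# Namespace prefix in ElementTree's "{uri}tag" notation.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Assumption: the non-ASCII character is spelled as a numeric character
# reference, so every byte of this us-ascii document is plain ASCII.
doc = (
    b'<?xml version="1.0" encoding="us-ascii"?>\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml">\n'
    b'<head><title>Foo</title></head>\n'
    b'<body><p>&#339;</p></body>\n'
    b'</html>\n'
)

# ElementTree is non-validating: this checks well-formedness only.
root = ET.fromstring(doc)
paragraph_text = root.find(f".//{XHTML_NS}p").text

# The reference names a Unicode code point, independent of the
# encoding declared for the document itself.
print(repr(paragraph_text))
```

The parser hands back a Unicode string for the paragraph even though the declared encoding cannot represent that character directly.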
The way I understand the XML spec, you can have a document composed of a
number of files ("external entities"), and each file can come in a
different charset (and that's not mentioning character references, which
are always specified as Unicode code points). The easiest way to
combine all that into a single document is to convert everything into
some encoding of Unicode.
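As a minimal sketch of that idea (not how RXP or pyRXP actually does it), the Python below decodes three hypothetical entity fragments, each in its own charset, into Unicode strings and then resolves numeric character references. The fragment contents and the helper name resolve_char_refs are made up for illustration.

```python
import re

# Matches numeric character references: &#339; or &#x153;
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def resolve_char_refs(text: str) -> str:
    """Replace numeric character references with the code points they name."""
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        return chr(int(dec_digits))
    return CHAR_REF.sub(replace, text)


# Hypothetical external entities: (raw bytes, declared charset).
fragments = [
    ("caf\u00e9 ".encode("iso-8859-1"), "iso-8859-1"),
    ("\u0153uvre ".encode("utf-8"), "utf-8"),
    (b"&#339;il", "us-ascii"),
]

# Decode each fragment with its own charset, then join as Unicode.
combined = resolve_char_refs(
    "".join(raw.decode(charset) for raw, charset in fragments)
)

print(combined)
```

Once everything is a Unicode string, the original charsets no longer matter, and the result can be written out in whatever single encoding is wanted, for example combined.encode("utf-8").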
Marius Gedminas