[reportlab-users] Re: pyRXP vs ONIX DTD (and XHTML)

Marius Gedminas reportlab-users@reportlab.com
Mon, 9 Dec 2002 12:17:31 +0200


On Mon, Dec 09, 2002 at 12:51:46PM +1100, Stuart Bishop wrote:
> On Saturday, December 7, 2002, at 09:45  PM, Robin Becker wrote:
> 
> >the problem here is that I assume we have to know the document is
> >supposed to be in UTF-8. I haven't looked internally at RXP to see if
> >this is easily know to the parser. It seems odd to me that the DTD can
> >be used in any document. Can a utf-8 doc define itself using a unicode
> >DTD? I think this is the one that will appeal to Andy(in his pointy
> >haired manager incarnation), but I would like to get some agreement on
> >when this hack is allowed.
> 
> From http://www.w3.org/TR/REC-xml#charencoding, it seems that the
> document and all external documents can each use a different
> encoding, and is assumed to be UTF-8 or UTF-16 unless otherwise
> stated (its the parsers job to detect UTF-8 or UTF-16). So you can
> declare a XHTML 1.0 document using us-ascii encoding. I assume this
> also means a validation warning would be raised if you put an œ in
> this us-ascii document (?).

I think not.  validator.w3.org accepts the following as valid XHTML:

  <?xml version="1.0" encoding="us-ascii"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <title>Foo</title>
  </head>
  <body>
  <p>&oelig;</p>
  </body>
  </html>

As I understand the XML spec, a document can be composed of a number of
files ("external entities"), and each file can use a different charset
(not to mention character references, which are always specified as
Unicode code points).  The easiest way to combine all that into a single
document is to convert everything into some encoding of Unicode.
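
To illustrate, here is a rough sketch in Python using the standard
library's xml.etree.ElementTree (expat underneath), not pyRXP.  The two
byte strings and the text_of() helper are made up for the example: one
document is declared us-ascii and carries a numeric character reference,
the other is declared iso-8859-1 and carries a raw non-ASCII byte.  Both
come out of the parser as ordinary Unicode strings and can be combined
freely.

```python
import xml.etree.ElementTree as ET

# Hypothetical document declared as us-ascii.  The non-ASCII character is
# written as a character reference, i.e. as a Unicode code point (U+0153).
ascii_doc = b'<?xml version="1.0" encoding="us-ascii"?><p>&#339;</p>'

# Hypothetical second document stored in a different charset (ISO-8859-1),
# containing a raw 0xE9 byte for e-acute.
latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")


def text_of(xml_bytes):
    """Parse one document and return the root element's text as Unicode."""
    return ET.fromstring(xml_bytes).text


# After parsing, the source encodings no longer matter: both pieces are
# Unicode strings and can simply be concatenated.
combined = text_of(ascii_doc) + " " + text_of(latin1_doc)
print(repr(combined))
```

This only covers character references and per-document encoding
declarations; it does not load a DTD, so named entities such as &oelig;
are outside what the sketch demonstrates.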

Marius Gedminas
-- 
1 4m 5o 3l337! just got r00t on this <a href="127.0.0.1">k3wl site</a> j00
sux0r5!