[reportlab-users] Re: pyRXP vs ONIX DTD (and XHTML)

Marius Gedminas reportlab-users@reportlab.com
Mon, 9 Dec 2002 12:17:31 +0200

On Mon, Dec 09, 2002 at 12:51:46PM +1100, Stuart Bishop wrote:
> On Saturday, December 7, 2002, at 09:45  PM, Robin Becker wrote:
> >the problem here is that I assume we have to know the document is
> >supposed to be in UTF-8. I haven't looked internally at RXP to see if
> >this is easily known to the parser. It seems odd to me that the DTD can
> >be used in any document. Can a utf-8 doc define itself using a unicode
> >DTD? I think this is the one that will appeal to Andy (in his pointy
> >haired manager incarnation), but I would like to get some agreement on
> >when this hack is allowed.
> From http://www.w3.org/TR/REC-xml#charencoding, it seems that the
> document and all external documents can each use a different
> encoding, and each is assumed to be UTF-8 or UTF-16 unless otherwise
> stated (it's the parser's job to detect UTF-8 or UTF-16). So you can
> declare an XHTML 1.0 document using us-ascii encoding. I assume this
> also means a validation warning would be raised if you put an œ in
> this us-ascii document (?).

I think not.  validator.w3.org accepts the following as valid XHTML:

  <?xml version="1.0" encoding="us-ascii"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  <html xmlns="http://www.w3.org/1999/xhtml">

The way I understand the XML spec, you can have a document composed from
a number of files ("external entities"), and each file can come in a
different charset (and that's not mentioning character references, which
are always specified as Unicode code points).  The easiest way to
combine all that into a single document is to convert everything into
some encoding of Unicode.
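
To make the paragraph above concrete, here is a minimal sketch using
Python's standard xml.etree.ElementTree (expat underneath), not pyRXP.
The sample documents and the helper names parse_text and is_well_formed
are made up for illustration.  It only checks well-formedness, since
ElementTree does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Declared as us-ascii.  The character reference &#339; names the Unicode
# code point U+0153, independent of the declared encoding.
ASCII_DOC = b'<?xml version="1.0" encoding="us-ascii"?><p>c&#339;ur</p>'

# A second "file" in a different encoding: 0xE9 is e-acute in ISO-8859-1.
LATIN1_DOC = b'<?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9</p>'

# Raw UTF-8 bytes for U+0153 inside a document that claims to be us-ascii.
BAD_DOC = b'<?xml version="1.0" encoding="us-ascii"?><p>c\xc5\x93ur</p>'


def parse_text(raw: bytes) -> str:
    """Parse one document and return the root element's text as str."""
    return ET.fromstring(raw).text


def is_well_formed(raw: bytes) -> bool:
    """Return True if the parser accepts the bytes, False otherwise."""
    try:
        ET.fromstring(raw)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Both documents end up as ordinary Unicode strings, so text from
    # differently encoded sources can be combined freely.
    combined = [parse_text(ASCII_DOC), parse_text(LATIN1_DOC)]
    print([ascii(t) for t in combined])
    print(is_well_formed(BAD_DOC))
```

The point of the sketch is that the declared encoding only governs how
the bytes of one file are decoded.  Once parsed, everything is Unicode,
and a character reference can name a code point that the file's own
encoding could not represent directly.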

Marius Gedminas
1 4m 5o 3l337! just got r00t on this <a href="">k3wl site</a> j00