[reportlab-users] Re: pyRXP vs ONIX DTD (and XHTML)

Stuart Bishop reportlab-users@reportlab.com
Mon, 9 Dec 2002 12:51:46 +1100


On Saturday, December 7, 2002, at 09:45  PM, Robin Becker wrote:

> the problem here is that I assume we have to know the document is
> supposed to be in UTF-8. I haven't looked internally at RXP to see if
> this is easily known to the parser. It seems odd to me that the DTD can
> be used in any document. Can a utf-8 doc define itself using a unicode
> DTD? I think this is the one that will appeal to Andy (in his pointy-
> haired manager incarnation), but I would like to get some agreement on
> when this hack is allowed.

From http://www.w3.org/TR/REC-xml#charencoding, it seems that the
document and each external entity it references can use a different
encoding, and each is assumed to be UTF-8 or UTF-16 unless otherwise
stated (it's the parser's job to detect which of those two it is). So
you can declare an XHTML 1.0 document using us-ascii encoding. I assume
this also means a validation warning would be raised if you put an œ in
this us-ascii document (?).
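
The autodetection step is just a matter of sniffing the first few bytes
of the entity. Something along these lines (a rough sketch of the rule
in Appendix F of the spec, not what RXP actually does internally):

def sniff_encoding(raw):
    # Rough guess at UTF-8 vs UTF-16 from the first bytes of an XML
    # entity: byte order marks first, then the byte patterns of '<?'
    # when there is no byte order mark.
    if raw[:2] in ('\xfe\xff', '\xff\xfe'):      # UTF-16 byte order mark
        return 'UTF-16'
    if raw[:3] == '\xef\xbb\xbf':                # UTF-8 byte order mark
        return 'UTF-8'
    if raw[:4] in ('\x00<\x00?', '<\x00?\x00'):  # '<?' in BOM-less UTF-16
        return 'UTF-16'
    return 'UTF-8'                               # the spec's default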

The internal UTF-8 idea may work, but would involve first translating
the input document & external files into UTF-8, with all &# encodings
expanded (eoCB could do this...). And if this is done, the rest of the
system would remain unmodified and return the parsed document
as UTF-8 strings (either returned as Python Unicode strings by the
glue, or left up to the application to call codecs.utf_8_decode on them
if they think they need to handle extended characters). The trick would
be not expanding any &# sequences that happen to be inside a CDATA
section.
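
A hypothetical helper for that last part could split on the CDATA markers
and only run the expansion over the pieces outside them, something like:

import re

_cdata_re = re.compile(r'(<!\[CDATA\[.*?\]\]>)', re.DOTALL)

def expand_outside_cdata(text, expand):
    # expand() would be an &#...; expander, e.g. the ext_re.sub() call
    # in the script below; CDATA sections are passed through untouched.
    parts = _cdata_re.split(text)
    out = []
    for i in range(len(parts)):
        if i % 2:                # odd slots are the captured CDATA sections
            out.append(parts[i])
        else:
            out.append(expand(parts[i]))
    return ''.join(out)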

Here is a proof of concept of pyRXP running internally in UTF-8 mode,
done by abusing the eoCB callback. It needs a test suite to ensure that
there isn't some flaw in my reasoning :-). Memory consumption should be
the same. The pre-RXP munging may slow things down enough to make it
quicker to just run RXP in 16-bit mode though - I guess this depends
on the size of your source file.

import pyRXP
import urllib,urlparse
import sys,os,tempfile,re,traceback
from pprint import pprint

XHTML=u'''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head>
<body><p>&OElig; &amp; \u0152 are not ASCII</p></body></html>
'''

class _TempFile:
     _files = []
     def new(self,prefix=''):
         name = tempfile.mktemp('%s-utf8' % prefix)
         self._files.append(name)
         return name

     def __del__(self):
         for f in self._files:
             try:
                 os.unlink(f)
             except:
                 pass

tempmaker = _TempFile()

# BUG: This re won't work if the source document is encoded as UTF-16
#      or other encodings that are not ASCII supersets, in case anyone
#      is actually silly enough to do this. These encodings will need to
#      be detected using magic as per XML 1.0 spec.
enc_re = re.compile(r'\s*<\?xml[^>]+encoding=["\']([^"\']+)["\'][^>]*\?>')
ext_re = re.compile(r"&#(x?[\da-fA-F]+?);")

# BUG: Handling of DTD stored relative to cwd untested
base_url = 'file://' + os.getcwd() + '/'

# BUG: Abusing eoCB to munge external documents causes validation error
#      messages to report stupid file names
def eoCB(url):
     # Munge the external thingy and cache the munged version
     #
     try:
         # handle relative URL's
         split = urlparse.urlsplit(url)
         global base_url
         if not split[0]:
             url = urlparse.urljoin(base_url,url)
         else:
             base_url = url

         # First suck in the data and convert it to a Python Unicode string
         data = urllib.urlopen(url).read()
         match = enc_re.match(data)
         if match:
             encoding = match.group(1)
         else:
             encoding = 'UTF-8'
         data = data.decode(encoding) # Now a Unicode string

         # Now expand encoded characters as per XML spec
         # BUG: encoded characters in CDATA sections and possibly elsewhere
         #      should not be expanded
         def extrepl(match):
             m = match.group(1)
             if m[0] == 'x':
                 m = int(m[1:],16)
             else:
                 m = int(m)

             # BUG: doesn't detect invalid entities.
             #      Need to ensure  Char ::= #x9 | #xA | #xD |
             #        [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
             if m >= 128:
                 return unichr(m)
             else:
                 # Pass through encoded ASCII characters, as these may be
                 # required to be encoded due to XML restrictions (eg. the
                 # quote character)
                 return '&#%s;' % match.group(1)
         data = ext_re.sub(extrepl,data)

         t = tempmaker.new(url.split('/')[-1])
         f = open(t,'w')
         f.write(data.encode('UTF-8'))
         f.close()
         return t
     except:
         print 'Exception in eoCB:'
         traceback.print_exc(file=sys.stdout)


# BUG: Source document needs to be converted from its native charset to UTF-8,
#      and &#xxxx; entity expansion done.
parser = pyRXP.Parser(eoCB=eoCB,XMLPredefinedEntities=0)
result = parser.parse(XHTML.encode('utf-8'))
pprint(result) # Strings in result can be decoded using foo.decode('utf-8')
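
For what it's worth, the application side could rebuild a Unicode tree
from that result with something like the following (just a sketch, and it
assumes pyRXP's usual (tagName, attributes, children, spare) tuple layout
with plain strings as the text nodes):

def decode_tree(node):
    # Recursively turn the UTF-8 byte strings in a pyRXP result tree
    # into Python Unicode strings.
    if isinstance(node, str):           # a text node
        return node.decode('utf-8')
    name, attrs, children, spare = node
    name = name.decode('utf-8')
    if attrs is not None:
        attrs = dict([(k.decode('utf-8'), v.decode('utf-8'))
                      for k, v in attrs.items()])
    if children is not None:
        children = [decode_tree(child) for child in children]
    return (name, attrs, children, spare)

unicode_result = decode_tree(result)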


>>        - Run RXP in 'ExpandCharacterEntities=0' mode, and have
>>          the glue validate and expand the character entities itself.
>>          Probably a performance nightmare.
>>
> this is what we have been doing, but we are passing the XML fragments
> around into other bits of XML.

I can do this too for my application - I was hoping to catch the case of
someone including a dodgy entity like '&#0000;', '&#xfffe;' or '&#x800;',
but I doubt this will actually matter in the real world :-)
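
If it ever does matter, the check itself is tiny; here is a sketch of a
test for the Char production quoted in the BUG comment above (a
hypothetical helper, not something pyRXP provides):

def is_valid_xml_char(codepoint):
    # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    #          | [#x10000-#x10FFFF]
    return (codepoint in (0x9, 0xA, 0xD)
            or 0x20 <= codepoint <= 0xD7FF
            or 0xE000 <= codepoint <= 0xFFFD
            or 0x10000 <= codepoint <= 0x10FFFF)

# e.g. is_valid_xml_char(0x0) and is_valid_xml_char(0xFFFE) are both false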

-- 
Stuart Bishop <zen@shangri-la.dropbear.id.au>
http://shangri-la.dropbear.id.au/