[reportlab-users] Re: pyRXP vs ONIX DTD (and XHTML)
Mon, 9 Dec 2002 12:51:46 +1100
On Saturday, December 7, 2002, at 09:45 PM, Robin Becker wrote:
> the problem here is that I assume we have to know the document is
> supposed to be in UTF-8. I haven't looked internally at RXP to see if
> this is easily known to the parser. It seems odd to me that the DTD can
> be used in any document. Can a utf-8 doc define itself using a unicode
> DTD? I think this is the one that will appeal to Andy (in his pointy
> haired manager incarnation), but I would like to get some agreement on
> when this hack is allowed.
From http://www.w3.org/TR/REC-xml#charencoding, it seems that the
document and all external documents can each use a different
encoding, and each is assumed to be UTF-8 or UTF-16 unless otherwise
stated (it's the parser's job to detect UTF-8 or UTF-16). So you can
have an XHTML 1.0 document using us-ascii encoding. I assume this also means
a validation warning would be raised if you put an œ in this document.
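As a rough illustration of the detection rule in the spec section cited
above, here is a minimal sketch in Python 3. The function name
sniff_xml_encoding is hypothetical and not part of pyRXP or RXP. It only
covers byte order marks for UTF-8 and UTF-16 plus an explicit encoding
declaration in an ASCII-compatible encoding; the spec's full
autodetection (UTF-32, EBCDIC, UTF-16 without a BOM) is left out.

```python
import codecs
import re

# XML declaration at the very start, with either quote style around the name.
_DECL_RE = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document or external entity."""
    # A byte order mark settles the question for UTF-16 and UTF-8.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Otherwise look for an explicit declaration.
    match = _DECL_RE.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: assume UTF-8.
    return "utf-8"
```

Something like data.decode(sniff_xml_encoding(data)) would then give a
Unicode string for each file independently of the others.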
The internal UTF-8 idea may work, but would involve first translating
the input document & external files into UTF-8, with all &# encodings
expanded (eoCB could do this...). And if this is done, the rest of the
system would remain unmodified and return the parsed document
as UTF-8 strings (either returned as Python Unicode strings by the
glue, or left up to the application to call codecs.utf_8_decode on them
if they think they need to handle extended characters). The trick would
be not expanding any &# sequences that happen to be inside a CDATA section.
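One possible way to do the expansion while skipping CDATA sections is a
single regular expression with two alternatives, sketched below in
Python 3. The name expand_char_refs is hypothetical. Like the proof of
concept, it leaves ASCII references encoded, and it does not validate
the code points or handle comments and processing instructions.

```python
import re

# First alternative captures a whole CDATA section, second a numeric reference.
_TOKEN_RE = re.compile(
    r"(<!\[CDATA\[.*?\]\]>)|&#(x[0-9a-fA-F]+|[0-9]+);", re.DOTALL
)


def expand_char_refs(text):
    """Expand non-ASCII numeric character references outside CDATA sections."""

    def replace(match):
        if match.group(1) is not None:
            # Inside CDATA the text is literal, so return it unchanged.
            return match.group(1)
        ref = match.group(2)
        code = int(ref[1:], 16) if ref.startswith("x") else int(ref)
        if code < 128 or code > 0x10FFFF:
            # Keep ASCII references (they may need to stay encoded) and
            # anything outside the Unicode range as written.
            return match.group(0)
        return chr(code)

    return _TOKEN_RE.sub(replace, text)
```

Because the CDATA alternative comes first and consumes the whole
section, any &# sequence inside it is never seen by the second
alternative.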
Here is a proof of concept of pyRXP running internally in UTF-8 mode,
done by abusing the eoCB callback. It needs a test suite to ensure that
there isn't some flaw in my reasoning :-). Memory consumption should be
the same. The pre-RXP munging may slow things down enough to
make it quicker to just run RXP in 16 bit mode though - I guess this
depends on the size of your source file.
from pprint import pprint
XHTML=u'''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
<body><p>Œ & \u0152 are not ASCII</p></body></html>
_files = 
name = tempfile.mktemp('%s-utf8' % prefix)
for f in self._files:
tempmaker = _TempFile()
# BUG: This re won't work if the source document is encoded as UTF-16
# or other encodings that are not ASCII supersets, in case
# is actually silly enough to do this. These encodings will
# be detected using magic as per XML 1.0 spec.
enc_re = re.compile("\s*<\?xml[^>]+encoding='([^']+)'[^>]*\?>")
ext_re = re.compile("&#(x?[\da-zA-Z]+?);")
# BUG: Handling of DTD stored relative to cwd untested
base_url = 'file://' + '/'.join(os.path.split(os.getcwd()))
# BUG: Abusing eoCB to munge external documents causes validation error
# messages to report stupid file names
# Munge the external thingy and cache the munged version
# handle relative URL's
split = urlparse.urlsplit(url)
if not split:
url = urlparse.urljoin(base_url,url)
base_url = url
# First suck in the data and convert it to a Python Unicode
data = urllib.urlopen(url).read()
match = enc_re.match(data)
encoding = match.group(1)
encoding = 'UTF-8'
data = data.decode(encoding) # Now a Unicode string
# Now expand encoded characters as per XML spec
# BUG: encoded characters in CDATA sections and possibly
# should not be expanded
m = match.group(1)
if m[0] == 'x':
    m = int(m[1:],16)
else:
    m = int(m)
# BUG: doesn't detect invalid entities.
# Need to ensure Char ::= #x9 | #xA | #xD |
# [#x20-#xD7FF] | [#xE000-#xFFFD] |
if m >= 128:
# Pass through encoded ASCII characters, as these may be
# required to be encoded due to XML restrictions (eg.
return '&#%s;' % match.group(1)
data = ext_re.sub(extrepl,data)
t = tempmaker.new(url.split('/')[-1])
f = open(t,'w')
print 'Exception in eoCB:'
# BUG: Source document needs to be converted from its native charset to
# and &#xxxx; entity expansion done.
parser = pyRXP.Parser(eoCB=eoCB,XMLPredefinedEntities=0)
result = parser.parse(XHTML.encode('utf-8'))
pprint(result) # Strings in result can be decoded using
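For the application side, decoding the UTF-8 strings in the result could
look roughly like the Python 3 sketch below. It assumes the usual pyRXP
tuple tree layout of (tagName, attributes, children, spare); the helper
name decode_tree is hypothetical, and the sample tree is built by hand
rather than taken from real parser output.

```python
def decode_tree(node, encoding="utf-8"):
    """Recursively decode byte strings in a pyRXP-style tuple tree."""
    if isinstance(node, bytes):
        return node.decode(encoding)
    if not isinstance(node, tuple):
        return node  # None, or text that is already decoded
    name, attrs, children, spare = node
    if attrs is not None:
        attrs = {
            decode_tree(key, encoding): decode_tree(value, encoding)
            for key, value in attrs.items()
        }
    if children is not None:
        children = [decode_tree(child, encoding) for child in children]
    return (decode_tree(name, encoding), attrs, children, spare)


# Hand-built sample: a <p> element whose text starts with U+0152 in UTF-8.
sample = (b"p", None, [b"\xc5\x92 is not ASCII"], None)
decoded = decode_tree(sample)
```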
>> - Run RXP in 'ExpandCharacterEntities=0' mode, and have
>> the glue validate and expand the character entities itself.
>> Probably a performance nightmare.
> this is what we have been doing, but we are passing the XML fragments
> around into other bits of XML.
I can do this too for my application - I was hoping to catch the case of
someone including a dodgy entity like '�', '' or
but I doubt this will actually matter in the real world :-)
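If the glue did want to reject dodgy character references, the check
itself is small. This Python 3 sketch tests a code point against the
Char production of XML 1.0; the function name is_xml_char is
hypothetical, and how a failure gets reported is left open.

```python
def is_xml_char(code):
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        code in (0x9, 0xA, 0xD)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
        or 0x10000 <= code <= 0x10FFFF
    )
```

A reference such as one to code point 0, to a surrogate, or to 0xFFFF
would fail this test, while the U+0152 used in the proof of concept
passes.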
Stuart Bishop <firstname.lastname@example.org>