[reportlab-users] Re: pyRXP vs ONIX DTD (and XHTML)
Stuart Bishop
reportlab-users@reportlab.com
Mon, 9 Dec 2002 12:51:46 +1100
On Saturday, December 7, 2002, at 09:45 PM, Robin Becker wrote:
> the problem here is that I assume we have to know the document is
> supposed to be in UTF-8. I haven't looked internally at RXP to see if
> this is easily known to the parser. It seems odd to me that the DTD can
> be used in any document. Can a UTF-8 doc define itself using a Unicode
> DTD? I think this is the one that will appeal to Andy (in his pointy
> haired manager incarnation), but I would like to get some agreement on
> when this hack is allowed.
From http://www.w3.org/TR/REC-xml#charencoding, it seems that the
document and each external entity can use a different encoding, and
each is assumed to be UTF-8 or UTF-16 unless otherwise stated (it's
the parser's job to detect which of the two it is). So you can declare
an XHTML 1.0 document using us-ascii encoding. I assume this also means
a validation warning would be raised if you put a literal 'œ' in this
us-ascii document (?).
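For reference, the autodetection the spec describes boils down to
sniffing the byte order mark and the first few bytes of the
declaration. An untested sketch of my own, covering only the
UTF-8/UTF-16 cases the spec requires a parser to handle:

def guess_encoding(data):
    # Sketch per XML 1.0 Appendix F; a real parser checks more cases.
    if data[:2] in ('\xfe\xff', '\xff\xfe'):
        return 'UTF-16'   # byte order mark present
    if data[:4] in ('\x00<\x00?', '<\x00?\x00'):
        return 'UTF-16'   # '<?' encoded in UTF-16, no BOM
    return 'UTF-8'        # the no-information default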
The internal UTF-8 idea may work, but would involve first translating
the input document & external files into UTF-8, with all &#...;
references expanded (eoCB could do this...). And if this is done, the
rest of the system would remain unmodified and return the parsed
document as UTF-8 strings (either returned as Python Unicode strings by
the glue, or left up to the application to call codecs.utf_8_decode on
them if it thinks it needs to handle extended characters). The trick
would be not expanding any &#...; sequences that happen to be inside a
CDATA section.
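One way to handle that trick - an untested sketch of my own, not what
the proof of concept below does - is to split on the CDATA delimiters
and only run the expansion on the pieces outside them:

import re
cdata_re = re.compile(r'(<!\[CDATA\[.*?\]\]>)', re.DOTALL)
def expand_outside_cdata(data, repl):
    # CDATA sections cannot nest, so a non-greedy split is safe.
    parts = cdata_re.split(data)
    # Even-numbered parts are ordinary text; odd-numbered parts are
    # the captured CDATA sections, which pass through untouched.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r'&#(x?[0-9a-fA-F]+);', repl, parts[i])
    return ''.join(parts)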
Here is a proof of concept of pyRXP running internally in UTF-8 mode,
done by abusing the eoCB callback. It needs a test suite to ensure that
there isn't some flaw in my reasoning :-). Memory consumption should be
the same. The pre-RXP munging may slow things down enough to make it
quicker to just run RXP in 16-bit mode though - I guess this depends on
the size of your source file.
import pyRXP
import urllib,urlparse
import sys,os,tempfile,re,traceback
from pprint import pprint
XHTML=u'''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head>
<body><p>&#x152; &amp; \u0152 are not ASCII</p></body></html>
'''
class _TempFile:
    _files = []
    def new(self,prefix=''):
        name = tempfile.mktemp('%s-utf8' % prefix)
        self._files.append(name)
        return name
    def __del__(self):
        for f in self._files:
            try:
                os.unlink(f)
            except:
                pass
tempmaker = _TempFile()

# BUG: This re won't work if the source document is encoded as UTF-16
#      or other encodings that are not ASCII supersets, in case anyone
#      is actually silly enough to do this. These encodings will need
#      to be detected using magic as per the XML 1.0 spec.
enc_re = re.compile("\s*<\?xml[^>]+encoding='([^']+)'[^>]*\?>")
ext_re = re.compile("&#(x?[\da-zA-Z]+?);")

# BUG: Handling of DTD stored relative to cwd untested
base_url = 'file://' + '/'.join(os.path.split(os.getcwd()))

# BUG: Abusing eoCB to munge external documents causes validation error
#      messages to report stupid file names
def eoCB(url):
    # Munge the external thingy and cache the munged version
    try:
        # handle relative URL's
        split = urlparse.urlsplit(url)
        global base_url
        if not split[0]:
            url = urlparse.urljoin(base_url,url)
        else:
            base_url = url
        # First suck in the data and convert it to a Python Unicode string
        data = urllib.urlopen(url).read()
        match = enc_re.match(data)
        if match:
            encoding = match.group(1)
        else:
            encoding = 'UTF-8'
        data = data.decode(encoding) # Now a Unicode string
        # Now expand encoded characters as per the XML spec
        # BUG: encoded characters in CDATA sections and possibly
        #      elsewhere should not be expanded
        def extrepl(match):
            m = match.group(1)
            if m[0] == 'x':
                m = int(m[1:],16)
            else:
                m = int(m)
            # BUG: doesn't detect invalid entities.
            #      Need to ensure Char ::= #x9 | #xA | #xD |
            #      [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
            if m >= 128:
                return unichr(m)
            else:
                # Pass through encoded ASCII characters, as these may be
                # required to be encoded due to XML restrictions (eg.
                # quote character)
                return '&#%s;' % match.group(1)
        data = ext_re.sub(extrepl,data)
        t = tempmaker.new(url.split('/')[-1])
        f = open(t,'w')
        f.write(data.encode('UTF-8'))
        f.close()
        return t
    except:
        print 'Exception in eoCB:'
        traceback.print_exc(file=sys.stdout)

# BUG: Source document needs to be converted from its native charset to
#      UTF-8, and &#xxxx; entity expansion done.
parser = pyRXP.Parser(eoCB=eoCB,XMLPredefinedEntities=0)
result = parser.parse(XHTML.encode('utf-8'))
pprint(result) # Strings in result can be decoded using foo.decode('utf-8')
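To get a fully Unicode tree back out, something like this untested
sketch would do, assuming the usual pyRXP result shape of
(tagName, attributes, children, spare) tuples with text nodes as plain
strings:

def decode_tree(node):
    # Recursively turn the UTF-8 byte strings handed back by the
    # parser into Python Unicode strings.
    if isinstance(node, str):
        return node.decode('utf-8')
    name, attrs, children, spare = node
    if attrs:
        attrs = dict([(k.decode('utf-8'), v.decode('utf-8'))
                      for k, v in attrs.items()])
    if children is not None:
        children = [decode_tree(child) for child in children]
    return (name.decode('utf-8'), attrs, children, spare)

unicode_result = decode_tree(result)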
>> - Run RXP in 'ExpandCharacterEntities=0' mode, and have
>> the glue validate and expand the character entities itself.
>> Probably a performance nightmare.
>>
> this is what we have been doing, but we are passing the XML fragments
> around into other bits of XML.
I can do this too for my application - I was hoping to catch the case of
someone including a dodgy character entity (an unpaired surrogate, a
stray control character, or a code point outside the XML Char
production), but I doubt this will actually matter in the real world :-)
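For completeness, the check the BUG comment in extrepl asks for is only
a few lines - extrepl could raise an error, or substitute U+FFFD, when
this returns false:

def is_xml_char(cp):
    # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
    #          [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)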
--
Stuart Bishop <zen@shangri-la.dropbear.id.au>
http://shangri-la.dropbear.id.au/