[reportlab-users] utf-8 characters

Marius Gedminas reportlab-users@reportlab.com
Mon, 3 May 2004 12:34:18 +0300


On Mon, May 03, 2004 at 08:34:01AM +0100, Chris Withers wrote:
> David Bourillot wrote:
> >Exception type: exceptions.UnicodeDecodeError
> >Exception message: 'utf8' codec can't decode byte 0xc3 in position 9:
> >unexpected end of data
> >
> >After a little investigation, it seems to me that when the string is
> >split, it's cut between the two bytes of the encoded character 'à'
>
> That would seem unlikely, but maybe ask on the python list for confirmation.
>
> Could it be that you have non-UTF-8 data in your UTF-8 string?

I'm pretty sure the problem is in the line wrapping algorithm used by
Platypus.
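
To illustrate the kind of failure David describes, here is a minimal sketch
in present-day Python 3. It is not the Platypus wrapping code: the helper
naive_byte_wrap and the sample string are hypothetical, chosen so that the
two-byte sequence for 'à' starts at byte 9 and a cut at byte 10 lands in the
middle of it.

```python
def naive_byte_wrap(data, width):
    """Cut a byte string every `width` bytes, ignoring character boundaries."""
    return [data[i:i + width] for i in range(0, len(data), width)]


# Hypothetical sample text; 'à' is encoded as the two bytes 0xC3 0xA0.
encoded = "Bonjour, à tous".encode("utf-8")
chunks = naive_byte_wrap(encoded, 10)

error = None
try:
    # The first chunk ends with a lone 0xC3 lead byte.
    chunks[0].decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

Each chunk is a valid slice of bytes, but the first one is no longer valid
UTF-8 on its own, which matches the symptom in the traceback.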

There have been plans to ditch Python 1.5.2 support and switch to
unicode objects instead of str objects with UTF-8 data everywhere.
When this is done, this problem will disappear, as there's no way to
split a unicode string incorrectly [1][2].

  [1] AFAIU Python does not use UTF-16 surrogate pairs, right?  If you
      want to use characters outside the BMP, you're supposed to compile
      your Python interpreter with 32-bit Unicode support.

  [2] There are also combining characters that might pose problems with
      line wrapping.  And I'm not talking about BiDi or other exotic
      things that Reportlab does not support yet.
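
A rough sketch of both points, again in present-day Python 3 (where str is
already a Unicode string) and with hypothetical helper names. Slicing by
code point never produces text that fails to encode, but as footnote [2]
warns, it can still separate a base letter from its combining mark. The
second helper is one possible workaround, not something Reportlab is known
to do: it extends a chunk so that a combining mark never starts the next one.

```python
import unicodedata


def wrap_code_points(text, width):
    """Cut a Unicode string every `width` code points."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def wrap_keeping_marks(text, width):
    """Like wrap_code_points, but keep combining marks with their base.

    A chunk may exceed `width` by the number of trailing combining marks.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        # unicodedata.combining() returns a non-zero class for combining marks.
        while end < len(text) and unicodedata.combining(text[end]):
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks


# Hypothetical sample: 'a' followed by U+0300 COMBINING GRAVE ACCENT.
decomposed = "voila\u0300 tout"

plain = wrap_code_points(decomposed, 5)
careful = wrap_keeping_marks(decomposed, 5)

# Every chunk still encodes cleanly, unlike a cut made on raw UTF-8 bytes.
for chunk in plain:
    chunk.encode("utf-8")

print([ascii(c) for c in plain])
print([ascii(c) for c in careful])
```

The first result has a chunk that begins with the bare accent; the second
keeps the accent attached to its letter.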

Marius Gedminas
-- 
Stupidity management for the superuser is a user space issue in Unix
systems.
		-- Alan Cox
