[reportlab-users] [Bitbucket] Issue #24: Missing glyphs from embedded emoji font (Symbola) (rptlab/reportlab)

Marius Gedminas marius at gedmin.as
Tue Feb 25 17:14:51 EST 2014


On Tue, Feb 25, 2014 at 02:09:39PM +0000, Robin Becker wrote:

> to.textLine(u"Unicode symbols: \u02a4\U0001F631\U0001F64C\U0001F44C")

...

> I ran this on Windows with Python 3.3.3. The result shows the dz
> character \u02a4, but not the three astral plane emoji characters;
> they appear as ? chars.
>
> I looked into the PDF produced by reportlab. We appear to be
> creating the subset map correctly; at the end of the definitions I
> see this
>
> <7F> <007F>
> <80> <02A4>
> <81> <1F631>
> <82> <1F64C>
> <83> <1F44C>
>
> so we've seen those characters and allegedly created glyphs for
> them. In the body of the document I see this
>
> (Unicode symbols: \200\201\202\203) Tj T* (UTF8 symbols: \200\201\202\203) Tj T* ET
>
> So we're using the octal escapes for 0x80 0x81 0x82 0x83 in the
> string. From this I can only deduce that either we are failing in
> the glyph creation stage somewhere (i.e. when building the subset the
> glyph lookup fails) or we're building the subset correctly and
> Acrobat fails to deliver. I suspect the former.


That's very plausible.
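
For what it's worth, the reading of the content stream above can be checked mechanically. This is only a rough sketch: the subset map is copied by hand from the quoted listing, and decode_octal_escapes is a made-up helper name, not anything in ReportLab.

```python
import re

# Subset code -> Unicode code point, copied by hand from the quoted map.
subset_map = {0x7F: 0x007F, 0x80: 0x02A4, 0x81: 0x1F631,
              0x82: 0x1F64C, 0x83: 0x1F44C}

def decode_octal_escapes(pdf_string):
    """Return the byte values of backslash-octal escapes in a PDF string."""
    # Hypothetical helper: handles only \ddd escapes, nothing else.
    return [int(digits, 8) for digits in re.findall(r"\\([0-7]{1,3})", pdf_string)]

codes = decode_octal_escapes(r"\200\201\202\203")
text = "".join(chr(subset_map[code]) for code in codes)
```

If text comes back as the four characters that were passed to textLine, the string encoding side is fine and the problem is more likely in the glyph lookup.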


> Debugging in Marius' ttfonts.py code reveals that we don't seem to
> read all of the glyphs. At line 641 our unichars lie in
> range(startCount[n],endCount[n]+1) and we are reading startCount &
> endCount with read_ushort() so all our unichars lie in 0 <= unichar
> <= 0xffff.


I'm afraid it's been too long, and I don't remember the code. Or the
TTF spec, for that matter.

I wouldn't be surprised to learn that characters outside the Basic
Multilingual Plane require special support.
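
A quick way to see the limit, as a sketch only: a table whose start and end codes are unsigned 16-bit values cannot represent the emoji code points at all. fits_in_ushort is a name I made up for illustration; it is not part of ttfonts.py.

```python
import struct

def fits_in_ushort(code_point):
    """True if code_point can be stored as a big-endian unsigned 16-bit value."""
    try:
        struct.pack(">H", code_point)
        return True
    except struct.error:
        # struct refuses values outside 0..0xFFFF for the 'H' format.
        return False

# The four characters from the test string in the bug report.
report = {hex(ord(ch)): fits_in_ushort(ord(ch))
          for ch in "\u02a4\U0001F631\U0001F64C\U0001F44C"}
```

Only 0x2a4 should map to True here; the three emoji are above 0xFFFF and so can never fall inside a range built from read_ushort() values.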


> Seems there must be some kind of extension to let us read unicodes
> above 0xffff. We're using the first 'unicode' cmap table from
>
> >cmap table 0/4: platFormID=0 encodingID=0 offset=00000024
> >cmap table 1/4: platFormID=1 encodingID=0 offset=00000144
> >cmap table 2/4: platFormID=3 encodingID=1 offset=0000034e
> >cmap table 3/4: platFormID=3 encodingID=10 offset=0000046e
>
> and that appears to be a format 4 table which is what we read. Self
> evidently I'm missing something.


I think you're on the right track.

AFAIR the font files have multiple ways of mapping characters to glyph
numbers. It may be a good idea to look at a recent version of the
TTF/OTF spec and see how non-BMP characters can be mapped to glyphs,
then fix ReportLab's font subsetting code to make sure it can find the
glyphs for these characters.
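
As a starting point, and very much a sketch rather than a patch: in the OpenType spec the platformID=3 encodingID=10 entry in the listing above denotes the Windows full-repertoire Unicode encoding, which is normally stored as a format 12 subtable with 32-bit character codes. Something along these lines could read it. All three function names are made up for illustration and none of this is ReportLab's actual code; I'm also assuming the raw bytes of the 'cmap' table are already in hand.

```python
import struct

def read_cmap_records(cmap):
    """Return [(platformID, encodingID, offset)] from a raw 'cmap' table."""
    _version, num_tables = struct.unpack_from(">HH", cmap, 0)
    # Each encoding record is 8 bytes: uint16, uint16, uint32.
    return [struct.unpack_from(">HHL", cmap, 4 + 8 * i) for i in range(num_tables)]

def find_full_unicode_offset(records):
    """Prefer encodings that can carry code points above 0xFFFF."""
    # (3, 10) is Windows UCS-4; (0, 4) is Unicode 2.0+ full repertoire.
    for wanted in ((3, 10), (0, 4)):
        for platform_id, encoding_id, offset in records:
            if (platform_id, encoding_id) == wanted:
                return offset
    return None

def parse_format12(cmap, offset):
    """Return {code point: glyph index} for a format 12 subtable."""
    fmt, _reserved, _length, _language, num_groups = struct.unpack_from(
        ">HHLLL", cmap, offset)
    if fmt != 12:
        raise ValueError("not a format 12 subtable: format %d" % fmt)
    mapping = {}
    for i in range(num_groups):
        # Each group is three uint32s: first code, last code, first glyph.
        start, end, start_glyph = struct.unpack_from(
            ">LLL", cmap, offset + 16 + 12 * i)
        for code in range(start, end + 1):
            mapping[code] = start_glyph + (code - start)
    return mapping
```

The subsetting code would then fall back to the format 4 table only when find_full_unicode_offset returns None. A real fix would also want bounds checks on the group ranges, which I've left out here.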

It sounds like a fun little exercise, but I'm afraid I won't be able to
find time to play with this any time soon. :/

Marius Gedminas
--
The most effective debugging tool is still careful thought, coupled with
judiciously placed print statements.
-- Brian W. Kernighan, in the paper Unix for Beginners (1979)

