[reportlab-users] [Bitbucket] Issue #24: Missing glyphs from embedded emoji font (Symbola) (rptlab/reportlab)
Marius Gedminas
marius at gedmin.as
Tue Feb 25 17:14:51 EST 2014
On Tue, Feb 25, 2014 at 02:09:39PM +0000, Robin Becker wrote:
> to.textLine(u"Unicode symbols: \u02a4\U0001F631\U0001F64C\U0001F44C")
...
> I ran this on Windows with Python 3.3.3. The result shows the dz
> character \u02a4, but not the three astral plane emoji characters;
> they appear as ? chars.
>
> I looked into the PDF produced by reportlab. We appear to be
> creating the subset map correctly; at the end of the definitions I
> see this:
>
> <7F> <007F>
> <80> <02A4>
> <81> <1F631>
> <82> <1F64C>
> <83> <1F44C>
>
> so we've seen those characters and allegedly created glyphs for
> them. In the body of the document I see this
>
> (Unicode symbols: \200\201\202\203) Tj T* (UTF8 symbols: \200\201\202\203) Tj T* ET
>
> So we're using the octal escapes for 0x80 0x81 0x82 0x83 in the
> string. From this I can only deduce that either we are failing in
> the glyph creation stage somewhere (ie when building the subset the
> glyph lookup fails) or we're building the subset correctly and
> Acrobat fails to deliver. I suspect the former.
That's very plausible.
> Debugging in Marius' ttfonts.py code reveals that we don't seem to
> read all of the glyphs. At line 641 our unichars lie in
> range(startCount[n],endCount[n]+1) and we are reading startCount &
> endCount with read_ushort() so all our unichars lie in 0<= unichar
> <= 0xffff.
I'm afraid it's been too long, and I don't remember the code. Or the
TTF spec, for that matter.
I wouldn't be surprised to learn that characters outside the Basic
Multilingual Plane require special support.
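
To illustrate the limit, here is a minimal Python 3 sketch (not ReportLab
code; the helper name fits_in_ushort is made up for this example) that
checks the code points from the test string against what a big-endian
unsigned 16-bit field can hold:

```python
import struct

# Code points taken from the test string quoted above.
CODEPOINTS = [0x02A4, 0x1F631, 0x1F64C, 0x1F44C]

def fits_in_ushort(cp):
    """Return True if cp can be stored in a big-endian unsigned 16-bit field."""
    try:
        struct.pack(">H", cp)
        return True
    except struct.error:
        # struct refuses values outside 0..0xFFFF for the 'H' format.
        return False

for cp in CODEPOINTS:
    print("U+%04X fits in a ushort: %s" % (cp, fits_in_ushort(cp)))
```

Only U+02A4 fits; the three emoji lie above 0xFFFF, so a table whose
range fields are 16 bits wide cannot list them.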
> Seems there must be some kind of extension to let us read unicodes
> above 0xffff. We're using the first 'unicode' cmap table from
>
> >cmap table 0/4: platFormID=0 encodingID=0 offset=00000024
> >cmap table 1/4: platFormID=1 encodingID=0 offset=00000144
> >cmap table 2/4: platFormID=3 encodingID=1 offset=0000034e
> >cmap table 3/4: platFormID=3 encodingID=10 offset=0000046e
>
> and that appears to be a format 4 table which is what we read.
> Evidently I'm missing something.
I think you're on the right track.
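
Of the four subtables listed above, the one with platformID=3
encodingID=10 is the Windows "full Unicode repertoire" encoding, which is
where non-BMP mappings would normally live. A rough sketch of picking it
from the cmap header follows; this is an assumption about how one might
do it, not what ttfonts.py does, and the function names are made up:

```python
import struct

def list_cmap_records(cmap):
    """Parse the cmap header: version, numTables, then one
    (platformID, encodingID, offset) record of 8 bytes per subtable."""
    _version, num_tables = struct.unpack_from(">HH", cmap, 0)
    return [struct.unpack_from(">HHL", cmap, 4 + 8 * i)
            for i in range(num_tables)]

def pick_full_unicode(records):
    """Prefer encodings that can address code points above 0xFFFF:
    Windows UCS-4 (3, 10) first, then Unicode full repertoire (0, 4)."""
    for wanted in ((3, 10), (0, 4)):
        for plat, enc, offset in records:
            if (plat, enc) == wanted:
                return plat, enc, offset
    return None  # caller falls back to a BMP-only subtable

# Header rebuilt from the four records in the listing quoted above.
header = struct.pack(">HH", 0, 4)
for rec in [(0, 0, 0x24), (1, 0, 0x144), (3, 1, 0x34E), (3, 10, 0x46E)]:
    header += struct.pack(">HHL", *rec)

print(pick_full_unicode(list_cmap_records(header)))
```

If no such record exists the function returns None, and the code would
have to fall back to the format 4 table it reads today.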
AFAIR the font files have multiple ways of mapping characters to glyph
numbers. It may be a good idea to look at a recent version of the
TTF/OTF spec and see how non-BMP characters can be mapped to glyphs,
then fix ReportLab's font subsetting code to make sure it can find the
glyphs for these characters.
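
For reference, the OpenType spec defines cmap format 12 ("segmented
coverage") with 32-bit character codes, which is the usual way non-BMP
characters are mapped. Below is a hedged sketch of a reader for it. It is
standalone Python, not a patch for ttfonts.py; the function name is
invented and the glyph indices in the sample data are made up:

```python
import struct

def parse_cmap_format12(data, offset=0):
    """Return {code point: glyph index} for a format 12 cmap subtable.

    Layout: uint16 format, uint16 reserved, uint32 length, uint32 language,
    uint32 numGroups, then numGroups records of
    (uint32 startCharCode, uint32 endCharCode, uint32 startGlyphID).
    """
    fmt, _reserved, _length, _language, num_groups = struct.unpack_from(
        ">HHLLL", data, offset)
    if fmt != 12:
        raise ValueError("expected cmap format 12, got %d" % fmt)
    mapping = {}
    pos = offset + 16  # size of the fixed header
    for _ in range(num_groups):
        start, end, start_glyph = struct.unpack_from(">LLL", data, pos)
        pos += 12
        # Glyph indices run consecutively within a group.
        for cp in range(start, end + 1):
            mapping[cp] = start_glyph + (cp - start)
    return mapping

# Synthetic subtable with two groups; glyph numbers are illustrative only.
groups = [(0x1F44C, 0x1F44C, 5), (0x1F631, 0x1F632, 10)]
sample = struct.pack(">HHLLL", 12, 0, 16 + 12 * len(groups), 0, len(groups))
for group in groups:
    sample += struct.pack(">LLL", *group)

glyphs = parse_cmap_format12(sample)
```

The subsetting code could then look up each character in a mapping like
this one whenever the format 4 table has no entry for it.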
It sounds like a fun little exercise, but I'm afraid I won't be able to
find time to play with this any time soon. :/
Marius Gedminas
--
The most effective debugging tool is still careful thought, coupled with
judiciously placed print statements.
-- Brian W. Kernighan, in the paper Unix for Beginners (1979)