[reportlab-users] Incorrect character composition

Mon Apr 20 06:54:50 EDT 2015

On 4/20/2015 2:20 AM, Robin Becker wrote:
> ........
>> The problem is that ReportLab doesn't embed the font directly.  Instead
>> it constructs multiple subsets (each with < 256 codepoints), and those
>> subsets constructed by ReportLab do not have GPOS information (check the
>> TTFontFile.makeSubset method to see what TTF tables are copied and how
>> they're transformed; my apologies about the terrible code you'll find
>> therein).
>>
>> The GPOS table cannot be copied directly: subsetting changes glyph
>> numbering, so the GPOS table would have to be taken apart and
>> reconstructed with the renumbered glyphs.
>>
>
> well I guess the way to go is
>
> 1) try an experiment to see if PDF renderers will accept the GPOS 
> information in a specific font and make good use of it. I guess we can 
> use illustrator or equivalent to make a sample document. Examining the 
> DejaVuSans font shows it certainly has GPOS information.

Maybe. The experiment would also show whether Illustrator handles such
combined characters at all (I don't have Illustrator to test with, but
since it comes from Adobe, it well might), and what the generated PDF
looks like: whether it contains explicit positioning instructions, or
relies on the PDF viewer to have a good renderer.
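
On the DejaVuSans point quoted above: checking whether any given font
file carries a GPOS table only needs the sfnt table directory at the
start of the file. A minimal sketch using nothing but the standard
library; the function names are mine, not ReportLab's, and it assumes a
plain TrueType/OpenType file (not a .ttc collection):

```python
import struct

def sfnt_table_tags(data):
    """Return the table tags listed in an sfnt table directory."""
    # Offset table: 4-byte sfnt version, then numTables as a big-endian uint16.
    num_tables = struct.unpack(">H", data[4:6])[0]
    tags = []
    for i in range(num_tables):
        # Table records start at byte 12; each is 16 bytes
        # (tag, checksum, offset, length). Only the tag is needed here.
        start = 12 + 16 * i
        tags.append(data[start:start + 4].decode("latin-1"))
    return tags

def has_gpos(data):
    """True if the font's table directory lists a GPOS table."""
    return "GPOS" in sfnt_table_tags(data)
```

Reading the font with open(path, "rb").read() and passing the bytes to
has_gpos() would be the intended use. It says nothing about what a PDF
renderer does with the table, only whether the table is present.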

>
> 2) If the answer to 1 is yes then we'll need to parse the GPOS 
> information and construct subsets that keep the required pairs 
> together. From my understanding of the way PDF uses text I see little 
> hope of constructing a single font that does this for all glyphs in a 
> simple way (section 3.2.3 of the 1.7 PDF spec says "A string object 
> consists of a series of bytes—unsigned integer values in the range 0 
> to 255"), so we're apparently limited to encodings of length 256 or 
> less. Presumably we'll have to be really smart about constructing our 
> encodings if many glyph+diacritic pairs are used.

If #2 applies, that analysis of encodings is probably best done after
all the combinations used in the file have been seen.  Would it make
sense to add a pass inside build() just to collect all the characters
used in the document? I have no idea at which pass the current font
subset generation takes place (first, last, or somewhere in the
middle), nor whether more characters get added in later phases because
of repagination, etc.
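
To make the idea concrete, here is a rough sketch of what such an
analysis could look like once all the text is known. It is not how
ReportLab builds its subsets today; pack_subsets() and clusters() are
hypothetical names, and I am assuming that a "pair" means a base
character followed by Unicode combining marks. Each base+diacritic
cluster is placed wholly inside one subset of at most 256 codes, using
simple first-fit packing, so a diacritic may end up duplicated in
several subsets:

```python
import unicodedata

MAX_CODES = 256  # one byte per character code, per the PDF spec quote above

def clusters(text):
    """Split text into base characters with their trailing combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new cluster
    return out

def pack_subsets(text, max_codes=MAX_CODES):
    """First-fit packing: every cluster lands wholly inside one subset."""
    subsets = []
    # dict.fromkeys keeps each distinct cluster once, in order of first use.
    for cluster in dict.fromkeys(clusters(text)):
        needed = set(cluster)
        if len(needed) > max_codes:
            raise ValueError("cluster does not fit in a single subset")
        for subset in subsets:
            if len(subset | needed) <= max_codes:
                subset |= needed
                break
        else:
            subsets.append(set(needed))
    return subsets
```

First-fit is just the simplest thing that keeps pairs together; being
"really smart" about it would presumably mean grouping clusters that
share diacritics so fewer glyphs get duplicated across subsets.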
