[reportlab-users] Incorrect character composition

Fri Apr 17 15:48:00 EDT 2015

On 4/17/2015 7:22 AM, Robin Becker wrote:
> Who is responsible for glyph positioning? I believe it is the font + 
> the renderer who are responsible.

I believe you are correct, but from that Safari Books link I referenced 
a few emails ago:
<https://www.safaribooksonline.com/library/view/developing-with-pdf/9781449327903/ch04.html>

> This means that many things that developers working in other file 
> formats take for granted, such as just putting down Unicode codepoints 
> and letting the renderer do all the hard work, have to be done 
> manually with PDF.

So it seems to me that, while most renderers render to bitmaps at some
resolution, for PDF files you need to assume some (probably high)
bitmap resolution, intercept the rendering process, and capture the
glyphs and positions to convert to PDF instructions. The fonts
may well contain extra glyphs, used for various composition
techniques, that do not correspond to Unicode codepoints. The above
statement implies to me that all the hard work of rendering has to be
done before encoding to PDF, and that the PDF display tools only convert
curves to a bitmap.

Back in 256-character font days, some fonts allocated codes for up to
six different copies of each diacritical, all with zero width but
variant side bearings, and called them "capital O <accent>", "capital E
<accent>", "capital I <accent>" (and three more for lowercase). The
user/program then had to pick the right variety of the accent to go with
the preceding base character (A, E and U being roughly the same width,
generally, so the three variants cover all the vowels; but note that
some diacriticals go with consonants too).

I'm far from an expert in PDF files, and not much better acquainted with
font files. But that quote above from Safari Books, together with the
complexity of transforming the tables in a useful manner, doubt that
PDF display tools would have freedom to use those tables even if they
existed, and uncertainty about whether they are even allowed to be
embedded, makes me think that individual glyphs would have to be
separately positioned by instructions in the PDF file.  The next
paragraphs are an exposition of my thought process in arriving at this
conclusion. Someone with more knowledge may well poke holes in it,
and I'd be delighted to be presented with references to documentation
that pokes those holes, which I couldn't find via Google.

The way I interpret that is that PDF display tools really only follow 
instructions about placing glyphs on the screen... in particular, since 
they need instructions (fairly well documented) to do simple things like 
kerning, I would find it surprising if they would not need instructions 
to do more complex things like character composition.
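
To illustrate what "instructions for kerning" means in practice, here
is a minimal sketch that builds the operand of the PDF TJ operator, in
which numbers between the strings move the next glyph (in thousandths of
a text space unit, subtracted from the current position, so a positive
number tightens the pair). The kerning table and the function names are
made up for the example; it only handles simple single-byte strings and
is not how ReportLab actually writes text.

```python
def escape_pdf_string(s):
    """Escape the characters that are special inside a PDF literal string."""
    return s.replace('\\', '\\\\').replace('(', '\\(').replace(')', '\\)')


def tj_operand(text, kerning):
    """Return '[...] TJ' for text, splitting the string at kerned pairs.

    kerning maps (left, right) to an adjustment in 1/1000 em, negative
    meaning "move closer" as in typical font kerning tables (assumption).
    """
    parts = []
    run = text[:1]
    for left, right in zip(text, text[1:]):
        k = kerning.get((left, right), 0)
        if k:
            parts.append('(%s)' % escape_pdf_string(run))
            # TJ subtracts the number, so the sign is flipped here.
            parts.append('%d' % -k)
            run = right
        else:
            run += right
    if run:
        parts.append('(%s)' % escape_pdf_string(run))
    return '[%s] TJ' % ' '.join(parts)


# Hypothetical kerning values, for illustration only.
KERNING = {('A', 'V'): -80, ('V', 'A'): -80}
print(tj_operand('AVA', KERNING))
```

If even this much has to be spelled out in the content stream, it seems
plausible that mark placement has to be spelled out too.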

On the other hand, having read Marius' reply, I can't rule out that, if
more font tables were included, some particular PDF display tools might
invoke a renderer that would do more work "during display". Still, with
the stated goal that PDF files can be reproduced identically by any PDF
tool, I'm guessing that leaving _any_ work up to the renderer, other
than following exact curve descriptions from a font, would not be
compatible with identical reproduction.

I'm not clear on how the font embedding works; Marius hints at the 
possibility of rebuilding some of the tables to correspond to renumbered 
glyphs, but I've no clue if those tables are allowed to be embedded, or, 
if embedded, if they are allowed to be used by the PDF display tools 
anyway, for identical reproduction.

I found a reference to some howto guides for a font creation tool, and 
it was talking about having alternate glyphs for certain types of uses; 
one example was using a shorter accent mark above uppercase (taller) 
letters than above lowercase (shorter) letters. I've also heard about 
fonts containing "attachment points" (I don't know if that is one of the 
available tables, or is metadata per glyph) so that when characters are 
combined, they combine based on these attachment points.  Whether both 
glyphs of a combining pair must have attachment points, or whether only 
one, I couldn't say, but the goal would seem to be, for many 
diacriticals, to have the diacritical centered above or below the 
character... the zero-width thing for diacriticals doesn't achieve that 
because the base characters have different widths, but checking for 
the left and right side-bearings may be an alternative determination of 
width for accents that want to be centered.
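
A rough sketch of that side-bearing idea, using outline data of the
kind shown in Robin's session quoted below: take the horizontal extent
of the mark and of the base, and shift the mark so the two centres
coincide. The function names and the base glyph numbers are
assumptions for illustration, and using control points for the extent
can slightly overestimate the true outline.

```python
def x_extent(path):
    """Return (xmin, xmax) over all points of a path description.

    path is a list like [('moveTo', x, y), ('curveTo', x1, y1, ...),
    'closePath']; control points are included, so this is a bound on
    the outline rather than its exact extent.
    """
    xs = []
    for op in path:
        if isinstance(op, tuple):
            xs.extend(op[1::2])  # op[0] is the name, then x, y, x, y...
    return min(xs), max(xs)


def centring_shift(base_extent, base_advance, mark_extent):
    """Horizontal shift that centres the mark over the base.

    The mark is assumed to be drawn with its origin at the end of the
    base glyph's advance, which is why base_advance is subtracted.
    """
    base_centre = (base_extent[0] + base_extent[1]) / 2.0
    mark_centre = (mark_extent[0] + mark_extent[1]) / 2.0
    return base_centre - (base_advance + mark_centre)


# A few points from the DejaVuSans tilde outline in the quoted session,
# keeping the leftmost and rightmost ones.
TILDE = [('moveTo', -518, 1370), ('lineTo', -844, 1309),
         ('curveTo', -843, 1399, -825, 1468, -791, 1517),
         ('lineTo', -184, 1591), 'closePath']

# Hypothetical base glyph: ink from x=100 to x=1300, advance 1400.
print(centring_shift((100, 1300), 1400, x_extent(TILDE)))
```

A different base width gives a different shift, which is the reason a
single built-in negative offset in the mark cannot suit every letter.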

Of course some diacriticals are to be placed to the right or left of the 
characters, or connect two characters rather than being placed on one, 
making me wonder if there might be different attachment points for 
different types of diacriticals; a base character could have, 
potentially, up to 6 that I can think of, for diacriticals that should 
be centered, left edged, or right edged, times above and below.
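
Here is how I imagine attachment points working, as a sketch only: each
mark carries one anchor tagged with a class ("above", "below", ...),
each base carries one anchor per class, and the mark is moved so that
the two anchors of the same class coincide. As far as I understand, this
is roughly the model of OpenType mark-to-base positioning, but the
tables, coordinates and names here are invented and are not read from
any real font.

```python
# Hypothetical anchors in font units; not taken from any real font.
BASE_ANCHORS = {
    u'A': {'above': (700, 1493), 'below': (700, 0)},
}
MARK_ANCHORS = {
    u'\u0303': ('above', (-514, 1300)),  # combining tilde
    u'\u0323': ('below', (-514, -100)),  # combining dot below
}


def mark_offset(base, mark, base_advance):
    """Return (dx, dy) to apply to the mark drawn after the base.

    The mark's origin is assumed to sit at the end of the base advance;
    the offset moves the mark's anchor onto the base's anchor of the
    same class.
    """
    anchor_class, (mx, my) = MARK_ANCHORS[mark]
    bx, by = BASE_ANCHORS[base][anchor_class]
    return (bx - base_advance - mx, by - my)


print(mark_offset(u'A', u'\u0303', 1401))
print(mark_offset(u'A', u'\u0323', 1401))
```

In this model both glyphs of a combining pair need an anchor, and a
base needs one anchor for each class of mark it can accept.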

When fonts are carved up into little chunks for pre-Unicode PDF font 
subsets, many of the base characters may wind up in different font 
subsets than the diacriticals that want to attach to them... this makes 
me wonder if it is even possible to rework the font tables as Marius 
suggested, even if it is legal and could be useful for some? all? PDF 
display tools. Maybe some glyphs would have to be repeated in multiple 
subsets?
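
To make the subset concern concrete, here is a simplified model (not
ReportLab's actual code) in which characters are given slots in
fixed-size subsets in order of first use, and text is then split into
runs that each use a single subset. The function names and the tiny
subset size in the example are assumptions chosen so the split is easy
to see.

```python
def assign_subsets(text, size=256):
    """Map each character to (subset number, slot) in order of first use."""
    slots = {}
    for ch in text:
        if ch not in slots:
            n = len(slots)
            slots[ch] = (n // size, n % size)
    return slots


def split_runs(text, slots):
    """Split text into (subset number, string) runs of a single subset."""
    runs = []
    for ch in text:
        subset = slots[ch][0]
        if runs and runs[-1][0] == subset:
            runs[-1] = (subset, runs[-1][1] + ch)
        else:
            runs.append((subset, ch))
    return runs


# With only 4 slots per subset, the tilde lands in a different subset
# from the 'd' it is meant to attach to.
text = u'abcd\u0303'
slots = assign_subsets(text, size=4)
print(split_runs(text, slots))
```

When a base and its mark fall into different runs, any per-font table
relating the two glyphs would have to span two embedded subsets, unless
one of the glyphs is duplicated.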

>
> I wrote the  script below to test various diacritic behaviours in 
> reportlab.
>
> The TLDR is as follows: the TTF fonts seem to know about diacritics. 
> The Adobe builtins may or may not know about them, but with our 
> standard encoding Helvetica clearly doesn't.
>
> The script draws space + glyph + diacritic for some upper and lower 
> case roman letters. It also draws the same after unicode normalization.
>
> Where seen, all the diacritics have zero width. The DejaVuSans font 
> seems to do slightly better than Arial in centring the common 
> diacritics; where available, the composed glyphs (obtained by 
> normalization) seem much better.
>
> With no width for centring it would seem we need to examine the curves 
> to get any kind of centring right. DejaVu & Arial have some built in 
> negative shifts as can be seen by examining the tilde
>
>> C:\tmp>python
>> Python 2.7.8 (default, Jun 30 2014, 16:08:48) [MSC v.1500 64 bit 
>> (AMD64)] on win32
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> from reportlab.pdfbase.pdfmetrics import registerFont
>>>>> from reportlab.pdfbase.ttfonts import TTFont
>>>>> registerFont(TTFont('DejaVuSans','DejaVuSans.ttf'))
>>>>> from reportlab.graphics.charts.textlabels import _text2PathDescription
>>>>> p=_text2PathDescription(u'\u0303',fontName='DejaVuSans',fontSize=2048) 
>>>>>
>>>>> p
>> [('moveTo', -518, 1370), (u'lineTo', -575, 1425), (u'curveTo', -589, 
>> 1438, -602, 1448, -613, 1454),
>> (u'curveTo', -624, 1460, -634, 1464, -643, 1464), (u'curveTo', -668, 
>> 1464, -687, 1452, -699, 1427),
>> (u'curveTo', -711, 1403, -717, 1364, -719, 1309), (u'lineTo', -844, 
>> 1309),
>> (u'curveTo', -843, 1399, -825, 1468, -791, 1517), (u'curveTo', -757, 
>> 1566, -710, 1591, -649, 1591),
>> (u'curveTo', -624, 1591, -601, 1587, -579, 1577), (u'curveTo', -558, 
>> 1568, -535, 1552, -510, 1530),
>> (u'lineTo', -453, 1475), (u'curveTo', -439, 1462, -426, 1452, -414, 
>> 1445),
>> (u'curveTo', -404, 1439, -394, 1436, -385, 1436), (u'curveTo', -360, 
>> 1436, -341, 1448, -329, 1472),
>> (u'curveTo', -317, 1496, -311, 1536, -309, 1591), (u'lineTo', -184, 
>> 1591),
>> (u'curveTo', -185, 1501, -203, 1432, -237, 1382), (u'curveTo', -271, 
>> 1334, -318, 1309, -379, 1309),
>> (u'curveTo', -404, 1309, -427, 1313, -449, 1323), (u'curveTo', -470, 
>> 1332, -493, 1348, -518, 1370),
>> 'closePath']
>>>>> registerFont(TTFont('Arial','Arial.ttf'))
>>>>> pa=_text2PathDescription(u'\u0303',fontName='Arial',fontSize=2048)
>>>>> pa
>> [('moveTo', -909, 1547), (u'curveTo', -909, 1615, -891, 1670, -853, 
>> 1712),
>> (u'curveTo', -816, 1754, -767, 1775, -706, 1775), (u'curveTo', -665, 
>> 1775, -609, 1757, -537, 1721),
>> (u'curveTo', -498, 1701, -467, 1691, -443, 1691), (u'curveTo', -403, 
>> 1691, -378, 1720, -370, 1778),
>> (u'lineTo', -240, 1778), (u'curveTo', -244, 1626, -309, 1550, -436, 
>> 1550),
>> (u'curveTo', -478, 1550, -533, 1568, -602, 1606), (u'curveTo', -646, 
>> 1630, -679, 1642, -700, 1642),
>> (u'curveTo', -752, 1642, -778, 1611, -776, 1547), (u'lineTo', -909, 
>> 1547), 'closePath']
>>>>>
>
>
>
> i.e. the curve starts at -518/2048 and goes at least to -844/2048, but 
> it's clear no single shift can match the various upper and lower case 
> widths that could occur. The Arial curve is even more negative.
>
> If a combined glyph is in the font we should use it, I'm not sure we 
> even have an api for that; TTFont has charToGlyph unicode-->glyph 
> number, but we have code to escape if there are no glyph components 
> defined for it so the test is quite hard.
>
> Otherwise, generating a missing combined glyph dynamically is probably 
> the way to go, but to do that we need information about how each 
> combining character is supposed to be positioned. The alternative is 
> to attempt to do the adjustment every time we render text using pdf 
> operators; we still need the same information.
>
> #################################################################
> from reportlab.pdfbase.ttfonts import TTFont
> from reportlab.pdfbase.pdfmetrics import registerFont
> from reportlab.pdfgen.canvas import Canvas
> from reportlab.lib.pagesizes import A4 as pagesize
> from reportlab.lib.utils import uniChr
> from unicodedata import normalize as unormalize
> registerFont(TTFont("Arial", "Arial.ttf"))
> registerFont(TTFont("DejaVuSans", "DejaVuSans.ttf"))
>
> c = Canvas('tdiacritics.pdf', pagesize=pagesize)
> y0 = pagesize[1]-12
> for fontName in ('Arial','DejaVuSans','Helvetica'):
>     c.setFont(fontName, 10)
>     y = y0
>     y -= 12
>     c.drawString(18,y,fontName)
>     for diacritic in range(0x300,0x370):
>         if y-24 < 0:
>             c.showPage()
>             c.setFont(fontName, 10)
>             y = y0
>             y -= 12
>             c.drawString(18,y,fontName)
>         y -= 12
>         x = 18
>         diacritic = uniChr(diacritic)
>         c.drawString(x,y,hex(ord(diacritic)))
>         x += 40
>         u = u' '+diacritic+(u' w=%s'%c.stringWidth(diacritic))
>         c.drawString(x,y,u)
>         x += max(c.stringWidth(u),40)
>         for g in u'AEIOUYaeiouy':
>             u = ' '+g+diacritic
>             c.drawString(x,y,u)
>             x += 20
>     c.showPage()
>     c.setFont(fontName, 10)
>     y = y0
>     y -= 12
>     c.drawString(18,y,fontName+' normalized')
>     for diacritic in range(0x300,0x370):
>         if y-24 < 0:
>             c.showPage()
>             c.setFont(fontName, 10)
>             y = y0
>             y -= 12
>             c.drawString(18,y,fontName+' normalized')
>         y -= 12
>         x = 18
>         diacritic = uniChr(diacritic)
>         c.drawString(x,y,hex(ord(diacritic)))
>         x += 40
>         u = u' '+diacritic+(u' w=%s'%c.stringWidth(diacritic))
>         c.drawString(x,y,u)
>         x += max(c.stringWidth(u),40)
>         for g in u'AEIOUYaeiouy':
>             u = unormalize('NFC',' '+g+diacritic)
>             c.drawString(x,y,u)
>             x += 20
>     c.showPage()
> c.save()
> #################################################################
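
For what it's worth, the "use the combined glyph if the font has it"
idea from the quoted message could be sketched like this: normalize the
base + mark pair to NFC, and use the result only if it collapsed to a
single code point that the font covers. The font_codepoints set is a
stand-in for whatever the font's character map provides; how to obtain
it from a TTFont is left open here, and the function name is made up.

```python
from unicodedata import normalize


def compose_if_available(base, mark, font_codepoints):
    """Return (text, composed) for a base character plus combining mark.

    font_codepoints is a hypothetical set of code points the font has
    glyphs for. If NFC yields one code point in that set, it is used;
    otherwise the decomposed pair is returned and still needs manual
    positioning.
    """
    composed = normalize('NFC', base + mark)
    if len(composed) == 1 and ord(composed) in font_codepoints:
        return composed, True
    return base + mark, False


# U+00C3 is LATIN CAPITAL LETTER A WITH TILDE.
print(compose_if_available(u'A', u'\u0303', {0x41, 0xC3, 0x303}))
print(compose_if_available(u'q', u'\u0303', {0x71, 0x303}))
```

Anything that comes back uncomposed would fall through to one of the
positioning approaches discussed above.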
