[reportlab-users] Incorrect character composition
Glenn Linderman
v+python at g.nevcal.com
Fri Apr 17 15:48:00 EDT 2015
On 4/17/2015 7:22 AM, Robin Becker wrote:
> Who is responsible for glyph positioning. I believe it is the font +
> the renderer who is responsible.
I believe you are correct, but from that Safari Books link I referenced
a few emails ago:
<https://www.safaribooksonline.com/library/view/developing-with-pdf/9781449327903/ch04.html>
> This means that many things that developers working in other file
> formats take for granted, such as just putting down Unicode codepoints
> and letting the renderer do all the hard work, have to be done
> manually with PDF.
So it seems to me that, while most renderers render to bitmaps at some
resolution, for PDF files you need to assume some (probably high)
bitmap resolution, intercept the rendering process, and capture the
glyphs and positions to convert to PDF instructions. The fonts may well
contain extra glyphs, used for various composition techniques, that do
not correspond to Unicode codepoints. The above statement implies to me
that all the hard work of rendering has to be done before encoding to
PDF, and that the PDF display tools only convert curves to a bitmap.
Back in 256-character font days, some fonts allocated codes for up to
six different copies of the diacritical, all with zero width but
variant side bearings, and called them "capital O <accent>", "capital E
<accent>", "capital I <accent>" (and three more for lowercase). The
user/program then had to pick the right variety of the accent to go with
the preceding base character (A & E & U being roughly the same width,
generally, to cover all the vowels; but note that some diacriticals go
with consonants too).
I'm far from an expert in PDF files, and not much more acquainted with
font files. But that quote above from Safari Books, together with the
complexity of transforming the tables in a useful manner, my doubt that
PDF display tools would have the freedom to use those tables even if
they existed, and my uncertainty about whether they are even allowed to
be embedded, makes me think that individual glyphs would have to be
separately positioned by instructions in the PDF file. The next
paragraphs are an exposition of my thought process in arriving at this
conclusion. Someone with more knowledge may well poke holes in it, and
I'd be delighted to be presented with references to documentation that
pokes those holes, which I couldn't find via Google.
The way I interpret that is that PDF display tools really only follow
instructions about placing glyphs on the screen... in particular, since
they need instructions (fairly well documented) to do simple things like
kerning, I would find it surprising if they would not need instructions
to do more complex things like character composition.
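To make the kerning point concrete, here is a minimal sketch of how
explicit positioning looks at the PDF operator level. The TJ operator
takes an array of strings and numbers; each number is in thousandths of
a unit of text space and is subtracted from the current position, so a
positive value pulls the next glyph to the left. The helper names
pdf_string and tj_operator are my own, and the sketch assumes a simple
single-byte font, not the encoding reportlab uses for TTF subsets.

```python
def pdf_string(text):
    """Escape text as a PDF literal string (simple single-byte fonts only)."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def tj_operator(pieces):
    """Build a TJ operator from strings and numeric adjustments.

    Numbers are in thousandths of a unit of text space; a positive
    value moves the following glyph left, a negative value moves it right.
    """
    parts = []
    for piece in pieces:
        if isinstance(piece, str):
            parts.append(pdf_string(piece))
        else:
            parts.append("%g" % piece)
    return "[" + " ".join(parts) + "] TJ"


# Tighten the gap between "A" and "V" by 80/1000 of the font size.
kerned = tj_operator(["A", 80, "V"])
print(kerned)
```

If kerning has to be spelled out this way, it seems plausible that a
combining accent would need the same kind of explicit adjustment.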
On the other hand, having read Marius' reply, I can't rule out that, if
more font tables were included, some particular PDF display tools might
invoke a renderer that would do more work "during display". Then again,
with the stated goal that PDF files can be reproduced identically by any
PDF tool, I'm guessing that leaving _any_ work up to the renderer, other
than following exact curve descriptions from a font, would not be
compatible with identical reproduction.
I'm not clear on how the font embedding works; Marius hints at the
possibility of rebuilding some of the tables to correspond to renumbered
glyphs, but I've no clue if those tables are allowed to be embedded, or,
if embedded, if they are allowed to be used by the PDF display tools
anyway, for identical reproduction.
I found a reference to some howto guides for a font creation tool, and
it was talking about having alternate glyphs for certain types of uses;
one example was using a shorter accent mark above uppercase (taller)
letters than above lowercase (shorter) letters. I've also heard about
fonts containing "attachment points" (I don't know if that is one of the
available tables, or is metadata per glyph) so that when characters are
combined, they combine based on these attachment points. Whether both
glyphs of a combining pair must have attachment points, or whether only
one, I couldn't say, but the goal would seem to be, for many
diacriticals, to have the diacritical centered above or below the
character... the zero-width thing for diacriticals doesn't achieve that
because the base characters have different widths, but checking the
left and right side-bearings may be an alternative way to determine the
width for accents that want to be centered.
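A rough sketch of that centering arithmetic, under my own assumptions:
the mark is drawn at the pen position after the base glyph (zero
advance, ink at negative x), and the base glyph's ink is taken to run
from its left side-bearing to its advance minus its right side-bearing.
The tilde extents are the x range visible in the DejaVuSans outline
Robin quotes further down; the base glyph metrics are hypothetical
numbers, not read from any font.

```python
def centering_shift(base_advance, base_lsb, base_rsb, mark_xmin, mark_xmax):
    """Return the x shift (font units) that centers a mark over the base ink.

    The pen sits just after the base glyph, so the base ink spans
    [-base_advance + base_lsb, -base_rsb] relative to the pen.
    """
    base_center = (-base_advance + base_lsb - base_rsb) / 2.0
    mark_center = (mark_xmin + mark_xmax) / 2.0
    return base_center - mark_center


# x extents of the DejaVuSans combining tilde at 2048 units per em,
# as seen in the quoted outline.
TILDE_XMIN, TILDE_XMAX = -844, -184

# Hypothetical base glyph metrics: (advance, left bearing, right bearing).
for name, metrics in (("narrow", (600, 50, 50)), ("wide", (1400, 100, 60))):
    print(name, centering_shift(*metrics, TILDE_XMIN, TILDE_XMAX))
```

The two hypothetical glyphs need shifts of opposite sign, which is the
reason one built-in offset in the accent cannot suit every base
character.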
Of course some diacriticals are to be placed to the right or left of the
characters, or connect two characters rather than being placed on one,
making me wonder if there might be different attachment points for
different types of diacriticals; a base character could have,
potentially, up to 6 that I can think of, for diacriticals that should
be centered, left edged, or right edged, times above and below.
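To illustrate the count of six, here is a small sketch that enumerates
the combinations (three horizontal alignments times above/below) and
fills them in from a glyph bounding box. Deriving anchors from the
bounding box is only a crude stand-in I am assuming for illustration; a
font designer would presumably choose real attachment points by eye,
and the box used here is hypothetical.

```python
from itertools import product

HORIZONTAL = ("left", "center", "right")
VERTICAL = ("above", "below")


def anchor_points(xmin, ymin, xmax, ymax):
    """Map each (horizontal, vertical) pair to a point on the bounding box."""
    xs = {"left": xmin, "center": (xmin + xmax) / 2.0, "right": xmax}
    ys = {"above": ymax, "below": ymin}
    return {(h, v): (xs[h], ys[v]) for h, v in product(HORIZONTAL, VERTICAL)}


# Hypothetical bounding box for a capital letter, in font units.
anchors = anchor_points(100, 0, 1100, 1500)
for key in sorted(anchors):
    print(key, anchors[key])
```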
When fonts are carved up into little chunks for pre-Unicode PDF font
subsets, many of the base characters may wind up in different font
subsets than the diacriticals that want to attach to them... this makes
me wonder if it is even possible to rework the font tables as Marius
suggested, even if it is legal and could be useful for some? all? PDF
display tools. Maybe some glyphs would have to be repeated in multiple
subsets?
>
> I wrote the script below to test various diacritic behaviours in
> reportlab.
>
> The TLDR is as follows: the TTF fonts seem to know about diacritics.
> The Adobe built-ins may or may not know about them, but with our
> standard encoding Helvetica clearly doesn't.
>
> The script draws space + glyph + diacritic for some upper and lower
> case roman letters. It also draws the same after Unicode normalization.
>
> Where seen, all the diacritics have zero width. The DejaVuSans font
> seems to do slightly better than Arial in centring the common
> diacritics; where available, the composed glyphs (obtained by
> normalization) seem much better.
>
> With no width for centring it would seem we need to examine the curves
> to get any kind of centring right. DejaVu & Arial have some built-in
> negative shifts, as can be seen by examining the tilde:
>
>> C:\tmp>python
>> Python 2.7.8 (default, Jun 30 2014, 16:08:48) [MSC v.1500 64 bit (AMD64)] on win32
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> from reportlab.pdfbase.pdfmetrics import registerFont
>> >>> from reportlab.pdfbase.ttfonts import TTFont
>> >>> registerFont(TTFont('DejaVuSans','DejaVuSans.ttf'))
>> >>> from reportlab.graphics.charts.textlabels import _text2PathDescription
>> >>> p=_text2PathDescription(u'\u0303',fontName='DejaVuSans',fontSize=2048)
>> >>> p
>> [('moveTo', -518, 1370), (u'lineTo', -575, 1425), (u'curveTo', -589,
>> 1438, -602, 1448, -613, 1454),
>> (u'curveTo', -624, 1460, -634, 1464, -643, 1464), (u'curveTo', -668,
>> 1464, -687, 1452, -699, 1427),
>> (u'curveTo', -711, 1403, -717, 1364, -719, 1309), (u'lineTo', -844,
>> 1309),
>> (u'curveTo', -843, 1399, -825, 1468, -791, 1517), (u'curveTo', -757,
>> 1566, -710, 1591, -649, 1591),
>> (u'curveTo', -624, 1591, -601, 1587, -579, 1577), (u'curveTo', -558,
>> 1568, -535, 1552, -510, 1530),
>> (u'lineTo', -453, 1475), (u'curveTo', -439, 1462, -426, 1452, -414,
>> 1445),
>> (u'curveTo', -404, 1439, -394, 1436, -385, 1436), (u'curveTo', -360,
>> 1436, -341, 1448, -329, 1472),
>> (u'curveTo', -317, 1496, -311, 1536, -309, 1591), (u'lineTo', -184,
>> 1591),
>> (u'curveTo', -185, 1501, -203, 1432, -237, 1382), (u'curveTo', -271,
>> 1334, -318, 1309, -379, 1309),
>> (u'curveTo', -404, 1309, -427, 1313, -449, 1323), (u'curveTo', -470,
>> 1332, -493, 1348, -518, 1370),
>> 'closePath']
>> >>> registerFont(TTFont('Arial','Arial.ttf'))
>> >>> pa=_text2PathDescription(u'\u0303',fontName='Arial',fontSize=2048)
>> >>> pa
>> [('moveTo', -909, 1547), (u'curveTo', -909, 1615, -891, 1670, -853,
>> 1712),
>> (u'curveTo', -816, 1754, -767, 1775, -706, 1775), (u'curveTo', -665,
>> 1775, -609, 1757, -537, 1721),
>> (u'curveTo', -498, 1701, -467, 1691, -443, 1691), (u'curveTo', -403,
>> 1691, -378, 1720, -370, 1778),
>> (u'lineTo', -240, 1778), (u'curveTo', -244, 1626, -309, 1550, -436,
>> 1550),
>> (u'curveTo', -478, 1550, -533, 1568, -602, 1606), (u'curveTo', -646,
>> 1630, -679, 1642, -700, 1642),
>> (u'curveTo', -752, 1642, -778, 1611, -776, 1547), (u'lineTo', -909,
>> 1547), 'closePath']
>>>>>
>
>
>
> i.e. the curve starts at -518/2048 and extends left at least to
> -844/2048, but it's clear no single shift can match the various upper
> and lower case widths that could occur. The Arial curve is even more
> negative.
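Reading the extents off by eye gets tedious, so here is a small sketch
that scans a path description of the shape shown above (tuples of an
operator name followed by x, y pairs, plus a bare 'closePath' string).
The function name path_extents is my own. Curve control points are
included, so the result is a bounding estimate that may be slightly
larger than the true ink. The data is only an excerpt of the quoted
DejaVuSans tilde, chosen to contain its leftmost and rightmost points.

```python
def path_extents(path):
    """Return (xmin, ymin, xmax, ymax) over every coordinate in the path."""
    xs, ys = [], []
    for step in path:
        if isinstance(step, tuple):  # skip the bare 'closePath' marker
            coords = step[1:]
            xs.extend(coords[0::2])
            ys.extend(coords[1::2])
    return min(xs), min(ys), max(xs), max(ys)


# Excerpt of the DejaVuSans U+0303 outline quoted above (fontSize=2048).
excerpt = [
    ('moveTo', -518, 1370), ('lineTo', -575, 1425),
    ('curveTo', -711, 1403, -717, 1364, -719, 1309), ('lineTo', -844, 1309),
    ('curveTo', -317, 1496, -311, 1536, -309, 1591), ('lineTo', -184, 1591),
    'closePath',
]

xmin, ymin, xmax, ymax = path_extents(excerpt)
print(xmin, ymin, xmax, ymax)
```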
>
> If a combined glyph is in the font we should use it; I'm not sure we
> even have an API for that. TTFont has charToGlyph (unicode --> glyph
> number), but we have code to escape if there are no glyph components
> defined for it, so the test is quite hard.
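One possible shape for "use the combined glyph if the font has it",
sketched with the standard library only. The function name
compose_if_available and the set of codepoints are hypothetical; in
practice the set would have to come from whatever the font's character
map actually covers, and this does not touch reportlab's TTFont
internals at all.

```python
import unicodedata


def compose_if_available(base, mark, available):
    """Prefer a precomposed character when Unicode defines one and the
    font covers it; otherwise keep the base + combining mark sequence.

    `available` is a set of codepoints the font is assumed to cover.
    """
    composed = unicodedata.normalize("NFC", base + mark)
    if len(composed) == 1 and ord(composed) in available:
        return composed
    return base + mark


# Hypothetical coverage: A, A-with-tilde, and the combining tilde itself.
font_codepoints = {0x41, 0xC3, 0x303}

for base in (u"A", u"Y"):
    result = compose_if_available(base, u"\u0303", font_codepoints)
    print(base, [hex(ord(ch)) for ch in result])
```

Pairs that fall through to the two-character sequence are the ones that
would still need explicit positioning.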
>
> Otherwise, generating a missing combined glyph dynamically is probably
> the way to go, but to do that we need information about how each
> combining character is supposed to be positioned. The alternative is
> to attempt to do the adjustment every time we render text using PDF
> operators; we still need the same information.
>
> #################################################################
> from reportlab.pdfbase.ttfonts import TTFont
> from reportlab.pdfbase.pdfmetrics import registerFont
> from reportlab.pdfgen.canvas import Canvas
> from reportlab.lib.pagesizes import A4 as pagesize
> from reportlab.lib.utils import uniChr
> from unicodedata import normalize as unormalize
> registerFont(TTFont("Arial", "Arial.ttf"))
> registerFont(TTFont("DejaVuSans", "DejaVuSans.ttf"))
>
> c = Canvas('tdiacritics.pdf', pagesize=pagesize)
> y0 = pagesize[1]-12
> for fontName in ('Arial','DejaVuSans','Helvetica'):
>     c.setFont(fontName, 10)
>     y = y0
>     y -= 12
>     c.drawString(18,y,fontName)
>     for diacritic in range(0x300,0x370):
>         if y-24 < 0:
>             c.showPage()
>             c.setFont(fontName, 10)
>             y = y0
>             y -= 12
>             c.drawString(18,y,fontName)
>         y -= 12
>         x = 18
>         diacritic = uniChr(diacritic)
>         c.drawString(x,y,hex(ord(diacritic)))
>         x += 40
>         u = u' '+diacritic+(u' w=%s'%c.stringWidth(diacritic))
>         c.drawString(x,y,u)
>         x += max(c.stringWidth(u),40)
>         for g in u'AEIOUYaeiouy':
>             u = ' '+g+diacritic
>             c.drawString(x,y,u)
>             x += 20
>     c.showPage()
>     c.setFont(fontName, 10)
>     y = y0
>     y -= 12
>     c.drawString(18,y,fontName+' normalized')
>     for diacritic in range(0x300,0x370):
>         if y-24 < 0:
>             c.showPage()
>             c.setFont(fontName, 10)
>             y = y0
>             y -= 12
>             c.drawString(18,y,fontName+' normalized')
>         y -= 12
>         x = 18
>         diacritic = uniChr(diacritic)
>         c.drawString(x,y,hex(ord(diacritic)))
>         x += 40
>         u = u' '+diacritic+(u' w=%s'%c.stringWidth(diacritic))
>         c.drawString(x,y,u)
>         x += max(c.stringWidth(u),40)
>         for g in u'AEIOUYaeiouy':
>             u = unormalize('NFC',' '+g+diacritic)
>             c.drawString(x,y,u)
>             x += 20
>     c.showPage()
> c.save()
> #################################################################