[reportlab-users] Incorrect character composition

Thu Apr 16 01:46:58 EDT 2015

On 4/15/2015 2:02 AM, Andy Robinson wrote:
> Glenn,  my apologies - I had assumed you were discussing "unusual
> languages" without re-reading the original email carefully.  It might
> not be that bad.
>
> There are two things we could do in the short term, and I'm keen to
> keep the core library moving forwards:
>
> (1) We could potentially provide a special flowable for kerned titles
> and short phrases.  This would of course have to render a glyph at a
> time in Python, doing the lookups and calculations

Various fonts carry hints for use on fixed-resolution devices at low 
resolution, and when those hints are applied, both the glyph rendering 
and the character spacing vary.  I don't know if PDF supports that directly, but I noticed 
when printing from a browser to a PDF printer, that the character 
spacing was weird in the printed result. When I told the browser to 
scale everything up really high, and then the browser's printer driver 
to scale to fit the page, the weird character spacing went away.  So in 
producing PDF for typesetting, it is best to ignore the "hints" for low 
resolution devices.  Of course screens are some of the lowest resolution 
devices, and that is what browsers aim at, mostly. Printing is sort of a 
side effect.

My data would be mostly a short word, or up to 3 lines of outdented 
text, without right justification.

> (2) If you can find another open source PDF generator in any language
> which gets it right, and let us know, we can study a "hello world" PDF
> out of that tool and see what it does.   This would be a big time
> saver.

There are, I think, 4 issues, the first two of which I could definitely 
use if implemented, and which sound relatively easy, but likely have a 
performance impact. They would enable _higher quality typesetting_ of 
Latin-based text into PDF files. The others could be hard, but would be 
required to support a wider range of languages with non-Latin fonts.  I 
did read something recently about Micro$oft producing a font layout 
system (but they used a different word in the article that I cannot come 
up with right now) for all the various needs of different language 
systems... The closest thing I can find with Google right now is their 
DirectWrite, but whether it incorporates the technology I read about, I 
couldn't say, but maybe it does or will. I don't recall if this was 
something they were making generally available to make the world's 
typography improve, or if it was a proprietary come-on to 
promote/improve Windows. It sounded pretty general, language-wise.

 1. kerning
 2. composite glyph positioning
 3. Languages with huge numbers of ligatures, where characters appear
    differently, even to the point of requiring different glyphs, at the
    beginning or end of words (Arabic) or adjacent to other letters (Thai).
 4. RTL languages.

1. kerning

My research into kerning is below, since it was somewhat productive. 
Most of it was on this list. I have not had time to research composite 
glyph positioning, which

Here's a reference to how to emit kerning into a PDF file: 
http://stackoverflow.com/questions/18304954/how-is-kerning-encoded-on-embedded-adobe-type-1-fonts-in-pdf-files
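
To illustrate the kind of output involved, here is a minimal sketch of
emitting pair kerning with the PDF TJ operator. It assumes a simple
single-byte font whose character codes match the text, and AFM-style
kerning values in 1/1000 em where negative means tighter. The function
names and the kerning table are hypothetical, not part of reportlab or
Wordaxe. TJ subtracts each number from the current position, so the AFM
value is negated.

```python
def escape_pdf_string(s):
    """Escape the characters that are special inside a PDF literal string."""
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def kerned_tj(text, kern_pairs):
    """Build a TJ operation for `text`.

    kern_pairs maps (left, right) character pairs to kerning values in
    1/1000 em (negative = tighter), as found in AFM KPX entries.
    """
    parts = []
    run = ""
    for i, ch in enumerate(text):
        run += ch
        nxt = text[i + 1] if i + 1 < len(text) else None
        kern = kern_pairs.get((ch, nxt), 0) if nxt is not None else 0
        if kern:
            # Close the current string and emit the adjustment after it.
            parts.append("(%s)" % escape_pdf_string(run))
            parts.append("%d" % -kern)  # TJ numbers are subtracted
            run = ""
    if run:
        parts.append("(%s)" % escape_pdf_string(run))
    return "[%s] TJ" % " ".join(parts)


# Hypothetical kerning values, for illustration only.
HYPOTHETICAL_PAIRS = {("A", "V"): -80, ("V", "A"): -80, ("T", "o"): -70}
print(kerned_tj("AVATAR To", HYPOTHETICAL_PAIRS))
```

Unkerned runs stay in one string, so the cost in file size is
proportional to the number of kerned pairs rather than to the number of
glyphs.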

On this mailing list, the following messages are about kerning, and the 
last two have sample PDF files that claim to have kerning. Perhaps 
integrating Henning's Wordaxe kerning code into reportlab itself would 
make it easier to get it working with flowables. 
Anyway, it is a start.

From: Henning von Bargen <H.vonBargen at t-p.com>
Date: Tue, 6 Jan 2015 07:16:15 +0000

> Wordaxe does support automatic hyphenation and kerning.
>
> See the SVN trunk (current revision is 110) at
> http://sourceforge.net/p/deco-cow/code/HEAD/tree/trunk/
>
> However, I failed to make it work with RL's ImageAndFlowables class.
> That's why I did not release an official new version.
>
> For an example with kerning support, see the file
> http://sourceforge.net/p/deco-cow/code/HEAD/tree/trunk/tests/test_truetype.py
>
> I agree with Andy that kerning slows the paragraph-wrapping process down,
> so personally I would only use it for headings and title, not for the
> main text content.

From: Dinu Gherman <gherman at darwin.in-berlin.de>
Date: Tue, 6 Jan 2015 11:37:40 +0100

From: Dinu Gherman <gherman at darwin.in-berlin.de>
Date: Tue, 6 Jan 2015 11:39:30 +0100

From: Dinu Gherman <gherman at darwin.in-berlin.de>
Date: Tue, 6 Jan 2015 11:40:57 +0100

2. Composite glyph positioning

Regarding composite characters made from multiple glyphs, the only 
scheme I can now find to adjust Y position is described at the very end 
of this link: 
https://www.safaribooksonline.com/library/view/developing-with-pdf/9781449327903/ch04.html 
That shows the use of the Td operator to set both the X and Y position 
between glyphs, but doesn't show how to calculate X and Y from font 
metrics. It would seem that only linear kerning was a concern and was 
optimized in operators when the PDF format was designed (since it predates Unicode). 
The idea of composing glyphs on the fly probably hadn't crossed any 
English-speaking minds, back then. The first couple paragraphs at that 
link hint at that likelihood.
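
A minimal sketch of the Td approach, under stated assumptions: the mark
offsets (mark_dx, mark_dy) and the base advance are hypothetical inputs
in 1/1000 em that would have to come from the font's own positioning
data, which is the part not covered by the link above. As far as I
understand the PDF reference, Td offsets are relative to the start of
the current line, not to the position after the last glyph shown, so
the second Td below undoes the first and then adds the base advance.

```python
def place_mark_ops(base, mark, base_advance, mark_dx, mark_dy, font_size):
    """Return content-stream operators that draw `base`, then `mark`
    offset by (mark_dx, mark_dy), then move to where the next glyph starts.

    `base` and `mark` are already-escaped PDF literal string contents.
    Metrics are in 1/1000 em; Td takes unscaled text space units, so
    they are scaled by the font size here.
    """
    scale = font_size / 1000.0
    dx = mark_dx * scale
    dy = mark_dy * scale
    adv = base_advance * scale
    return "\n".join([
        "(%s) Tj" % base,                        # base glyph at line start
        "%.3f %.3f Td" % (dx, dy),               # move line start to mark origin
        "(%s) Tj" % mark,                        # draw the mark
        "%.3f %.3f Td" % (adv - dx, 0.0 - dy),   # back to baseline, past the base
    ])


# Hypothetical numbers: a 500-unit-wide base with a mark shifted by
# (100, 200) units, set at 10 pt. "x" stands in for the mark's code.
print(place_mark_ops("e", "x", 500, 100, 200, 10))
```

Every composed character costs two extra Td operations this way, which
fits the observation that the operators were optimized for linear
spacing only.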

Speculation: Maybe there is some mechanism to create composite glyphs 
from the individual glyphs for the composite character codes, and embed 
that composite glyph in the PDF and use its internal code instead of 
positioning them in the stream via the Td operator... but I haven't 
found that... only a few things that seemed to hint at it. While Unicode 
didn't do that, because of the character code explosion that would 
result, any given PDF only needs to deal with the characters (individual 
or composite) actually used in any particular document. So there _might_ 
be a tradeoff between complexity of font embedding versus the complexity 
of font display.

Maybe somewhat unrelated to the above issues, but interesting:

I also just found 
http://www.linuxfoundation.org/images/8/80/Textextraction_slides_small.pdf 
which is a bit short on details of how, but it looks like it would be 
appropriate and useful to use the "ToUnicode" feature, whatever it is, 
when generating PDF files. I seem to have found it in section 5.9.2 of 
the 1.7 version of the PDF reference, although I haven't absorbed it 
yet.
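
For orientation, here is a minimal sketch of generating the text of a
ToUnicode CMap, assuming 2-byte glyph codes and following the general
layout of the example in the PDF reference. The function name and the
sample mapping are hypothetical. Each entry maps one glyph code to a
UTF-16BE string, so a single composite glyph can map back to several
code points (a base letter plus a combining mark), which is what text
extraction needs.

```python
def to_unicode_cmap(code_to_text):
    """Return the text of a ToUnicode CMap for 2-byte glyph codes.

    code_to_text maps an integer glyph code to the Unicode string that
    the glyph represents.
    """
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    items = sorted(code_to_text.items())
    # bfchar sections are written in chunks of at most 100 entries.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append("%d beginbfchar" % len(chunk))
        for code, text in chunk:
            dst = text.encode("utf-16-be").hex().upper()
            lines.append("<%04X> <%s>" % (code, dst))
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)


# Hypothetical mapping: glyph 1 is "A", glyph 2 is a precomposed
# "e with acute" glyph that should extract as e + U+0301.
print(to_unicode_cmap({1: "A", 2: "e\u0301"}))
```

The resulting text would be embedded as a stream and referenced from
the font dictionary's ToUnicode entry.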

>
> - Andy