[reportlab-users] Incorrect character composition

Glenn Linderman v+python at g.nevcal.com
Tue Apr 21 03:28:01 EDT 2015


On 4/20/2015 11:14 PM, Andy Robinson wrote:
>> Meantime, seeing your approach of looking at Illustrator output, I had a
>> friend with Acrobat take my little test string and create a PDF from
>> Acrobat.  Results look good, and are at:
>> http://nevcal.com/temporary/openo-Acrobat.pdf  Maybe seeing what they do
>> will help.  File is big enough they must have embedded something or another,
>> font-wise.
> Glenn, could you ask your friend exactly what they did with Acrobat to
> create this?  i.e. did they use Acrobat Distiller to convert a
> Postscript file, or create a word document and export it to PDF using
> Acrobat?  If we can observe another program "doing it right" it may
> help.
Oh, he told me up front, I just didn't think the process mattered so 
much as the results.

He started from my email, selected and then copied the text string into 
the clipboard, and apparently Acrobat has a feature to create a PDF file 
from the clipboard contents, so he used that to create the file.  First 
time through he selected the whole email, but I thought that would just 
add clutter, so asked him to just do the "interesting" text.

He's quite fond of PDF editors, it would seem... he has several.

Using the same process, there is now another file at

http://nevcal.com/temporary/openo-Nuance.pdf

He also tried the just-released Nitro 10, but it failed to create from the clipboard, failed to create from a UTF-8 text file, failed to create from a plain text file, and failed to create from a Word document... he has submitted a bug report, and is probably busy reinstalling the prior version of Nitro.  If Nitro 9 will do the job, that might give another file in the near future.

Actually, the reason he has so many is that most of them have limitations: some are good for one sort of thing but mess up on other things. Another will do the other things, but not something else. Etc.  He mostly uses the editing features to fine-tune and work around limitations in the PDF creation from other programs, rather than using them to create raw PDF files from other formats.

So in counting the input characters for my sample, there are 8 base/precomposed characters, and 4 combining diacriticals, for a total of 12.
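
For anyone who wants to repeat that count mechanically, here is a minimal Python sketch using the standard unicodedata module. SAMPLE is a made-up stand-in with the same shape (8 base/precomposed characters, 4 combining tildes); it is an assumption and not necessarily the actual test string.

```python
import unicodedata

# Hypothetical stand-in with the same shape as the test string
# (NOT necessarily the real one): open o, a-tilde, open o + combining
# tilde, O-tilde, capital open O + tilde, open e, open e + tilde,
# capital open E + tilde.
SAMPLE = (
    "\u0254\u00e3\u0254\u0303\u00d5"
    "\u0186\u0303\u025b\u025b\u0303\u0190\u0303"
)


def split_counts(text):
    """Return (base_or_precomposed, combining) code point counts."""
    # unicodedata.combining() returns a nonzero canonical combining
    # class for combining marks and 0 for everything else.
    combining = sum(1 for ch in text if unicodedata.combining(ch))
    return len(text) - combining, combining


base, marks = split_counts(SAMPLE)
print(base, marks, base + marks)
```

Precomposed characters such as ã count as one base character here, because they are a single code point until the string is normalized to NFD.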

The text stream from Acrobat is as follows:

15 0 obj
<<
/Length 455
>>
stream
BT
/P <</MCID 0 >>BDC
/CS0 cs 0.2 0 0.2  scn
/TT0 1 Tf
12 -0 0 12 72 709.2 Tm
( )Tj
/C2_0 1 Tf
36 -0 0 36 72 672.72 Tm
<0727>Tj
/TT0 1 Tf
0.443 0 Td
(\343)Tj
/C2_0 1 Tf
0.443 0 Td
<0727>Tj
0.447 -0.17 Td
<047A>Tj
/TT0 1 Tf
-0.003 0.17 Td
(\325)Tj
/C2_0 1 Tf
0.723 0 Td
<0690>Tj
0.557 0.047 Td
<047A>Tj
0.11 -0.047 Td
<072D072D>Tj
0.853 -0.17 Td
<047A>Tj
-0.013 0.17 Td
<0699>Tj
0.473 0.047 Td
<047A>Tj
/TT0 1 Tf
12 -0 0 12 218.16 672.72 Tm
( )Tj
EMC
ET

endstream
endobj

The text stream from Nuance is as follows:

7 0 obj
<<
/Length 600
>>
stream
0.1999 0 0.1999 rg
[]0 d 1 w 10 M 0 i 0 J 0 j
BT
/F0 35.029 Tf
1 0 0 1 28.789 774.789 Tm
0 Tc 0 Tw 0 Tr 100 Tz 0 Ts
(')Tj
1 0 0 1 44.268 774.789 Tm
(\000m)Tj
ET
BT
/F0 35.029 Tf
1 0 0 1 59.868 774.789 Tm
(')Tj
1 0 0 1 75.467 769.03 Tm
(z)Tj
ET
BT
/F1 35.029 Tf
0.9999 0 0 0.9999 75.778 774.782 Tm
( )Tj
ET
BT
/F0 35.029 Tf
1 0 0 1 100.787 774.789 Tm
( )Tj
1 0 0 1 120.106 776.469 Tm
(z)Tj
1 0 0 1 124.066 774.789 Tm
(-)Tj
1 0 0 1 138.825 774.789 Tm
(-)Tj
1 0 0 1 153.825 769.03 Tm
(z)Tj
ET
BT
/F0 35.029 Tf
1 0 0 1 153.585 774.789 Tm
( )Tj
1 0 0 1 170.145 776.469 Tm
(z)Tj
ET

endstream
endobj


I was rather surprised to see that Nuance had control characters in the Tj parameters.  Acrobat has some too, but mostly uses hex-quads.

Not counting the leading and trailing space characters that got included, Acrobat emits 12 characters, which means that it doesn't compose them in the font creation.
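
The count of 12 can be reproduced with a rough Python sketch. It makes two assumptions that the stream alone does not prove: that each hex string holds two-byte codes (four hex digits per glyph), and that each literal string holds one byte per glyph, with an octal escape such as \343 counting as one byte. Only the Tf and Tj lines of the Acrobat stream are copied in, and the regex is not a general PDF tokenizer.

```python
import re
from collections import Counter

# Tf and Tj lines copied from the Acrobat stream (Td/Tm lines omitted).
ACROBAT_STREAM = r"""
/TT0 1 Tf
( )Tj
/C2_0 1 Tf
<0727>Tj
/TT0 1 Tf
(\343)Tj
/C2_0 1 Tf
<0727>Tj
<047A>Tj
/TT0 1 Tf
(\325)Tj
/C2_0 1 Tf
<0690>Tj
<047A>Tj
<072D072D>Tj
<047A>Tj
<0699>Tj
<047A>Tj
/TT0 1 Tf
( )Tj
"""

# Matches a font selection, a hex string shown with Tj, or a simple
# literal string (no nested parentheses) shown with Tj.
TOKEN = re.compile(
    r"/(\w+)\s+[\d.]+\s+Tf|<([0-9A-Fa-f]+)>\s*Tj|\((.*?)\)\s*Tj"
)
OCTAL = re.compile(r"\\[0-7]{1,3}")


def count_codes(stream):
    """Count glyph codes shown per font, skipping space-only strings."""
    counts = Counter()
    font = None
    for match in TOKEN.finditer(stream):
        name, hexstr, literal = match.groups()
        if name:
            font = name
        elif hexstr:
            # Assumption: two bytes (four hex digits) per code.
            counts[font] += len(hexstr) // 4
        else:
            # Collapse each octal escape to a single placeholder byte.
            text = OCTAL.sub("?", literal)
            if text.strip():
                counts[font] += len(text)
    return counts


counts = count_codes(ACROBAT_STREAM)
print(dict(counts), sum(counts.values()))
```

Keeping the tally per font resource also shows which font each code was shown with, which is handy when guessing what each embedded font contains.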

I'm really not sure how to count the control characters... most of the Nuance Tj have both a control character and a regular character.  Maybe that is a form of CID mapping?  It emits 23 characters using Tj, unless the control character+regular character pair should be counted as one, in which case it emits 12 characters... which sounds more correct.

What I notice particularly about this compared to other PDF files I have looked at the internals of, is that both Acrobat and Nuance emit text movement operators between _each_ character (except one pair, in the case of Acrobat, which are the sequential ɛ characters, one with and one without a diacritical).  Acrobat uses Td, and Nuance uses Tm.

So my "guessing about a lot of things I haven't figured out" conclusion, 
without knowing how to look at the embedded fonts, is that both Acrobat 
and Nuance are doing the kerning and character composition positioning 
on the way in to the PDF file, rather than expecting the PDF display 
tool renderer to be smart. This is consistent with my guessing after 
reading the quote from that safarionline book.  No clue how they figure 
out the numbers... no doubt it is either from the font files directly, 
using their own rendering code, or from some font rendering library, or 
from Windows somehow. The latter seems doubtful for Acrobat, since it 
also runs on Mac... although that is no guarantee it uses the same 
(recompiled) code on both platforms... it could get it from Windows on 
Windows and from OS X on Mac.

I didn't attempt to do the math to see what all those Tm and Td 
operations are doing, but they seem to have produced equivalent results, 
visually. I sort of know what a matrix multiply is, but have maybe only 
done one by hand in some math class many years ago, and haven't figured 
out _how_ they scale and move graphics blobs, or what the 6 or 9 numbers 
actually mean in 2D space, but am aware that 2D graphics scaling, 
rotation, skewing, and translation can all be done via matrix math. 
There should be no skewing or rotation for this work, which probably 
simplifies the Tm operand... enough that Acrobat knows how to shoehorn 
it into the Td operator's smaller set of parameters. Quickly referring 
to the documentation for Td and Tm, it seems that   p1 p2 Td   is close 
to shorthand for   1 0 0 1 p1 p2 Tm, with two differences: Td moves 
relative to the start of the current line rather than to an absolute 
position, and its operands are scaled by the text matrix in effect 
(36 0 0 36 in Acrobat's case). All of Nuance's Tm parameter lists start 
with  1 0 0 1, so they are missing a space optimization in their PDF 
generation (well, one of them is .9999 0 0 .9999, and that is likely a 
rounding error).
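
To do the math on Acrobat's Td operands, here is a small Python sketch that converts them into absolute page positions, so they can be compared with Nuance's Tm values. It follows the PDF reference's definition of Td: the new line matrix is [1 0 0 1 tx ty] multiplied by the current line matrix, so each offset is relative to the previous line start and scaled by the 36 0 0 36 matrix set earlier. The helper names are made up for illustration.

```python
def mat_mul(m, n):
    """Multiply two PDF matrices, each given as [a, b, c, d, e, f]."""
    a0, b0, c0, d0, e0, f0 = m
    a1, b1, c1, d1, e1, f1 = n
    return [
        a0 * a1 + b0 * c1,
        a0 * b1 + b0 * d1,
        c0 * a1 + d0 * c1,
        c0 * b1 + d0 * d1,
        e0 * a1 + f0 * c1 + e1,
        e0 * b1 + f0 * d1 + f1,
    ]


# From the Acrobat stream: "36 -0 0 36 72 672.72 Tm", then the ten
# Td operand pairs in the order they appear.
TM = [36, 0, 0, 36, 72, 672.72]
TD_OFFSETS = [
    (0.443, 0), (0.443, 0), (0.447, -0.17), (-0.003, 0.17),
    (0.723, 0), (0.557, 0.047), (0.11, -0.047), (0.853, -0.17),
    (-0.013, 0.17), (0.473, 0.047),
]


def absolute_positions(tm, offsets):
    """Apply each Td to the line matrix and collect (x, y) in points."""
    line_matrix = list(tm)
    positions = []
    for tx, ty in offsets:
        # Td: new line matrix = [1 0 0 1 tx ty] x old line matrix.
        line_matrix = mat_mul([1, 0, 0, 1, tx, ty], line_matrix)
        positions.append((round(line_matrix[4], 3),
                          round(line_matrix[5], 3)))
    return positions


for x, y in absolute_positions(TM, TD_OFFSETS):
    print(x, y)
```

The first Td lands at x = 72 + 0.443 × 36 = 87.948, an advance of about 15.9 points, against roughly 15.5 to 15.6 points between Nuance's first few Tm positions at its slightly smaller 35.029 font size. The -0.17 offsets work out to a drop of 6.12 points, against Nuance's drop of 5.759 points (774.789 to 769.03).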

I find it interesting that both tools generated two fonts... there are 
only 8 Unicode codepoints, but I could imagine that the "upper case" and 
"lower case" diacriticals could be separate glyphs, even though they 
look the same, but that would still only be 9. Hardly justification for 
a second font...

In the Acrobat file, it seems from the sequence of character codes used, 
that the /TT0 font contains the precomposed characters (\343 and \325, 
which would be ã and Õ in Latin-1), and all the others (base characters, 
and combining diacriticals) are in the /C2_0 font.  For Nuance, the 
precomposed Õ seems to be the only one in /F1, with the rest all coming 
from /F0.  Curiously, that one is also the one that has the 
.9999 0 0 .9999 matrix data.

Well, that is about the end of what I can figure from this information.