[reportlab-users] Khmer script not correctly rendered

Fri Aug 23 06:50:46 EDT 2024

Hi Matthias,

Thanks very much for your work on this.  We've been looking into it over 
the summer and making some progress.  Could you contact me on 
andy at reportlab.com when it's convenient, please?

- Andy

On Saturday 22 June 2024 at 06:20:41 UTC+1 matthia... at gmail.com wrote:

The investigation of this phenomenon continued in my GitHub project, in this 
issue: https://github.com/kreier/timeline/issues/35 

It seems this problem could be solved if a text shaping engine such as 
HarfBuzz were integrated into reportlab. For simpler combined characters and 
for Arabic (with import arabic_reshaper and reshaped = 
arabic_reshaper.reshape(exam_name) ) this can already be done with 
reportlab. And continuing from a post in this forum from 2005 
<https://groups.google.com/g/reportlab-users/c/scxAhaReanI/m/IYSaDfoH9ZkJ>, 
where Andy noted:

*> We are trying to work out the right font descriptors and sequences of 
bytes to put in the PDF file so that the right stuff magically happens on 
screen.*

And I think with HarfBuzz this would actually be possible. Going back to 
the example mentioned above (and in my issue 35): the Khmer word for 
years, ឆ្នាំ, is represented by five Unicode code points: 
'\u1786\u17D2\u1793\u17B6\u17C6'. But what has to be inserted in the PDF to 
select the right glyphs are the three glyphs *uni178617B6*, *uni17D21793* 
and *uni17C6*. While the last one looks like a code point, the others are 
not Unicode code points but names of glyphs in the font file for these 
specific ligatures. We also need a little more information: by how much our 
"cursor" should move forward after each glyph (the first one has a width of 
923, the others have zero) and how the glyphs should be positioned relative 
to the first glyph. This information would have to be integrated into the 
stream for the pdf file (I don't know how this stream is generated in 
reportlab :( ), but all the required information is provided by HarfBuzz. 
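
To make the cursor arithmetic concrete, here is a minimal sketch in plain 
Python. It only uses the three glyph names and advance widths quoted above. 
The names ShapedGlyph and pen_positions are made up for this illustration, 
the 1000 units per em is an assumption about the font, and the offsets are 
left at zero because their real values would have to come from the shaper:

```python
from dataclasses import dataclass


@dataclass
class ShapedGlyph:
    """One glyph as a shaper would report it (hypothetical container)."""
    name: str
    x_advance: int      # font units the pen moves after drawing the glyph
    x_offset: int = 0   # placeholder: real offsets come from the shaper
    y_offset: int = 0   # placeholder: real offsets come from the shaper


def pen_positions(glyphs, font_size, units_per_em=1000):
    """Return [(name, x, y)] in points plus the total advance of the run.

    units_per_em=1000 is an assumption; read the real value from the font.
    """
    scale = font_size / units_per_em
    pen_x = 0.0
    placed = []
    for g in glyphs:
        # A glyph is drawn at the current pen position plus its own offset.
        placed.append((g.name, pen_x + g.x_offset * scale, g.y_offset * scale))
        # Only the advance moves the pen; zero-advance glyphs leave it alone.
        pen_x += g.x_advance * scale
    return placed, pen_x


# Glyph names and advances as quoted in the post; offsets are unknown here.
word = [
    ShapedGlyph("uni178617B6", 923),
    ShapedGlyph("uni17D21793", 0),
    ShapedGlyph("uni17C6", 0),
]
placed, total = pen_positions(word, font_size=10)
for name, x, y in placed:
    print(f"{name:12s} x={x:.2f} y={y:.2f}")
print(f"total advance: {total:.2f} pt")
```

The point of the sketch is that only the first glyph contributes to the 
width of the word; the other two are drawn without moving the pen, so their 
placement depends entirely on the offsets the shaper reports.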

I'm not sure if functions like *instanceStringWidthTTF* 
<https://github.com/MrBitBucket/reportlab-mirror/blob/master/src/reportlab/lib/rl_accel.py#L106> 
would work, since they take a *utf-8* encoded string as the text argument, 
but *uni178617B6* and *uni17D21793* are not Unicode code points and 
therefore cannot be represented in utf-8. It's probably a lot of work. But 
it looked like Robin Becker (@replarobin) was interested in starting this 
project. I still have had no response to my request to sign up for the 
official mailing list and can't post there, so I am posting this little 
update here.
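
As an aside on why these glyph names are not code points: names of the form 
"uni" followed by groups of four hex digits follow the naming convention of 
the Adobe Glyph List Specification, where each group is one code point that 
the glyph stands for. The sketch below assumes the font follows that 
convention (the function name is made up for this example) and maps the 
three glyph names back to positions in the original five-character string. 
A shaper itself would hand back numeric glyph IDs rather than names.

```python
import re

# "uni" followed by one or more groups of exactly four uppercase hex digits.
_UNI_NAME = re.compile(r"^uni((?:[0-9A-F]{4})+)$")


def codepoints_from_glyph_name(name):
    """Split a 'uniXXXX[YYYY...]' glyph name into its code points."""
    match = _UNI_NAME.match(name)
    if match is None:
        raise ValueError(f"not a uniXXXX-style glyph name: {name!r}")
    digits = match.group(1)
    return [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]


text = "\u1786\u17D2\u1793\u17B6\u17C6"          # the word from the post
glyph_names = ["uni178617B6", "uni17D21793", "uni17C6"]

for name in glyph_names:
    cps = codepoints_from_glyph_name(name)
    # Every character occurs only once in this word, so index() is enough.
    positions = [text.index(chr(cp)) for cp in cps]
    labels = " ".join(f"U+{cp:04X}" for cp in cps)
    print(f"{name:12s} {labels:14s} text positions {positions}")
```

The first glyph covers the characters at positions 0 and 3, the second 
those at positions 1 and 2, so the glyph sequence is not a simple 
left-to-right grouping of the input string.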

Finally, a little visual of how the font shaping would work, replacing the 
five Unicode code points with three glyphs for the example above:

[image: khmer_shape.png]

On Thursday 30 May 2024 at 20:16:08 UTC+7 Matthias Kreier wrote:

Digging around over the last 10 days gave me a better understanding of the 
problem. Some languages, scripts and glyphs need substitution 
tables/ligatures to properly render the intended text written as a Unicode 
sequence. It is not an easy task. As I found in a Microsoft document, there 
are some 634 language tags 
<https://learn.microsoft.com/en-us/typography/opentype/spec/languagetags> 
that software can support to properly render these languages in one of 173 
scripts 
<https://learn.microsoft.com/en-us/typography/opentype/spec/scripttags>. 
Luckily most of the heavy lifting is already done, or is in a constant 
process of refinement, namely in Fonttools 
<https://github.com/fonttools/fonttools> and 
Harfbuzz <https://github.com/harfbuzz/harfbuzz>. Another project for 
creating pdf documents with python, fpdf2 <https://py-pdf.github.io/fpdf2/>, 
solved this problem in 2022 with the inclusion of the mentioned tools 
<https://github.com/py-pdf/fpdf2/pull/477>. It might be an option for 
reportlab, given the required manpower (from the company or community).

I documented my findings here <https://github.com/kreier/timeline/issues/35>. 
I know the implementation of the ligature rendering process will require 
some time and work. Otherwise, though, I might have to switch to another 
Python library for my project. Andy probably knows what's best for his 
company.

Here is some example code that solves the problem (using fpdf2):

# example rendering Khmer
from fpdf import FPDF
pdf = FPDF(orientation="P", unit="mm", format="A4")
pdf.add_page()
pdf.add_font("noto", style="", fname="../../fonts/NotoKhmer.ttf")
pdf.set_font('noto', size=32)
pdf.cell(text="King        - ស្តេច", new_x="LMARGIN", new_y="NEXT")
pdf.cell(text="Prophet - ហោរា",     new_x="LMARGIN", new_y="NEXT")
pdf.set_font("Helvetica", size=12)
pdf.cell(h=20, text="Now using __text_shaping__ with **uharfbuzz**:",
         markdown=True, new_x="LMARGIN", new_y="NEXT")
pdf.set_font("noto", size=32)
pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")
pdf.cell(text="King        - ស្តេច", new_x="LMARGIN", new_y="NEXT")
pdf.cell(text="Prophet - ហោរា",     new_x="LMARGIN", new_y="NEXT")
pdf.output("example_fpdf.pdf")

And the output:

[image: Screenshot 2024-05-30 201445.png]

On Tuesday 21 May 2024 at 13:28:19 UTC+7 Matthias Kreier wrote:

Here is a simple example for the syllable ssa "ស្ស". In reportlab the 
result is [image: Screenshot 2024-05-21 132037.png]. Some further 
explanation:

The Khmer syllable "ស្ស" (ssa) consists of a base consonant followed by a 
subscript consonant. Here’s a detailed breakdown of the Unicode sequence:

1. *Base Consonant:* ស (SA) - U+179F
2. *Subscript Consonant:* ្ស (subscript SA) - U+17D2 (KHMER SIGN COENG) + 
U+179F (subscript form of SA)

*Unicode Sequence*

1. *Base Consonant:*
   - U+179F (ស)

2. *Subscript Consonant:*
   - U+17D2 (KHMER SIGN COENG) 
   - U+179F (subscript form of SA)

*Full Unicode Sequence*

Putting these together, the full Unicode sequence for "ស្ស" is:

U+179F (ស) 
U+17D2 (្) 
U+179F (្ស)

*UTF-8 Encoding*

To represent this sequence in UTF-8, each Unicode code point is converted 
to its corresponding UTF-8 byte sequence:

- *U+179F* (ស) in UTF-8: E1 9E 9F
- *U+17D2* (្) in UTF-8: E1 9F 92
- *U+179F* (subscript SA) in UTF-8: E1 9E 9F

*Full UTF-8 Sequence*

Combining these, the UTF-8 encoding for the sequence "ស្ស" is:

*E1 9E 9F E1 9F 92 E1 9E 9F*
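
The code points and bytes above can be checked with a few lines of standard 
Python (nothing here is specific to reportlab):

```python
import unicodedata

syllable = "\u179F\u17D2\u179F"   # base SA, COENG, SA

# One line per code point: number, official Unicode name, UTF-8 bytes.
for ch in syllable:
    utf8 = ch.encode("utf-8").hex(" ").upper()
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):20s} {utf8}")

# The whole syllable as one UTF-8 byte sequence.
utf8_hex = syllable.encode("utf-8").hex(" ").upper()
print(utf8_hex)
```

Note that the string has three code points and nine bytes, but it should be 
drawn as a single stacked cluster. The encoding carries no information about 
that; the stacking is decided later during shaping.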

*Rendering Process*

1. *Base Consonant:* The rendering engine identifies the base consonant ស 
(U+179F).
2. *Subscript Consonant:* It recognizes the subscript sign (KHMER SIGN 
COENG, U+17D2) and attaches the following consonant to the base consonant 
in its subscript form.
3. *Combination:* The engine renders the subscript consonant properly 
positioned under the base consonant.
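
The first two steps can be sketched in a few lines of Python. This is only 
a simplified illustration with a made-up function name, not how HarfBuzz or 
any real shaper is implemented: it treats U+1780 to U+17A2 as the Khmer 
consonants and attaches every COENG + consonant pair to the character 
before it. Vowel signs, reordering and glyph selection are ignored.

```python
COENG = "\u17D2"  # KHMER SIGN COENG


def is_khmer_consonant(ch):
    """True for KHMER LETTER KA (U+1780) through KHMER LETTER QA (U+17A2)."""
    return "\u1780" <= ch <= "\u17A2"


def split_base_and_subscripts(text):
    """Group text into (base, [subscript consonants]) pairs.

    Simplified sketch: a COENG followed by a consonant is attached to the
    preceding entry; every other character starts a new entry.
    """
    clusters = []
    i = 0
    while i < len(text):
        ch = text[i]
        has_next = i + 1 < len(text)
        if ch == COENG and has_next and is_khmer_consonant(text[i + 1]) and clusters:
            # Steps 1 and 2: the consonant after COENG becomes a subscript
            # of the base that was seen last.
            clusters[-1][1].append(text[i + 1])
            i += 2
        else:
            clusters.append((ch, []))
            i += 1
    return clusters


print(split_base_and_subscripts("\u179F\u17D2\u179F"))
```

For the syllable above this yields one entry: the base SA with one 
subscript SA. Step 3, choosing and positioning the actual subscript glyph, 
depends on the font's OpenType tables and is what a shaping engine provides.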

In summary, the Unicode sequence for "ស្ស" involves a base consonant 
followed by a subscript sign and another consonant, encoded and rendered 
according to the rules of the Khmer script. The UTF-8 encoding ensures each 
character is correctly represented in byte form, which the rendering engine 
interprets to display the correct combined character.

On Tuesday 21 May 2024 at 01:54:47 UTC+7 Matthias Kreier wrote:

I use Khmer script in my project, have the text in utf-8 and use the 
Noto Sans Khmer ttf font file. For comparison I have the same text with the 
same font in Word (left) and as the result of a reportlab run (right):

[image: Screenshot 2024-05-21 015018.png]
The text should be: 
54មនុស្ស
12ចៅក្រម
19ហោរា
53ស្តេច
82រយៈពេល
37ព្រឹត្តិការណ៍
18វត្ថុឬវត្ថុ

The font is: https://fonts.google.com/noto/specimen/Noto+Sans+Khmer 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: khmer_shape.png
Type: image/png
Size: 36850 bytes
Desc: not available
URL: <https://pairlist2.pair.net/pipermail/reportlab-users/attachments/20240823/8126698c/attachment-0001.png>