[reportlab-users] Khmer script not correctly rendered
Andy Robinson
andy at reportlab.com
Fri Aug 23 06:50:46 EDT 2024
Hi Matthias,
Thanks very much for your work on this. We've been looking into it over
the summer and making some progress. Could you contact me on
andy at reportlab.com when it's convenient, please?
- Andy
On Saturday 22 June 2024 at 06:20:41 UTC+1 matthia... at gmail.com wrote:
The investigation of this phenomenon continued in my GitHub project in this
issue: https://github.com/kreier/timeline/issues/35
It seems this problem could be solved if a font shaping engine like HarfBuzz
were integrated into reportlab. For simpler combined characters and for
Arabic (with import arabic_reshaper and reshaped =
arabic_reshaper.reshape(exam_name) ) this is already done in reportlab. And
continuing on a post in this forum here from 2005
<https://groups.google.com/g/reportlab-users/c/scxAhaReanI/m/IYSaDfoH9ZkJ>
Andy noted:
*> We are trying to work out the right font descriptors and sequences of
bytes to put in the PDF file so that the right stuff magically happens on
screen.*
And I think with HarfBuzz this would actually be possible. Going back to
the example mentioned above (and in my issue 35): if we use the Khmer word
for years, ឆ្នាំ, it is represented by five Unicode code points: '\u1786\u17D2\u1793\u17B6\u17C6'.
But the glyphs that have to be referenced in the PDF are
*uni178617B6*, *uni17D21793* and *uni17C6*. While the last one looks
the same as its code point, the others are not Unicode code points but
glyphs in the font file for these specific ligatures. We also need a little
more information: by how much our "cursor" should move forward after each
glyph (the first one has a width of 923, the others have zero) and how the
glyphs should be positioned relative to the first glyph. This information
would be integrated into the stream for the PDF file (I don't know how this
stream is generated :( in reportlab), but all the required information is
given by HarfBuzz.
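To make the mapping above concrete, here is a minimal sketch using only the
Python standard library. It assumes the glyph names follow the common
"uni" + groups-of-four-hex-digits naming convention, so that a name lists the
code points a glyph stands for. The helper names are made up for
illustration, and the advances are the values quoted above. It does not call
HarfBuzz; it only checks that the three glyphs cover the same five code
points, in a different order.

```python
# Sketch only: decode "uniXXXX..." glyph names back into code points.
# Assumption: the font names its glyphs with "uni" + 4-hex-digit groups;
# real fonts may use other names, and a shaper returns glyph IDs instead.

TEXT = "\u1786\u17D2\u1793\u17B6\u17C6"  # the Khmer word discussed above

# (glyph name, advance) pairs as quoted in this post
SHAPED = [("uni178617B6", 923), ("uni17D21793", 0), ("uni17C6", 0)]


def glyph_name_to_codepoints(name):
    """Return the code points encoded in a 'uniXXXX[XXXX...]' glyph name."""
    if not name.startswith("uni"):
        raise ValueError(f"not a uni-style glyph name: {name!r}")
    digits = name[3:]
    if not digits or len(digits) % 4 != 0:
        raise ValueError(f"expected groups of four hex digits: {name!r}")
    return [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]


def covered_codepoints(shaped):
    """Concatenate the code points named by every glyph, in glyph order."""
    result = []
    for name, _advance in shaped:
        result.extend(glyph_name_to_codepoints(name))
    return result


def total_advance(shaped):
    """Sum of the horizontal advances of all glyphs in the run."""
    return sum(advance for _name, advance in shaped)


if __name__ == "__main__":
    for name, advance in SHAPED:
        cps = " ".join(f"U+{cp:04X}" for cp in glyph_name_to_codepoints(name))
        print(f"{name:12} advance={advance:4}  <- {cps}")
    print("total advance:", total_advance(SHAPED))
```

The reordering is the important part: the vowel sign U+17B6 ends up in the
first glyph although it is the fourth code point in the text.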
I'm not sure if functions like *instanceStringWidthTTF*
<https://github.com/MrBitBucket/reportlab-mirror/blob/master/src/reportlab/lib/rl_accel.py#L106>
would work, since they take a *utf-8* encoded string as the text argument, but
*uni178617B6* and *uni17D21793* are not Unicode code points and therefore
cannot be represented in utf-8. It's probably a lot of work. But it looked
like @replarobin Robin Becker was interested in starting this project. I
still got no response to my request to sign up to the official mailing list
and can't post there, so I am posting this little update here.
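One way to see why a utf-8 text argument is not enough: after shaping, what
has to be written is a list of glyph IDs, not characters. The sketch below
assumes a composite (Type0) font with the Identity-H encoding, where each
code in the content stream is a 2-byte value, and further assumes those
values map directly to glyph IDs. I don't know whether reportlab embeds fonts
this way. The function name and the glyph IDs are hypothetical; they are not
taken from Noto Sans Khmer.

```python
# Sketch only: write shaped glyph IDs as a PDF hex string.
# Assumptions: Identity-H encoding with 2-byte codes that equal glyph IDs.


def glyph_ids_to_tj(glyph_ids):
    """Format glyph IDs as a PDF hex string followed by the Tj operator."""
    codes = []
    for gid in glyph_ids:
        if not 0 <= gid <= 0xFFFF:
            raise ValueError(f"glyph ID out of 2-byte range: {gid}")
        codes.append(f"{gid:04X}")  # four hex digits = one 2-byte code
    return "<" + "".join(codes) + "> Tj"


if __name__ == "__main__":
    hypothetical_gids = [291, 69, 7]  # made-up IDs, for illustration only
    print(glyph_ids_to_tj(hypothetical_gids))
```

Positioning the zero-width glyphs relative to the first one would need
additional offsets in the stream; that part is left out here.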
Finally, a little visual of how the font shaping would work, replacing the
five Unicode code points with three glyphs for the example above:
[image: khmer_shape.png]
On Thursday 30 May 2024 at 20:16:08 UTC+7 Matthias Kreier wrote:
Digging around in the last 10 days gave me a better understanding of the
problem. Some languages, scripts and glyphs need replacement
tables/ligatures to properly render the intended text written in the Unicode
sequence. It is not an easy task. As I found in a Microsoft document, there
are some 634 language tags
<https://learn.microsoft.com/en-us/typography/opentype/spec/languagetags>
supported in software to properly render these languages in one of 173
scripts
<https://learn.microsoft.com/en-us/typography/opentype/spec/scripttags>.
Luckily, most of the heavy lifting is already done or under constant
refinement, namely in fontTools <https://github.com/fonttools/fonttools> and
HarfBuzz <https://github.com/harfbuzz/harfbuzz>. Another project for
creating PDF documents with Python, fpdf2 <https://py-pdf.github.io/fpdf2/>,
solved this problem in 2022 with the inclusion of the mentioned tools
<https://github.com/py-pdf/fpdf2/pull/477>. It might be an option for
reportlab, given the required manpower (from the company or community).
I documented my findings here <https://github.com/kreier/timeline/issues/35>.
I know the implementation of the ligature rendering process will require
some time and work. Yet otherwise I might have to shift to another python
base for my project. Andy probably knows what's best for his company.
Here is some example code that solves the problem with fpdf2:
# example rendering Khmer
from fpdf import FPDF
pdf = FPDF(orientation="P", unit="mm", format="A4")
pdf.add_page()
pdf.add_font("noto", style="", fname="../../fonts/NotoKhmer.ttf")
pdf.set_font('noto', size=32)
pdf.cell(text="King - ស្តេច", new_x="LMARGIN", new_y="NEXT")
pdf.cell(text="Prophet - ហោរា", new_x="LMARGIN", new_y="NEXT")
pdf.set_font("Helvetica", size=12)
pdf.cell(h=20, text="Now using __text_shaping__ with **uharfbuzz**:",
         markdown=True, new_x="LMARGIN", new_y="NEXT")
pdf.set_font("noto", size=32)
pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")
pdf.cell(text="King - ស្តេច", new_x="LMARGIN", new_y="NEXT")
pdf.cell(text="Prophet - ហោរា", new_x="LMARGIN", new_y="NEXT")
pdf.output("example_fpdf.pdf")
And the output:
[image: Screenshot 2024-05-30 201445.png]
On Tuesday 21 May 2024 at 13:28:19 UTC+7 Matthias Kreier wrote:
Here is a simple example for the syllable ssa "ស្ស". In reportlab the
result is [image: Screenshot 2024-05-21 132037.png]. Some further
explanation:
The Khmer syllable "ស្ស" (ssa) consists of a base consonant followed by a
subscript consonant. Here’s a detailed breakdown of the Unicode sequence:
1. *Base Consonant:* ស (SA) - U+179F
2. *Subscript Consonant:* ្ស (subscript SA) - U+17D2 (KHMER SIGN COENG) +
U+179F (subscript form of SA)
*Unicode Sequence*
1. *Base Consonant:*
- U+179F (ស)
2. *Subscript Consonant:*
- U+17D2 (KHMER SIGN COENG)
- U+179F (subscript form of SA)
*Full Unicode Sequence*
Putting these together, the full Unicode sequence for "ស្ស" is:
U+179F (ស)
U+17D2 (្)
U+179F (្ស)
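The sequence can be checked with Python's standard unicodedata module. This
small sketch (the function name is just for illustration) prints the official
character names and shows that the third code point is the same KHMER LETTER
SA as the first one. There is no separate code point for the subscript form;
choosing it is left to the renderer.

```python
import unicodedata

SYLLABLE = "\u179F\u17D2\u179F"  # the syllable "ssa" from this example


def describe(text):
    """Return (U+XXXX, Unicode character name) for every code point."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    for code, name in describe(SYLLABLE):
        print(code, name)
```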
*UTF-8 Encoding*
To represent this sequence in UTF-8, each Unicode code point is converted
to its corresponding UTF-8 byte sequence:
- *U+179F* (ស) in UTF-8: E1 9E 9F
- *U+17D2* (្) in UTF-8: E1 9F 92
- *U+179F* (subscript SA) in UTF-8: E1 9E 9F
*Full UTF-8 Sequence*
Combining these, the UTF-8 encoding for the sequence "ស្ស" is:
*E1 9E 9F E1 9F 92 E1 9E 9F*
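The byte sequence can be reproduced in a few lines of Python; this is only a
check of the encoding above, and the helper name is made up:

```python
SYLLABLE = "\u179F\u17D2\u179F"  # the syllable "ssa" from this example


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    print(utf8_hex(SYLLABLE))
```

Three code points become nine bytes, three bytes each.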
*Rendering Process*
1. *Base Consonant:* The rendering engine identifies the base consonant ស
(U+179F).
2. *Subscript Consonant:* It recognizes the subscript sign (KHMER SIGN
COENG, U+17D2) and attaches the following consonant to the base consonant
in its subscript form.
3. *Combination:* The engine renders the subscript consonant properly
positioned under the base consonant.
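The grouping step described in this list can be sketched as follows. This is
a deliberately simplified rule, assuming that every combining mark and every
character directly after the COENG sign belongs to the current cluster. It is
not the full Unicode segmentation algorithm and not what a shaping engine
like HarfBuzz actually does internally; the function name is made up.

```python
import unicodedata

COENG = "\u17D2"  # KHMER SIGN COENG, announces a subscript consonant


def split_clusters(text):
    """Group code points into clusters of base + marks + subscripts."""
    clusters = []
    for ch in text:
        # Mn/Mc/Me categories are combining marks (vowel signs, COENG, ...)
        is_mark = unicodedata.category(ch).startswith("M")
        # a consonant right after COENG is a subscript of the same cluster
        after_coeng = bool(clusters) and clusters[-1].endswith(COENG)
        if clusters and (is_mark or after_coeng):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    for word in ("\u179F\u17D2\u179F", "\u17A0\u17C4\u179A\u17B6"):
        parts = split_clusters(word)
        print([" ".join(f"U+{ord(c):04X}" for c in part) for part in parts])
```

Each resulting cluster is a unit that has to be shaped together; rendering
its code points one by one is what produces the broken output.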
In summary, the Unicode sequence for "ស្ស" involves a base consonant
followed by a subscript sign and another consonant, encoded and rendered
according to the rules of the Khmer script. The UTF-8 encoding ensures each
character is correctly represented in byte form, which the rendering engine
interprets to display the correct combined character.
On Tuesday 21 May 2024 at 01:54:47 UTC+7 Matthias Kreier wrote:
I use Khmer script in my project and have the text in utf-8 and use the
Noto Sans Khmer ttf font file. For comparison I have the same text with the
same font in Word (left) and as result from a reportlab run (right):
[image: Screenshot 2024-05-21 015018.png]
The text should be:
54មនុស្ស
12ចៅក្រម
19ហោរា
53ស្តេច
82រយៈពេល
37ព្រឹត្តិការណ៍
18វត្ថុឬវត្ថុ
The font is: https://fonts.google.com/noto/specimen/Noto+Sans+Khmer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: khmer_shape.png
Type: image/png
Size: 36850 bytes
Desc: not available
URL: <https://pairlist2.pair.net/pipermail/reportlab-users/attachments/20240823/8126698c/attachment-0001.png>