[reportlab-users] Paragraphs and hyphenation

Henning von Bargen reportlab-users@reportlab.com
Fri, 28 Nov 2003 14:28:54 +0100


Dinu Gherman and others (myself included) would like to have
better hyphenation support for Paragraphs.

#  Hyphenation support for tables is not necessary IMHO,
#  because you can use Paragraphs inside table cells

I just made some initial steps in this direction.

I modified paragraph.py so that it works without painting over the right
margin (for example if you have long words in a narrow table column).

AS OF NOW, THIS IS IMPLEMENTED ONLY FOR UNFORMATTED PARAGRAPHS

If the word does not fit into the line, the modified paragraph breakLines
method
- first tries to find good hyphenation points (right now only by looking for
  "-", ";", "/" in the word).
  If it finds one, it splits the word at this point and puts the rest into
  the next line.
- If there's no good hyphenation point AND the current line is still empty,
  the algorithm (at this time) puts as many characters as possible into the
  line, and the rest flows into the next line.
- If the word does not fit and the current line is not empty, the word flows
  into the next line.
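
The three rules above can be sketched roughly as follows. This is only an
illustration, not the modified paragraph.py: it measures width in characters
(as in the fixed-width example below) instead of real string widths, and the
names SPLIT_CHARS, split_word and break_lines are made up for this sketch.

```python
SPLIT_CHARS = "-;/"  # the "good" break characters mentioned above


def split_word(word, room):
    """Split word after the right-most break character that fits into room.

    Returns (head, tail), or None if no good break point fits.
    """
    # Never split after the last character, so the tail is never empty.
    for i in range(min(room, len(word) - 1), 0, -1):
        if word[i - 1] in SPLIT_CHARS:
            return word[:i], word[i:]
    return None


def break_lines(text, width):
    """Break text into lines of at most width characters (assumed >= 1)."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines, current = [], ""
    words = text.split()
    while words:
        word = words.pop(0)
        candidate = word if not current else current + " " + word
        if len(candidate) <= width:
            current = candidate
            continue
        # The word does not fit: how much room is left on this line?
        room = width - len(current) - (1 if current else 0)
        parts = split_word(word, room) if room > 0 else None
        if parts:
            # Rule 1: split at a good point, the rest goes to the next line.
            head, tail = parts
            lines.append(head if not current else current + " " + head)
            current = ""
            words.insert(0, tail)
        elif not current:
            # Rule 2: empty line, no good point: fill the line by force.
            lines.append(word[:width])
            words.insert(0, word[width:])
        else:
            # Rule 3: line not empty: the whole word flows to the next line.
            lines.append(current)
            current = ""
            words.insert(0, word)
    if current:
        lines.append(current)
    return lines
```

Calling break_lines("Example1: this is a hyphenation-example", 9) should give
the same line breaks as the "Output 9" column in the example below.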

Example (view with a fixed-width font):

Input:
Example1: this is a hyphenation-example
Example2: Bundesverkehrsministerium

Output 9_     Output 12___
Example1:     Example1:
this is a     this is a
hyphenati     hyphenation-
on-           example
example
              Example2:
Example2:     Bundesverkeh
Bundesver     rsministeriu
kehrsmini     m
sterium

The next step would be to optionally attach a hyphenation
object to a paragraph (or a fragment).
If this hyphenation object is not None,
it could be called from inside the breakLines algorithm.

The obvious solution for English hyphenation would be pyhnj.
However, for the German language, pyhnj (libhnj) does
not work very well.
I developed a prototype for German hyphenation
using the basic SiSiSi algorithm
(see http://www.ads.tuwien.ac.at/research/SiSiSi/ ).
I also asked the author if the SiSiSi solution could
be donated to the public, but she declined.
But the description of the basic algorithm is public
and the basic algorithm is very simple.

So I hacked together a (VERY PRE-ALPHA) version
of the basic algorithm in Python.

The key features of SiSiSi are:
- It will only hyphenate words if it is sure about what it does
  (no hyphenation of unknown words or if it is not clear where to hyphenate).
- It knows about word roots, prefixes and suffixes, for example
  the word Silbentrennung is made up of the word roots "Silbe" and "Trenn"
  with the suffixes "n" and "ung".
- It allows for different rankings of hyphenation points, for example
  Sil_ben=tren-nungs=ver_fahr-en.
  = : Very good break
  - : Medium break
  _ : Last resort
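
To show how such a marked-up word could be consumed by a program, here is a
small sketch that turns the notation into the plain word plus a list of ranked
break positions. The function name parse_marked and the (index, mark) tuples
are my own invention for illustration, not part of SiSiSi, and the sketch
assumes the word itself contains no literal "-", "=" or "_".

```python
# Meaning of the three marks used in the notation above.
MARKS = {"=": "very good", "-": "medium", "_": "last resort"}


def parse_marked(marked):
    """Return (plain_word, [(index, mark), ...]) for a marked-up word.

    index is the position in the plain word where the break occurs,
    i.e. the word splits into plain_word[:index] and plain_word[index:].
    """
    letters, points = [], []
    for ch in marked:
        if ch in MARKS:
            points.append((len(letters), ch))
        else:
            letters.append(ch)
    return "".join(letters), points


word, points = parse_marked("Sil_ben=tren-nungs=ver_fahr-en")
for index, mark in points:
    print(word[:index], "|", word[index:], "->", MARKS[mark])
```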

Based on my experience so far, I would like the hyphenation routine to:
1) make use of the different rankings by assigning different penalty values
   based on the ranking of the hyphenation position and the unused space in
   the current line;
2) for a given hyphenation point, tell the breakLines routine whether a "shy"
   character (soft hyphen, character 173) should be added to the first part
   or not.
   Example: When hyphenating after a "-", there shouldn't be a shy character.
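
One possible way to combine the two costs from point 1 is sketched here. The
penalty numbers, the slack_cost weight and the function name choose_break are
pure assumptions for illustration; width is again counted in characters, and
the shy character is assumed to take one character of room.

```python
SHY = chr(173)  # soft hyphen

# Hypothetical penalties per ranking mark: lower is better.
PENALTY = {"=": 0, "-": 20, "_": 50}


def choose_break(word, points, room, slack_cost=1):
    """Pick the cheapest break point whose first part fits into room.

    points is a list of (index, mark) tuples, mark being "=", "-" or "_".
    Cost = ranking penalty + slack_cost * unused characters on the line.
    Returns (head_with_shy, tail), or None if no break point fits.
    """
    best = None
    for index, mark in points:
        used = index + 1  # first part plus the shy character
        if used > room:
            continue
        cost = PENALTY[mark] + slack_cost * (room - used)
        if best is None or cost < best[0]:
            best = (cost, index)
    if best is None:
        return None
    index = best[1]
    return word[:index] + SHY, word[index:]


points = [(3, "_"), (6, "="), (10, "-"), (15, "="), (18, "_"), (22, "-")]
print(choose_break("Silbentrennungsverfahren", points, room=12))
```

With these made-up numbers, a room of 12 prefers the very good break after
"Silben" over the medium break after "Silbentren", even though the latter
fills the line better. Tuning the weights changes that trade-off.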

As a first step, we should only define an interface for the hyphenation
class and a base implementation that only hyphenates after "-" characters.
Other implementations can then use pyhnj or SiSiSi or whatever.
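
Just to make the idea concrete, such an interface might look roughly like
this. Nothing here exists in ReportLab: the names BreakPoint, Hyphenator,
DashHyphenator and split_at are hypothetical, and the rank and add_shy fields
are only one way to carry the two pieces of information discussed above.

```python
from dataclasses import dataclass
from typing import List, Tuple

SHY = chr(173)  # soft hyphen


@dataclass(frozen=True)
class BreakPoint:
    index: int     # word splits into word[:index] and word[index:]
    rank: int      # 0 = very good, larger numbers = worse
    add_shy: bool  # should a shy character be added to the first part?


class Hyphenator:
    """Hypothetical interface: subclasses report possible break points."""

    def break_points(self, word: str) -> List[BreakPoint]:
        raise NotImplementedError


class DashHyphenator(Hyphenator):
    """Base implementation: breaks only after "-" characters."""

    def break_points(self, word: str) -> List[BreakPoint]:
        # Skip a trailing "-": breaking there would leave an empty tail.
        return [BreakPoint(i + 1, 0, False)
                for i, ch in enumerate(word[:-1]) if ch == "-"]


def split_at(word: str, bp: BreakPoint) -> Tuple[str, str]:
    """Apply a break point, adding the shy character only when asked to."""
    head = word[:bp.index] + (SHY if bp.add_shy else "")
    return head, word[bp.index:]


for bp in DashHyphenator().break_points("hyphenation-example"):
    print(split_at("hyphenation-example", bp))
```

A pyhnj- or SiSiSi-based class would then only have to implement
break_points, and the line-breaking code would not need to know which one it
is talking to.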

If someone is interested in my results so far, I could send the files
paragraph.py, longtables.py and the experimental wortzerlegung.py.

BTW, is there any "contrib" place reachable from reportlab.org where one
could post such files?

Henning