[reportlab-users] fixing bugs in 2.2 / Hyphenation support

henning.vonbargen at arcor.de henning.vonbargen at arcor.de
Fri Sep 5 17:27:02 EDT 2008


Regarding the licence:

Since I wrote all of the wordaxe code alone,
I think I am allowed to release it using another licence.
The modified BSD licence used by ReportLab seems ok for me,
so feel free to include it with ReportLab.

However, I took the dictionaries used for the pyhnj hyphenation
from OpenOffice - they are licensed under the GNU LGPL.
I don't know if this is problematic...

Maybe Matthias Klose can shed some light on these licence details?


Now from a technical point of view:

It was difficult for me to create a minimally invasive integration
of the wordaxe hyphenation library into ReportLab.
The best solution I could think of still required a minimal change
to rl_codecs.py.

==> It would help a lot if my version of rl_codecs.py could make
it into the RL repository. I can hardly imagine how this could
break existing code, but who knows... This should be reviewed by
the RL people.

Let me explain how the hyphenation works:
The line-breaking algorithm in the standard RL paragraph
implementation scans the current line word by word.
If the next word does not fit into the current line, it will be
pushed to the next line (even if it is wider than the available
width - this results in an ugly overflow).
The modified algorithm used in the wordaxe paragraph class will
hyphenate the word instead and consider the following options:
* squeeze the word into the current line (resulting in slightly
too narrow inter-word spacing)
* hyphenate the word at one of the possible hyphenation points,
appending the left part to the current line and pushing the
right part to the next line
* push the whole word to the next line
* if the current line is empty and there are no suitable
hyphenation points, create a (wrong) hyphenation point such that
as many letters as possible fit into the current line, pushing
the rest to the next line (this avoids overflowing for very long
unknown/nonsense words).
For each of these options, a rating is computed, and the option
with the best rating is chosen.
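
The options above can be sketched roughly as follows. This is only an
illustration, not the wordaxe code: it assumes every character is one
unit wide, and the names Option, rate_options and best_option as well
as all rating values are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Option:
    kind: str      # "squeeze", "hyphenate", "push" or "force"
    head: str      # text that stays on the current line
    tail: str      # text carried over to the next line
    rating: float  # higher is better (values are arbitrary assumptions)


def rate_options(line_width, used, word, points, max_squeeze=1):
    """Collect rated options for a word that does not fit.

    line_width: total width of the line; used: width already occupied;
    points: indexes inside word where a hyphen may be inserted.
    """
    free = line_width - used        # width still available
    overshoot = len(word) - free    # > 0 means the word does not fit
    options = []
    if 0 < overshoot <= max_squeeze:
        # Squeeze: tolerate a small overshoot, penalised per unit.
        options.append(Option("squeeze", word, "", -2.0 * overshoot))
    for p in points:
        head = word[:p] + "-"
        if len(head) <= free:
            # Hyphenate: penalise unused space plus a small hyphen cost.
            rating = -(free - len(head)) - 0.5
            options.append(Option("hyphenate", head, word[p:], rating))
    if used > 0:
        # Push: all of the remaining space on this line is wasted.
        options.append(Option("push", "", word, -float(free)))
    elif not any(o.kind == "hyphenate" for o in options):
        # Empty line, no usable point: force a (wrong) break.
        cut = max(1, free - 1)
        options.append(Option("force", word[:cut] + "-", word[cut:], -10.0))
    return options


def best_option(line_width, used, word, points):
    return max(rate_options(line_width, used, word, points),
               key=lambda o: o.rating)
```

For example, best_option(20, 14, "hyphenation", [2, 6]) picks the
"hyphenate" option with head "hy-", because "hyphen-" no longer fits
into the six remaining units and pushing the whole word wastes more
space.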

As you can see, the idea is actually quite simple.

The hyphenation library supports a concept of hyphenation point
quality, which is taken into account by the rating function.
For example, whereas "Urin-stinkt" is a valid hyphenation of
"Urinstinkt" (primal instinct), it is better to use "Ur-instinkt",
even if "Urin-" would fit into the current line.
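
A small sketch of how quality could enter the rating. The function
name pick_hyphenation, the (index, quality) representation and the
quality numbers are assumptions for illustration only; the real
wordaxe rating function may work differently.

```python
def pick_hyphenation(word, points, free, quality_weight=10.0):
    """Choose a hyphenation point, preferring quality over fill.

    points: list of (index, quality) pairs, higher quality is better.
    free: width left on the current line (one unit per character).
    Returns (head, tail), or None if no point fits.
    """
    best = None
    for index, quality in points:
        head = word[:index] + "-"
        if len(head) > free:
            continue  # this part would not fit into the current line
        # Quality dominates; the fill level only breaks ties.
        rating = quality_weight * quality + len(head)
        if best is None or rating > best[0]:
            best = (rating, head, word[index:])
    return None if best is None else (best[1], best[2])


# Hypothetical qualities: "Ur-instinkt" is good (9), "Urin-stinkt" poor (1).
points = [(2, 9), (4, 1)]
print(pick_hyphenation("Urinstinkt", points, free=6))
```

With these made-up numbers the result is ("Ur-", "instinkt"), although
"Urin-" would fill the line better.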

In order to hyphenate a word, the word's language must be known.
(From a linguistic point of view, the context would be needed as
well, but I think this is out of reach for the next few years...)

And there may be different hyphenator implementations even for a
single language. For example, the hyphenation quality of the
dictionary-based wordaxe DCWHyphenator (given a good dictionary)
is by far better than that of the pattern-based PyHnjHyphenator,
but on the other hand PyHnjHyphenator is a lot faster.

And maybe you want to use automatic hyphenation for one paragraph
but not for the other.

So I decided for wordaxe that I needed at least two properties
(at the paragraph level):
* Whether to use automatic hyphenation at all
* The language

Unfortunately, the standard RL ParagraphStyle does not accept
attributes it does not know. That's why I needed a derived
ParagraphStyle class as well.

==> It would be nice if these two attributes could make it into
the standard RL repository:
'language':None,
'hyphenation':False,
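
The pattern can be illustrated without ReportLab. BaseStyle below is a
stand-in that only mimics the behaviour described (it rejects unknown
attributes); it is not the real ParagraphStyle, and HyphenatingStyle
is a hypothetical name.

```python
class BaseStyle:
    """Stand-in for a style class that only accepts known attributes."""
    defaults = {"fontName": "Helvetica", "fontSize": 10}

    def __init__(self, name, **kw):
        self.name = name
        self.__dict__.update(self.defaults)
        for key, value in kw.items():
            if key not in self.defaults:
                raise AttributeError("unknown style attribute: %s" % key)
            setattr(self, key, value)


class HyphenatingStyle(BaseStyle):
    # Extend the inherited defaults with the two proposed attributes.
    defaults = dict(BaseStyle.defaults, language=None, hyphenation=False)


style = HyphenatingStyle("body", language="DE", hyphenation=True)
print(style.language, style.hyphenation, style.fontSize)
```

If the two attributes were part of the base defaults, the derived
style class would not be needed at all.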

==> I made a very small change in graphdocpy.py:
    headerline = string.join(canvas.headerLine, ' \xc2\x8d '.decode('utf8'))
I added the decode(...) here. I don't think this is important,
as it's only used for generating the header line of the docco,
but perhaps it could go into the RL standard.

==> I also changed paraparser.py slightly:
Now you can set the parser's encoding explicitly, which allows
parsing with encodings other than "utf-8". I don't remember
why this was necessary, but I think it was for using some symbols.
I don't think it would break existing code, so it would be nice
if this could make it into RL standard, too.

In xpreformatted.py I had to make a very small change to supply
the base class Paragraph with an encoding:
    class XPreformatted(_orig_preformatted):
        def __init__(self, text, style, bulletText=None, frags=None,
                     caseSensitive=1, dedent=0, encoding='utf8'):
            self.encoding = encoding
            ...
==> This shouldn't break existing code and could perhaps make
it into the RL standard, too.

Last but not least: The derived Paragraph class wordaxe.rl.Paragraph.
Now, while the basic idea is simple, this turned out to be very
hard work. The problem is about str/unicode conversion. And the
wordaxe paragraph implementation still has one major bug: When
the paragraph is split across frames, the first frame is rendered
correctly, but the next frame is more or less rubbish.

I recently changed the HyphenatedWord class to derive from
unicode, with a __str__ method that returns a UTF-8 string.
I hope that this modification will allow me to rewrite the
paragraph implementation from scratch (or, to be more precise,
starting from the standard paragraph.py) in a cleaner way:
The goal is to ONLY change the breakLines method.
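
A rough Python 3 analogue of this idea, where str plays the role of
unicode and __bytes__ the role of the UTF-8-returning __str__. The
attribute name "hyphenations" and the constructor signature are
assumptions, not the actual wordaxe interface.

```python
class HyphenatedWord(str):
    """A text string that also carries its hyphenation points."""

    def __new__(cls, text, hyphenations=()):
        self = super().__new__(cls, text)
        # Assumed representation: indexes where a hyphen may be inserted.
        self.hyphenations = tuple(hyphenations)
        return self

    def __bytes__(self):
        # Byte form for code that still expects encoded strings.
        return self.encode("utf-8")


word = HyphenatedWord("Urinstinkt", [2, 4])
print(len(word), word.hyphenations, bytes(word))
```

Because the object is itself a string, code that measures or
concatenates text can treat it like any other word, while the
line-breaking code can still look up the hyphenation points.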

----------------------
To make things short:

* The licence shouldn't be an issue, please feel free to
use the wordaxe code in the RL toolkit under the
(3-clause) BSD licence.
* However, some dict files are from OpenOffice and under LGPL;
these are needed for the PyHnjHyphenator but not for other
hyphenators.
* Please try to integrate the changes marked with "==>" into
the RL 2.2 release - or at least RL 2.2.x a few days later!
(I can imagine hearing Dinu saying: "days, not years")
* This will allow me to concentrate on the derived Paragraph
class in order to fix the splitting bug etc.
* From my point of view, due to the serious bug mentioned above,
and in order to not further delay the 2.2 release, this is the
best solution.

To make things even shorter (in Python syntax):
    def bestSolution():
        for f in os.listdir("wordaxe/rl"):
            if f == "paragraph.py":
                continue
            merge_changes_into_RL_2_2(f)

Wow, this is probably my longest posting ever!

Henning
http://sourceforge.net/projects/deco-cow/



