[reportlab-users] Hyphenation

Mon Jun 18 05:29:36 EDT 2018

Dinu Gherman <gherman at darwin.in-berlin.de> writes:

> I was browsing a few hours ago on “python hyphenation” and found some stuff
> I was not aware of, like http://pyphen.org.

Thank you Dinu,

The pyphen API is so straightforward that I could not resist trying to inject
it into the process, so I spent an hour this morning writing a quick-and-dirty
hack that can already handle the simplest case.

I wrote a PyphenParagraph class that accepts a "hyphenator" instance in its
constructor, overrides the "breakLines()" method and extends the "split()"
method. Whenever "breakLines()" meets a word that does not fit in the
available space, it calls a new "hyphenateWord()" method; on success this
returns a (headWord, tailWord) pair, which is pushed back onto the front of
the "words" list.

Basically:

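    from reportlab.pdfbase.pdfmetrics import stringWidth
    from reportlab.platypus.paragraph import Paragraph, _SplitText

    class PyphenParagraph(Paragraph):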
        def __init__(self, *args, hyphenator=None, **kwargs):
            self.hyphenator = hyphenator
            super().__init__(*args, **kwargs)

        def split(self, availWidth, availHeight):
            # Propagate the hyphenator to the split paragraphs: the parent's split()
            # uses "self.__class__(foo, bar, spam=eggs)" to create them...
            pair = super().split(availWidth, availHeight)
            if pair:
                pair[0].hyphenator = pair[1].hyphenator = self.hyphenator
            return pair

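        def hyphenateWord(self, word, availWidth, fontName, fontSize):
            # pyphen's iterate() yields (head, tail) pairs, longest head first,
            # so the first head that fits is the longest one that fits.
            for head, tail in self.hyphenator.iterate(word):
                head += '-'
                width = stringWidth(head, fontName, fontSize, self.encoding)
                if width <= availWidth:
                    # Mark the head as _SplitText so it is not hyphenated again
                    return _SplitText(head), tail
            return None  # no hyphenation point fits in the available width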

        def breakLines(self, width):
            ... # untouched code up to
                while words:
                    word = words.pop(0)
                    #this underscores my feeling that Unicode throughout would be easier!
                    wordWidth = stringWidth(word, fontName, fontSize, self.encoding)
                    newWidth = currentWidth + spaceWidth + wordWidth
                    if newWidth>maxWidth:
                        if self.hyphenator is not None and not isinstance(word, _SplitText):
                            pair = self.hyphenateWord(word, maxWidth - spaceWidth - currentWidth,
                                                      fontName, fontSize)
                            if pair is not None:
                                words[0:0] = pair
                                continue
                        ... # untouched code till the end
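
The push-back idea can also be tried outside ReportLab. What follows is only
a toy sketch, not the code from my script: StubHyphenator, SplitHead,
text_width(), hyphenate_word() and break_lines() are hypothetical names, the
stub merely imitates the (head, tail) pairs that pyphen's iterate() yields,
and every character is assumed to be one unit wide.

```python
class SplitHead(str):
    """Marks a hyphenated head so it is never hyphenated a second time."""


class StubHyphenator:
    """Hypothetical stand-in for a pyphen-like object."""

    def __init__(self, table):
        self.table = table  # word -> list of allowed break positions

    def iterate(self, word):
        # Yield (head, tail) pairs, longest head first.
        for pos in sorted(self.table.get(word, ()), reverse=True):
            yield word[:pos], word[pos:]


def text_width(text):
    # Toy metric (assumption): each character is one unit wide.
    return len(text)


def hyphenate_word(hyphenator, word, avail_width):
    """Return (head, tail) for the longest head that fits, else None."""
    for head, tail in hyphenator.iterate(word):
        head += '-'
        if text_width(head) <= avail_width:
            return SplitHead(head), tail
    return None


def break_lines(words, max_width, hyphenator):
    """Greedy line breaking with a hyphenation fallback."""
    words = list(words)
    lines, current, current_width = [], [], 0
    while words:
        word = words.pop(0)
        space = 1 if current else 0
        new_width = current_width + space + text_width(word)
        if new_width <= max_width:
            current.append(word)
            current_width = new_width
            continue
        pair = None
        if hyphenator is not None and not isinstance(word, SplitHead):
            avail = max_width - current_width - space
            pair = hyphenate_word(hyphenator, word, avail)
        if pair is not None:
            words[0:0] = pair  # push head and tail back, then retry
            continue
        if current:
            # Close the current line and retry the word on a fresh one.
            lines.append(' '.join(current))
            current, current_width = [], 0
            words.insert(0, word)
        else:
            # Overlong, unsplittable word on an empty line: let it overflow.
            lines.append(word)
    if current:
        lines.append(' '.join(current))
    return lines


hyph = StubHyphenator({'lesser': [3], 'general': [3, 5], 'license': [2, 5]})
lines = break_lines('the lesser general public license'.split(), 15, hyph)
print('\n'.join(lines))
```

With a width of 15 units, "general" and "license" are both split at the
shorter of their two break points, because the longer head does not fit in
what is left of the line.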

However, I must be missing something about the "width" argument: for example,
with an ImageAndFlowables it clearly uses the wrong width in the "second"
part (where the image ends, so there is a wider space available)...

Anyway, before going any further with my experiments, I would like to know
whether I am on the right track, to avoid wasting energy :-)

Here is my script: https://gist.github.com/lelit/9c1cba52fd6dd9f1123fe82ce4b788db

It obviously requires a "pip install pyphen" and a copy of RL's
tests/pythonpowered.gif. Executing it produces a simple document with two
paragraphs: the first has an image in its top left corner, the second is a
plain paragraph. The latter is correct, while in the former you can spot a
"bogus" hyphenation in the "Les-ser GPL" line...

Thanks in advance for any hint,
ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
lele at metapensiero.it  |                 -- Fortunato Depero, 1929.