[reportlab-users] ANN: wordaxe-0.2.6 released

henning.vonbargen at arcor.de
Tue Sep 16 17:18:33 EDT 2008


(This is a single reply to several posts)

Roberto Alsina wrote:


> It even downloads the dictionaries from the web.
> That's going to be tricky to support.


It's a feature of pyhyphen. You don't have to use it,
of course. In order to change this behaviour, you
could change the download URL in the file
lib/site-packages/hyphen/config.py:

# default_dic_path = 'c:\python25\lib\site-packages\hyphen'
# default_repository = 'http://ftp.services.openoffice.org/pub/OpenOffice.org/contrib/dictionaries/'
default_repository = 'disabled://auto-download/of/dictionary-files/from/OOO'
-------------

Andy Robinson wrote:


> If we're going to refactor or replace paragraph code, we need to either
> (a) make it easy to plug in or switch the 'wrapping algorithm', or
> (b) create a Grand Unified Paragraph which does the right things
> for all languages.


It would be a small step for you but a great step for developer-kind
if at least the Paragraph class was more object-oriented.
The current implementation uses a bunch of private functions that are
NOT part of the Paragraph class - which forces code duplication
in derived classes:

In the current wordaxe/rl/paragraph.py, I had to copy quite a lot
of reportlab/platypus/paragraph.py just because I needed a modified
"join" function.

It would really help a lot if these helper functions could be made
(static) methods of the Paragraph class, so that a derived class
could override just the ones it needs to change.
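
To illustrate the kind of refactoring I mean, here is a minimal sketch.
The class and method names (join_words, render_line, HyphenatingParagraph)
are made up for this example and are NOT ReportLab's or wordaxe's actual
code; the point is only that a helper looked up via self can be replaced
in a subclass without copying the rest of the module.

```python
class Paragraph:
    """Stand-in for a paragraph class; not ReportLab's real API."""

    @staticmethod
    def join_words(words):
        # Hypothetical helper: join the words of one line with spaces.
        return " ".join(words)

    def render_line(self, words):
        # The helper is looked up via self, so a subclass can replace it.
        return self.join_words(words)


class HyphenatingParagraph(Paragraph):
    @staticmethod
    def join_words(words):
        # Only the helper changes: a fragment ending in "-" is glued to
        # the next one without a space. render_line is inherited as-is.
        joined = "".join(w if w.endswith("-") else w + " " for w in words)
        return joined.rstrip()


print(Paragraph().render_line(["wrap", "me"]))
print(HyphenatingParagraph().render_line(["wrap", "hy-", "phen"]))
```

With module-level private functions, the second class would have to
duplicate render_line (and everything else that calls the helper).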

Whether (a) or (b) ... both options seem difficult.
But the slight refactoring I proposed above wouldn't hurt anyway.
And perhaps it's not only the wrapping algorithm.
Other languages are written right-to-left (or even top-down, then
left-to-right), so the concept of a "line" could be horizontal or vertical...

The next thing to consider for refactoring is string/unicode.
The current implementation mixes them more or less freely,
which complicates things a bit. For Python 3 there will be no choice
anyway (I think), 'cause unicode will be the default to represent text.
Some time ago, there were concerns about memory consumption,
but frankly I don't think it's an issue - the Python overhead
(__dict__, ref counting and so on) outweighs it anyway in most cases.

So, why not turn everything into unicode at the paraparser level
and then encode to 8-bit only when actually writing the PDF?
If not for RL 2.2.1, then you should schedule it for one of the next
RL releases.
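
A rough sketch of that "decode early, encode late" idea follows. It is
written in Python 3 style (where str is already unicode), and the function
names parse_markup, layout and write_lines are invented for illustration;
they do not correspond to anything in ReportLab. The layout step is a toy
greedy wrap by character count, just to have something in the middle.

```python
def parse_markup(raw, encoding="utf-8"):
    # Decode once at the input boundary; downstream code sees text only.
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def layout(text, max_chars):
    # Toy layout that works purely on text (greedy wrap by length).
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and len(candidate) > max_chars:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def write_lines(lines, encoding="latin-1"):
    # Encode to 8-bit only at the output boundary.
    return [line.encode(encoding) for line in lines]


text = parse_markup("Größe der Schiffahrt".encode("utf-8"))
print(write_lines(layout(text, 9)))
```

Everything between the two boundaries can then assume one text type,
which is the simplification I am after.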


> My main worry, not having looked at the code yet, is how easily we can
> create a hybrid para which does all this stuff and also supports Japanese
> word breaking (which ReportLab can do with the cjkwrap options). We're
> rendering lots of Asian pages - go to www.hilton.co.jp and click "e-Pamphlet"
> if you can read katakana ;-)


Unfortunately, my katakana and hiragana knowledge is somewhat limited ;-)

There could well be two or more variations (read: classes) of a "Paragraph"
which differ in
* parsing (while RL understands a subset of HTML,
another class could parse Wiki markup or whatever)
* line-breaking
* reading direction
* rendering
* splitting
Most of this could be handled by mixin classes.
This would require well-documented interfaces/data structures, however.

For example, a SimpleParagraph might just accept text without layout
tags (resulting in fast processing and small PDFs),
while another paragraph class uses classic parsing, CJK-wrapping and
the full-blown renderer.
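
As a sketch of how such a mixin design might look: every class name below
is hypothetical (including SimpleParagraph, which so far exists only in
this mail), and the "right-to-left" mixin merely reverses word order,
which is of course far from real bidirectional text handling. Each mixin
supplies one aspect; the base class only wires them together.

```python
class PlainTextParser:
    def parse(self, text):
        # No markup: the text is just a sequence of unstyled words.
        return text.split()


class GreedyBreaker:
    def break_lines(self, words, max_chars):
        # Fill each line with as many words as fit in max_chars.
        lines, current = [], []
        for word in words:
            if current and len(" ".join(current + [word])) > max_chars:
                lines.append(current)
                current = [word]
            else:
                current.append(word)
        if current:
            lines.append(current)
        return lines


class LeftToRight:
    def order(self, words):
        return list(words)


class RightToLeft:
    def order(self, words):
        # Toy stand-in for a reading direction, not real bidi support.
        return list(reversed(words))


class ParagraphBase:
    def __init__(self, text):
        self.text = text

    def wrap(self, max_chars):
        # parse, break_lines and order come from the mixins.
        words = self.parse(self.text)
        lines = self.break_lines(words, max_chars)
        return [" ".join(self.order(line)) for line in lines]


class SimpleParagraph(PlainTextParser, GreedyBreaker, LeftToRight,
                      ParagraphBase):
    pass


class SimpleRtlParagraph(PlainTextParser, GreedyBreaker, RightToLeft,
                         ParagraphBase):
    pass


print(SimpleParagraph("one two three four").wrap(7))
print(SimpleRtlParagraph("one two three four").wrap(7))
```

The interfaces between the pieces (here: lists of words and lists of
lines) are exactly what would need to be documented first.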

Note: The current RL paragraph.py implementation already uses different
"renderers" internally.

Though probably all this is not going to be easy,
it should be possible to go there step by step, where the first steps
should be documentation and the OO-refactoring I mentioned above.


>
> In brief, certain ranges of characters can be assumed to be
> Asian languages which don't have spaces between the words. Instead
> of hyphenating, you have characters which are good and bad to end a
> line on.
>
> Henning, have you done anything to do with that?


No. But wordaxe.rl.paragraph uses the CJK code
from reportlab.platypus.paragraph.


> Does the Openoffice extension offer any support for Asian languages?


I don't know.
-----------------------------------

Dirk Holtwick wrote:


> I didn't have a closer look to the source codes but what I would like
> most (if it makes sense) is to have the possibility to add something
> like HTML "&shy;" or Unicode U+00AD into the text informations:
>
> <http://en.wikipedia.org/wiki/Soft_hyphen#Hyphens_in_computing>


The Hyphenator classes in wordaxe should support pre-hyphenated words
(containing SHY) out of the box.
However, AFAIK the ReportLab parser doesn't support the "shy" entity.

Should be easy for ReportLab to *optionally* support it in the parser,
though. Optionally, because you don't want it unless the line-breaking
algorithm supports it.
Note: the SHY character should be ignored by the width calculation code.
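
A small sketch of what "ignored by the width calculation" could mean in
practice. The functions visible_width and break_word are hypothetical
(they are neither ReportLab nor wordaxe API), and a fixed width per
character stands in for real font metrics.

```python
SHY = "\u00ad"  # Unicode SOFT HYPHEN


def visible_width(text, char_width=1.0):
    # SHY characters contribute nothing to the measured width.
    return char_width * len(text.replace(SHY, ""))


def break_word(word, available, char_width=1.0):
    """Split at the last SHY whose head plus a visible '-' still fits.

    Returns (head, tail); head is None if no break point fits.
    """
    best = None
    for i, ch in enumerate(word):
        if ch != SHY:
            continue
        head = word[:i].replace(SHY, "") + "-"
        if visible_width(head, char_width) <= available:
            best = (head, word[i + 1:])
    return best if best else (None, word)


word = "beauti\u00adful"
print(visible_width(word))   # 9.0, the SHY is not counted
print(break_word(word, 7))   # ('beauti-', 'ful')
print(break_word(word, 6))   # (None, 'beauti\xadful'), "beauti-" needs 7
```

Note that the hyphen only becomes visible (and only takes up space)
when the break is actually used.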


> If this could be implemented in the paragraph rendering machine the
> hyphenation could be done outside of Reportlab.


This is not quite true, because there is that funny
"non-standard hyphenation".
Adding SHY characters and hiding them where not needed is not
sufficient for these cases.
Fortunately, for the German language, these non-standard cases
disappeared with the "Rechtschreibreform", but imagine the compound
word "Schiffahrt" in the old days. The correct hyphenation was
"Schiff-fahrt" (the third 'f' appearing only if the word is wrapped).
Such non-standard cases still exist in other languages
(e.g. English: "eighteen" becomes "eight-teen" when wrapped).
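
To make clear why a bare SHY is not enough, here is one possible way to
represent such a hyphenation point. This is only a sketch under my own
assumptions: the HyphPoint tuple and apply_point are invented for this
mail and are not the data structure wordaxe actually uses. Besides the
position, each point records how many characters are dropped and what
text is added on either side when the break is taken.

```python
from collections import namedtuple

# index: where to split; skip: characters removed at the split;
# left/right: text added before the hyphen / at the start of the tail.
HyphPoint = namedtuple("HyphPoint", "index skip left right")


def apply_point(word, p):
    head = word[:p.index] + p.left + "-"
    tail = p.right + word[p.index + p.skip:]
    return head, tail


# A standard point (what a SHY can express): nothing dropped or added.
print(apply_point("Paragraph", HyphPoint(4, 0, "", "")))

# Pre-reform German: the third 'f' shows up only when wrapped.
print(apply_point("Schiffahrt", HyphPoint(6, 0, "", "f")))

# Pre-reform German "ck" became "k-k" when wrapped.
print(apply_point("Zucker", HyphPoint(2, 2, "k", "k")))
```

A SHY character in the text can only encode the first case.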


> In my concrete case regarding Pisa <http://www.htmltopdf.org> I do not
> use the Paragraph parser. Instead Pisa directly creates the so called
> "Fragments". To these fragments I would like to be able to add something
> like u"beauti\u00adful".
>
> Andy, do you think this would be possible and make sense?


Hmm, the frag structure is not well documented, so I don't quite
like the idea. But you are in the same situation here as I am with
wordaxe...

BTW, it seems one of the reasons wordaxe fails after splitting is
that frags are converted to another internal structure and then
back again during the splitting process, and when a word has been
hyphenated this conversion produces wrong results. I finally gave up
trying to understand this conversion process...
Maybe Andy can see what's going wrong at a glance?
The effect is visible in the second frame of test_frames3.pdf in
wordaxe's test suite.



