[reportlab-users] RE: Really long words (Chad Miller)

Wed Jun 29 03:10:32 EDT 2005

> From: "Chad Miller" <Chad.Miller at veritas.com>
> Subject: RE: [reportlab-users] Really long words
> To: "Support list for users of Reportlab software"
> 	<reportlab-users at reportlab.com>
> Message-ID:
> 	
> <EC41F8507437734C9D3679FEBC902E6620AB1B at hroxchcln1.enterprise.
> veritas.com>
> 	
> Content-Type: text/plain;	charset="UTF-8"
> 
> > From: Bogdan Maryniuk [mailto:bo at bitute.b4net.lt]
> > Sent: Tuesday, 28 June, 2005 10:18
> > [...]
> > May I ask you what *exactly* hyphenation you're talking about 
> > thus and thus
> > should be implemented in RL? I mean, English? Russian? 
> > Japanese? Zulu?..
> 
> Yes.  (Perhaps not Japanese or other ideographic languages.)
> 
> Your tone suggests that you think it's not a solved problem, BM.  
> Read up on TeX.
> 
> - chad

Chad,

the problem is not so much building a word-breaking feature into
reportlab paragraphs, but indeed the hyphenation itself.

One of the non-serious efforts was mine,
you can still download it from deco-cow.sourceforge.net,
but be warned: I won't be doing any further development there.
However, if you like the idea, I'd be glad to make you a
project admin there.

In the deco-cow project I developed two algorithms
and an interface:
The interface allows hyphenation of single words, without
the context (sentence) being known. This works well for most
languages; however, I know that in some languages the hyphenation
of some words DOES depend on the context.
The first algorithm does the word-breaking in ReportLab paragraphs.
The second algorithm is the hyphenation algorithm itself. It
works by decomposition of compound words, which is where the
name deco-cow comes from. This approach is very useful for languages
like German, where you compose words from simple words, like
"Silbentrennung": Sil-be = syllable, Tren-nung = separation.
The algorithm only hyphenates the words it knows.
This reduces wrong hyphenations (I found a handful in Open Office and
LaTeX with a real-world example text when the paragraph width is low),
but on the other hand, it requires quite a large base list of
simple words.
This leads to three problems:
- I do not have the time (nor patience) to hack in a suitable amount
  of the simple words.
- If I had a big list of simple words, it would require more memory
  and (a little) more run-time.
- A lot of German verbs are irregular ("strong" verbs), like the
  English verb "to go". All of their forms would have to be in the
  base word list:
  "go", "went", "gone".
  German:
  "geh"(en), "ging", (ge)"gangen"

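The decomposition idea described above can be sketched roughly as
follows. This is only an illustration and not the deco-cow code or its
API: the names BASE_WORDS and hyphenate are made up, and the tiny word
list is an assumption. In the result, "-" marks a break inside a simple
word and "=" marks the boundary between two simple words.

```python
# Hypothetical base word list: each simple word is stored pre-hyphenated.
# "silben" is listed next to "silbe" because the compound uses a linking "n".
BASE_WORDS = {
    "silbe": "sil-be",
    "silben": "sil-ben",
    "trennung": "tren-nung",
}


def hyphenate(word, base_words=BASE_WORDS):
    """Return the word with break marks, or None if it is not fully known.

    The word is only hyphenated when it can be decomposed completely
    into entries of base_words; otherwise it is left alone (None).
    """
    lower = word.lower()

    def decompose(start):
        # Returns a list of base word entries covering lower[start:],
        # or None if no complete decomposition exists.
        if start == len(lower):
            return []
        # Try the longest candidate first, fall back to shorter ones.
        for end in range(len(lower), start, -1):
            entry = base_words.get(lower[start:end])
            if entry is not None:
                rest = decompose(end)
                if rest is not None:
                    return [entry] + rest
        return None

    entries = decompose(0)
    if not entries:
        return None
    pattern = "=".join(entries)
    # Copy the original letters (and their case) into the marked pattern.
    letters = iter(word)
    return "".join(c if c in "-=" else next(letters) for c in pattern)


print(hyphenate("Silbentrennung"))
print(hyphenate("Worttrennung"))
```

The second call returns None because "wort" is not in the list. That is
the trade-off mentioned above: fewer wrong hyphenations, but every
simple word (and every irregular verb form) has to be entered.
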
However, for special purposes the approach works quite well.
For example, the German words for chemicals are usually quite long,
such as the word for RNA: "Ribonukleinsäure" (Ri-bo=nuklein=säu-re).
So if you want to create a database report and you know most of the
long words will be fixed data from the database like those chemical
terms, then just put the simple words that make up the chemical terms
into the base word list.
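
Once the break points of such a term are known, breaking it to fit a
narrow column is simple. The following is a minimal sketch under some
assumptions: split_to_fit is a hypothetical helper, not part of
ReportLab or deco-cow, and the width is counted in characters to keep
the example self-contained (a real paragraph would measure the string
in the current font instead, for example with ReportLab's
pdfmetrics.stringWidth).

```python
def split_to_fit(marked, max_chars):
    """Split a marked word so that the first part fits into max_chars.

    marked is a word with its break points marked by "-" or "=",
    e.g. "Ri-bo=nuklein=säu-re". Returns (head, tail), where head ends
    with a hyphen and is the longest prefix that fits, or None if the
    word cannot be broken within max_chars.
    """
    pieces = marked.replace("=", "-").split("-")
    best = None
    for i in range(1, len(pieces)):
        head = "".join(pieces[:i]) + "-"
        if len(head) <= max_chars:
            # Heads only get longer, so the last one that fits is the longest.
            best = (head, "".join(pieces[i:]))
    return best


print(split_to_fit("Ri-bo=nuklein=säu-re", 13))
```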

The interface also allows using other hyphenation algorithms,
for example the one used in TeX. However, I found that one did not
work reliably with Python.
For further details, have a look at the project.
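
To illustrate what such a pluggable, single-word interface might look
like, here is a rough sketch. All class and function names are
hypothetical and are not taken from deco-cow or ReportLab. The only
assumption is that a hyphenator receives one word without context and
returns the positions at which it may be broken.

```python
from abc import ABC, abstractmethod


class Hyphenator(ABC):
    """Hypothetical interface: hyphenate one word, without its context."""

    @abstractmethod
    def hyphenate(self, word):
        """Return a list of indices at which word may be broken."""


class NoHyphenator(Hyphenator):
    """Never breaks a word; a safe fallback."""

    def hyphenate(self, word):
        return []


class WordListHyphenator(Hyphenator):
    """Knows only the words it was given, e.g. ["Tren-nung", "Sil-be"]."""

    def __init__(self, entries):
        self._breaks = {}
        for entry in entries:
            positions, count = [], 0
            # Every part except the last one ends at a break position.
            for part in entry.split("-")[:-1]:
                count += len(part)
                positions.append(count)
            self._breaks[entry.replace("-", "").lower()] = positions

    def hyphenate(self, word):
        return list(self._breaks.get(word.lower(), []))


def break_candidates(word, hyphenator):
    """All (head, tail) pairs a paragraph could choose from."""
    return [(word[:i] + "-", word[i:]) for i in hyphenator.hyphenate(word)]


print(break_candidates("Trennung", WordListHyphenator(["Tren-nung"])))
print(break_candidates("Trennung", NoHyphenator()))
```

Any other algorithm, a TeX-style pattern hyphenator for instance, could
be plugged in behind the same hyphenate method without the paragraph
code having to change.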

Henning