[reportlab-users] Validating PDF (and more) with JHOVE

Dinu Gherman gherman at darwin.in-berlin.de
Tue Oct 2 06:20:26 EDT 2007


Hi all,

I'm not sure whether people here know about JHOVE [1]; at least it isn't
mentioned in the archives. Since it seems to be a very useful tool for
PDF developers, I'll mention it here.

Briefly, "JHOVE provides functions to perform format-specific
identification, validation, and characterization of digital objects."
It works on many formats, including PDF, and its PDF module [2] gives
interesting information about PDF internals in reports like this:

dinu$ ./jhove -c conf/jhove.conf -m pdf-hul -k iDVD_6_Getting_Started.pdf
Jhove (Rel. 1.1 (pre-release g), 2007-08-30)
 Date: 2007-10-02 11:59:38 CEST
 RepresentationInformation: iDVD_6_Getting_Started.pdf
  ReportingModule: PDF-hul, Rel. 1.5 (2007-05-30)
  LastModified: 2006-07-04 16:26:00 CEST
  Size: 1040472
  Format: PDF
  Version: 1.5
  Status: Well-Formed, but not valid
  SignatureMatches:
   PDF-hul
  ErrorMessage: Invalid outline dictionary item
   Offset: 985826
  MIMEtype: application/pdf
  PDFMetadata:
   Objects: 1445
   FreeObjects: 1
   IncrementalUpdates: 0
   DocumentCatalog:
    ViewerPreferences:
     HideToolbar: false
     HideMenubar: false
     HideWindowUI: false
     FitWindow: true
     CenterWindow: false
     DisplayDocTitle: false
     NonFullScreenPageMode: UseNone
     Direction: L2R
     ViewArea: CropBox
     ViewClip: CropBox
     PrintArea: CropBox
     PageClip: CropBox
    PageLayout: SinglePage
    PageMode: UseOutlines
[...much more stripped off...]

As you can see, it also reports errors, although I'm not sure it can
report all errors in a file at once. So far I have seen only one error
reported per file.
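
One way to check how many errors a given report carries is to collect
its ErrorMessage lines. Here is a minimal Python sketch, assuming the
text report format shown above (one "Key: value" pair per line, with an
"Offset:" line directly after each "ErrorMessage:" line). The function
name extract_errors is made up for illustration and is not part of
JHOVE.

```python
import re

# Match "ErrorMessage: <text>" and "Offset: <number>" lines,
# ignoring any leading indentation in the report.
_ERR = re.compile(r"^\s*ErrorMessage:\s*(.*)$")
_OFF = re.compile(r"^\s*Offset:\s*(\d+)\s*$")


def extract_errors(report):
    """Return a list of (message, offset) tuples from a JHOVE text report.

    offset is None when no Offset line directly follows the message.
    """
    errors = []
    lines = report.splitlines()
    for i, line in enumerate(lines):
        m = _ERR.match(line)
        if not m:
            continue
        offset = None
        if i + 1 < len(lines):
            o = _OFF.match(lines[i + 1])
            if o:
                offset = int(o.group(1))
        errors.append((m.group(1).strip(), offset))
    return errors


if __name__ == "__main__":
    # Fragment copied from the report above.
    sample = (
        "  Status: Well-Formed, but not valid\n"
        "  ErrorMessage: Invalid outline dictionary item\n"
        "   Offset: 985826\n"
        "  MIMEtype: application/pdf\n"
    )
    print(extract_errors(sample))
```

If the returned list never has more than one entry across many files,
that would support the impression that only the first error gets
reported.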

Running JHOVE, via a small wrapper script, over some standard ReportLab
docs gives the following picture (widen your window to see each table
row on a single line):

dinu$ fi.py -a producer:status:errmsgext reportlab_2_1/reportlab/docs/*.pdf
producer                             status           errmsgext                         file
Evaluation copy of RML2PDF htt[...]  Not well-formed  Malformed dict. (Offset: 357892)  RML_UserGuide.pdf
Evaluation copy of RML2PDF htt[...]  Not well-formed  Malformed dict. (Offset: 957654)  diagradoc.pdf
ReportLab http://www.reportlab.com   Not well-formed  Malformed dict. (Offset: 152911)  graphguide.pdf
ReportLab http://www.reportlab.com   Not well-formed  Malformed dict. (Offset: 492792)  graphics_reference.pdf
ReportLab http://www.reportlab.com   Not well-formed  Malformed dict. (Offset: 79907)   reference.pdf
ReportLab http://www.reportlab.com   Not well-formed  Malformed dict. (Offset: 376267)  userguide.pdf
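
For anyone wanting to produce a similar overview, here is a rough Python
sketch of such a wrapper. It is not the fi.py script used above. It only
assumes the jhove command line shown earlier in this message (-c and
-m pdf-hul) and the "Key: value" text report. The names JHOVE_CMD,
summarize and run_jhove are made up, and that the report contains a
"Producer:" line is an assumption as well.

```python
import subprocess
import sys

# Assumed launcher and config paths, taken from the command shown earlier.
JHOVE_CMD = ["./jhove", "-c", "conf/jhove.conf", "-m", "pdf-hul"]

# Report keys to pick up; "Producer" is assumed to be present in the report.
FIELDS = ("Producer", "Status", "ErrorMessage", "Offset")


def summarize(report):
    """Return the first occurrence of each key in FIELDS from a text report."""
    found = {}
    for line in report.splitlines():
        # Split at the first colon only, so URLs in the value stay intact.
        key, sep, value = line.strip().partition(":")
        if sep and key in FIELDS and key not in found:
            found[key] = value.strip()
    return found


def run_jhove(path):
    """Run JHOVE on one file and return its text report (stdout)."""
    result = subprocess.run(JHOVE_CMD + [path], capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # One tab-separated row per file given on the command line.
    for path in sys.argv[1:]:
        info = summarize(run_jhove(path))
        print("\t".join(info.get(f, "-") for f in FIELDS) + "\t" + path)
```

Something like "python jhove_table.py reportlab_2_1/reportlab/docs/*.pdf"
(the script name is hypothetical) would then print one row per file.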

I'm not sure whether this is a reason to worry: I also found PDFs
generated by other software, including Adobe software, showing similar
errors. In any case, JHOVE seems to be a useful tool to have in your
toolbox.

Regards,

Dinu

[1] http://hul.harvard.edu/jhove/
[2] http://hul.harvard.edu/jhove/pdf-hul.html


