[reportlab-users] Validating PDF (and more) with JHOVE
    Dinu Gherman 
    gherman at darwin.in-berlin.de
       
    Tue Oct  2 06:20:26 EDT 2007
    
    
  
Hi all,
I'm not sure whether people here know about JHOVE [1]; at least it isn't
mentioned in the archives. Since it seems to be a very useful tool for
PDF developers, I'll mention it here.
Briefly, "JHOVE provides functions to perform format-specific
identification, validation, and characterization of digital objects." It
works on many formats, including PDF, and gives interesting information
about PDF internals [2] in reports like this:
dinu$ ./jhove -c conf/jhove.conf -m pdf-hul -k iDVD_6_Getting_Started.pdf
Jhove (Rel. 1.1 (pre-release g), 2007-08-30)
  Date: 2007-10-02 11:59:38 CEST
  RepresentationInformation: iDVD_6_Getting_Started.pdf
   ReportingModule: PDF-hul, Rel. 1.5 (2007-05-30)
   LastModified: 2006-07-04 16:26:00 CEST
   Size: 1040472
   Format: PDF
   Version: 1.5
   Status: Well-Formed, but not valid
   SignatureMatches:
    PDF-hul
   ErrorMessage: Invalid outline dictionary item
    Offset: 985826
   MIMEtype: application/pdf
   PDFMetadata:
    Objects: 1445
    FreeObjects: 1
    IncrementalUpdates: 0
    DocumentCatalog:
     ViewerPreferences:
      HideToolbar: false
      HideMenubar: false
      HideWindowUI: false
      FitWindow: true
      CenterWindow: false
      DisplayDocTitle: false
      NonFullScreenPageMode: UseNone
      Direction: L2R
      ViewArea: CropBox
      ViewClip: CropBox
      PrintArea: CropBox
      PageClip: CropBox
     PageLayout: SinglePage
     PageMode: UseOutlines
[...much more stripped off...]
As you can see, it also indicates errors, although I'm not sure it can
report all errors at once. So far I have seen only one error reported
per file.
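To check how many errors a given report actually contains, the text
output can be scanned for its "ErrorMessage" lines. The following is only
a minimal sketch: parse_report is a hypothetical helper, not part of
JHOVE, and it assumes the "Key: value" layout shown above, with an
optional "Offset" line directly after each "ErrorMessage" line.

```python
def parse_report(text):
    """Return (status, errors) from a JHOVE plain-text report.

    errors is a list of (message, offset) tuples; offset is None when no
    Offset line directly follows the ErrorMessage line.
    Assumption: the report uses the "Key: value" layout shown above.
    """
    status = None
    errors = []
    previous_key = None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            # Not a "Key: value" line (e.g. the version banner).
            previous_key = None
            continue
        value = value.strip()
        if key == "Status":
            status = value
        elif key == "ErrorMessage":
            errors.append((value, None))
        elif key == "Offset" and previous_key == "ErrorMessage":
            # Attach the offset to the error message just seen.
            message, _ = errors[-1]
            errors[-1] = (message, int(value))
        previous_key = key
    return status, errors
```

Feeding it the report above should give the status string plus a
one-element error list, so len(errors) shows directly whether more than
one error was reported for a file.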
Wrapping JHOVE in a small script and running it over some standard
ReportLab docs gives the following picture (widen your window to see
each table row on a single line):
dinu$ fi.py -a producer:status:errmsgext reportlab_2_1/reportlab/docs/*.pdf
producer                             status           errmsgext                         file
Evaluation copy of RML2PDF htt[...]  Not well-formed  Malformed dict. (Offset: 357892)  RML_UserGuide.pdf
Evaluation copy of RML2PDF htt[...]  Not well-formed  Malformed dict. (Offset: 957654)  diagradoc.pdf
ReportLab http://www.reportlab.com   Not well-formed  Malformed dict. (Offset: 152911)  graphguide.pdf
ReportLab http://www.reportlab.com   Not well-formed  Malformed dict. (Offset: 492792)  graphics_reference.pdf
ReportLab http://www.reportlab.com   Not well-formed  Malformed dict. (Offset: 79907)   reference.pdf
ReportLab http://www.reportlab.com   Not well-formed  Malformed dict. (Offset: 376267)  userguide.pdf
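For anyone who wants to build a similar overview, here is a rough sketch
of such a wrapper. It is not fi.py: every function name below is
hypothetical, the JHOVE command line is simply the one from the session
above, and only the "Status" and "ErrorMessage" fields are extracted,
since those are the ones visible in the sample report.

```python
import re
import subprocess

# Command line copied from the session above; the paths are relative to
# the JHOVE install directory (an assumption about where this is run).
JHOVE_CMD = ["./jhove", "-c", "conf/jhove.conf", "-m", "pdf-hul", "-k"]


def run_jhove(path):
    """Run JHOVE on one file and return its plain-text report."""
    result = subprocess.run(JHOVE_CMD + [path], capture_output=True, text=True)
    return result.stdout


def field(report, name):
    """Return the value of the first 'name: value' line, or '' if absent."""
    pattern = r"^[ \t]*%s:[ \t]*(.*)$" % re.escape(name)
    match = re.search(pattern, report, re.MULTILINE)
    return match.group(1).strip() if match else ""


def summarize(paths, run=run_jhove):
    """Return one (status, error message, file) row per path."""
    rows = []
    for path in paths:
        report = run(path)
        rows.append((field(report, "Status"), field(report, "ErrorMessage"), path))
    return rows


def format_table(rows, header=("status", "errmsg", "file")):
    """Left-align the columns, padding each one to its widest cell."""
    table = [header] + list(rows)
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )
```

Something like print(format_table(summarize(glob.glob("docs/*.pdf"))))
would then print one row per file. The run parameter is there only so the
report source can be swapped out, for example to reuse saved reports
instead of calling JHOVE again.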
I'm not sure whether this is a reason to worry. I also found PDFs
generated by other software, including Adobe software, that show similar
errors. In any case, it seems to be a useful tool to have in your toolbox.
Regards,
Dinu
[1] http://hul.harvard.edu/jhove/
[2] http://hul.harvard.edu/jhove/pdf-hul.html
    
    