Pisa and Reportlab pitfalls

Generating PDFs with Django, Pisa and Reportlab and what to look out for

About a week ago, an entry about generating PDFs with Django was posted on the Uswaretech Blog. That post talks in particular about using Pisa, an HTML-to-PDF Python library, to generate complex PDFs from existing HTML pages. I took this as a chance to finish the draft of the post you are reading right now, which had been lying around for about two months and which I originally wrote to point out some pitfalls I ran into while using Pisa in a Django project.

I'm using Reportlab for PDF generation, which is a very powerful open-source Python library. Reportlab features both a good low-level API for generating documents and a higher-level abstraction with a layout engine, which knows where to insert page breaks and the like. Some documents are easy to build using the Reportlab API, especially documents that contain a lot of text and are not heavily styled. For heavily styled documents I added one more tool to do the heavy lifting, so that I could concentrate on technologies I'm fluent in. The solution was to use Pisa and write the documents in plain old HTML+CSS.

Pisa is an open-source Python library which uses html5lib to parse HTML (with CSS) documents and then creates a PDF using Reportlab. The results are pretty good, and Pisa provides some vendor-specific CSS extensions which allow styling pages with different templates and adding static headers and footers. Additionally, there are some Pisa-specific XML tags, like <pdf:pagenumber />, which allow adding page numbers, page breaks etc. to the resulting PDF.

Both Reportlab and Pisa have some documentation, but I want to document some gotchas that I couldn't find in the docs and that took me some time to figure out. (And I hope to save someone else some time figuring this stuff out.)

Where are my page numbers?

As mentioned before, Pisa allows adding static headers and footers via CSS. This is documented very well, so I will not repeat it here. One problem I ran into was adding a static footer to my pages that contains a <pdf:pagenumber /> tag and should show the current page number at the bottom of every page of the resulting PDF. The problem was that every page just showed the number 0. The solution was very simple, but I was only able to figure it out after studying the Pisa source code: you can only use the <pdf:pagenumber /> tag inside a paragraph, not (as I did) inside a table, for example. Even paragraphs inside tables don't work. It has to be a top-level paragraph.
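
The rule can be checked mechanically before a template is handed to Pisa. The following is a minimal sketch using only Python's standard library. `PageNumberChecker` and `find_misplaced_pagenumbers` are hypothetical helper names, not part of Pisa, and the `footerContent` id is just an illustrative name for the footer block. The check encodes the rule as described above: the tag needs a <p> ancestor and must not sit anywhere inside a <table>.

```python
from html.parser import HTMLParser

# Footer markup that follows the rule: the tag sits in a paragraph
# that is not nested inside a table.
GOOD_FOOTER = '<div id="footerContent"><p>Page <pdf:pagenumber /></p></div>'

# Footer markup of the kind that showed 0 on every page.
BAD_FOOTER = (
    '<div id="footerContent"><table><tr><td>'
    '<p>Page <pdf:pagenumber /></p>'
    '</td></tr></table></div>'
)

# Elements that never get a closing tag and so must not go on the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class PageNumberChecker(HTMLParser):
    """Hypothetical lint helper: records misplaced <pdf:pagenumber /> tags."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, outermost first
        self.problems = []  # ancestor chain of each misplaced tag

    def handle_starttag(self, tag, attrs):
        if tag == "pdf:pagenumber":
            # Misplaced if inside a table or not inside any paragraph.
            if "table" in self.stack or "p" not in self.stack:
                self.problems.append(list(self.stack))
            return
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open element, if any.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def find_misplaced_pagenumbers(html):
    """Return the ancestor chain of every misplaced page number tag."""
    checker = PageNumberChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

An empty list means every <pdf:pagenumber /> tag in the markup sits where the rule requires it.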

Wrong Page Breaks

The documents I was generating start with a headline and a short introductory text (about 5 lines), followed by another headline and a long table (more than one page). Pisa and Reportlab know how to do page breaks in tables, but I had the problem that every time the table was longer than the remaining space on page one, a page break appeared directly after the introductory text; the table started at the top of page two and was correctly split over the next few pages. The page break was added by Reportlab's layout engine (Platypus), which is rather smart, but I had to find out what was going on before I could understand why this was happening.

The layout engine has a concept of keep-with-next, which avoids orphaned elements at the bottom of a page. Pisa assigns a default keep-with-next attribute to all HTML headers (h1-h6), which is a good thing most of the time, but had the following consequences in my case: Reportlab knows that the headline before the table and the table should be kept together. Because both together don't fit into the remaining space on the first page, they are moved to the second page. They don't fit on that page either, but now the page break is made inside the table, because nowhere in the document will there be more space than on a new blank page.

The solution to avoid this page break and just have a normal break inside the table is to assign a CSS style of "-pdf-keep-with-next:false;" to the headline just before the table. This makes Pisa tell Reportlab not to apply keep-with-next to the headline and the table. Reportlab will put the headline on the first page, then the table, notice that the table doesn't fit on the page, and add a page break inside the table, just as one would have expected.
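
As a minimal sketch of that fix, the helper below builds the headline and table markup with the style applied inline. `render_table_section` is a hypothetical name and the h2 level is an assumption; only the "-pdf-keep-with-next" property comes from the text above.

```python
from html import escape


def render_table_section(title, rows):
    """Return HTML for a headline followed by a (possibly very long) table.

    The inline style switches off Pisa's default keep-with-next on the
    headline, so the table may start on the current page and break inside.
    """
    body = "\n".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return (
        f'<h2 style="-pdf-keep-with-next: false;">{escape(title)}</h2>\n'
        f"<table>\n{body}\n</table>"
    )
```

The same declaration could presumably live in a stylesheet rule instead, for example on a class used only for headlines that precede long tables.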

Adding Pictures to the PDF

This one is rather trivial and not really a pitfall, but as it fits nicely into this topic I'm going to write it down here. To get pictures that are visible on the HTML page into the PDF, you should define a link callback function which knows how to translate a src attribute from the HTML document into a local path to the image. If you are processing remote files, this callback could even fetch the image, but it has to return a path where Reportlab can find the image on the filesystem, not a file-like object or something else. A very simple link callback function which should work for most Django projects could look like this:

import os
from django.conf import settings

def fetch_resources(uri, rel):
    """
    Callback that lets pisa/reportlab retrieve images, stylesheets, etc.
    `uri` is the src or href attribute from the HTML element.
    `rel` gives a relative path, but it's not used here.
    """
    path = os.path.join(settings.MEDIA_ROOT, uri.replace(settings.MEDIA_URL, ""))
    return path
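
To show where such a callback goes, here is a minimal sketch. It assumes Pisa is installed under its later package name xhtml2pdf and uses its `pisa.CreatePDF` entry point with the `link_callback` argument. `media_uri_to_path` and `render_pdf` are hypothetical names. The path translation takes the media URL and root as parameters instead of reading Django settings, so it can be tried in isolation, and it strips the prefix only at the start of the URI, which differs slightly from the `replace` call above.

```python
import io
import os


def media_uri_to_path(uri, media_url, media_root):
    """Translate a media URL from the HTML into a filesystem path."""
    if uri.startswith(media_url):
        uri = uri[len(media_url):]
    return os.path.join(media_root, uri)


def render_pdf(html, link_callback):
    """Convert an HTML string to PDF bytes, resolving resources via the callback."""
    # Assumption: Pisa is installed as xhtml2pdf. In 2008 the import was
    # `import ho.pisa as pisa` and the function was called pisaDocument.
    from xhtml2pdf import pisa

    result = io.BytesIO()
    status = pisa.CreatePDF(html, dest=result, link_callback=link_callback)
    if status.err:
        raise ValueError("PDF generation failed")
    return result.getvalue()
```

In a Django view this might be called as `render_pdf(html, fetch_resources)`, with the returned bytes sent back in a response of content type application/pdf.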

Published by Arne Brodowski on 16 Oct. 2008, 10:39 in django, pdf, pisa, python, reportlab.

Comments

My company has tried both reportlab and pisa but found neither good enough. Then we tried pyuno, the OpenOffice Python API. You can do some interesting things with it, such as modify the resulting document, export it to odt, import HTML into a document, use template-based design for documents and so on.

We found it the most interesting alternative for now. I encourage you to give it a chance.

Posted by esauro 2 hours, 5 minutes after the blog entry was published, on 16 Oct. 2008, 12:44.
pyuno looks interesting, but a bit more complex than pisa+reportlab. Thanks for pointing it out, I will look into it the next time I need such functionality.

Posted by Arne 2 hours, 12 minutes after the blog entry was published, on 16 Oct. 2008, 12:51.
An even better solution using OpenOffice is to 1) create your template using OOWrite (maybe also inserting some placeholders), 2) unzip/extract content.xml and style.xml from yourtpl.odt, 3) process these files with your favourite XML processor (mine is lxml) to populate your data, 4) replace the files in a copy of yourtpl.odt with the populated ones, and 5) use a few lines of Python or Jython to take control of a listening OO session in order to open the new odt and export it to PDF. Works like a charm for me.

Posted by Olive 4 hours, 3 minutes after the blog entry was published, on 16 Oct. 2008, 14:42.
Olive: I've done this already, but I think the solution with pisa is much easier, considering that the HTML page already exists.

Posted by Arne 7 hours, 15 minutes after the blog entry was published, on 16 Oct. 2008, 17:54.
I'm moving from an OO solution to Pisa. I need to render PDFs on the fly on a web server. I used to run OO with Xvfb on Linux, which was not very stable: it leaked memory and couldn't handle heavy simultaneous access. Letting multiple users access a single OO instance is potentially insecure, and OO is too big to start for each request. The only nice thing about the OO solution is that my customer can easily edit the template using their familiar tools, OO or MS Office.

I used to think that the result of HTML rendering was unpredictable and that positions were hard to control, for example for an invoice letter whose name and address need to be visible through an envelope window, with a perforation at a precise position. CSS is actually capable of describing this, and Pisa+Reportlab's rendering is mature enough for my purpose. It's easy to fix things, as it's Python.

I'm a very happy Pisa user!

Posted by kenboo 1 year, 10 months after the blog entry was published, on 27 Aug. 2010, 05:20.
Hi Arne,

I am the author of Pisa. I like this article very much and will try to eliminate the described pitfalls as soon as possible. Thanks for using the tool.

If anyone else runs into problems with Pisa, do not hesitate to join the mailing list and I will try to help as soon as possible:

http://groups.google.com/group/xhtml2pdf

Dirk

Posted by Dirk Holtwick 6 days, 1 hour after the blog entry was published, on 22 Oct. 2008, 12:10.
hi arne,

i'm fairly new to django and not too fluent in python either. your approach to adding pictures to the pdf sounds quite interesting, but how do i actually use it in my code. i have rendered my template and have the html file i want to convert, but how do i actually change each path in the existing html?

amy

Posted by amy 1 month, 3 weeks after the blog entry was published, on 7 Dec. 2008, 22:29.
Amy, you don't have to change the paths in your HTML template, as long as the pictures show up correctly in a web browser.

The code above is a function which is passed to the pisa function that constructs the PDF. The purpose of the function is to translate the image URLs in the HTML page into filesystem paths, which can be used to embed the pictures in the PDF document.

Posted by Arne 1 month, 3 weeks after the blog entry was published, on 8 Dec. 2008, 06:16.
hello arne,

when trying to handle pagebreaks IN tables with pisa i'm facing some problems:
i'm not sure how/where to apply the "-pdf-page-break". as you pointed out in your article "Pisa and reportlab know how to do pagebreaks in tables" it would be great if you could give me a short example.

thanks a lot, axel

Posted by axel 2 months, 3 weeks after the blog entry was published, on 7 Jan. 2009, 17:10.
Hi axel,

"Pisa and reportlab know how to do pagebreaks in tables" means that they will automatically break a table that is longer than one page at the end of the pages and continue the table on the next page.

Applying -pdf-page-break should not be needed inside the table. I don't know if it would be even possible but I suspect not.

Maybe you could ask for help at the google group for pisa: http://groups.google.com/group/xhtml2pdf

Posted by Arne 2 months, 3 weeks after the blog entry was published, on 7 Jan. 2009, 17:35.
hello arne,

thank you for the hint - i thought -pdf-page-break might be used to force/avoid pagebreaks wherever you want.

as i have ongoing troubles with the page-breaks (also the CSS page-break-before etc.) i'll ask for help at the group you mentioned.

Posted by axel 2 months, 3 weeks after the blog entry was published, on 8 Jan. 2009, 14:10.
hey folks!

i found another little pitfall in the image-processing-part (that i could find neither in your information nor in the official documentation), so i'm going to post it here, maybe it saves somebody's valuable worktime.

you _have_to_ use the height and width attribute in the <img /> tags of the images that should be placed in the pdf. if you don't have them, pisa will just ignore the images.

arne, thanks for this site, it saved me a lot of time :)

cheers, nitram

Posted by nitram 4 months, 1 week after the blog entry was published, on 23 Feb. 2009, 14:24.
@nitram: Thank You nitram!!!!

Thanks a lot, really. Your _have_to_ use advice helped me a lot.

Posted by Gaston Ingaramo 1 year, 2 months after the blog entry was published, on 7 Jan. 2010, 15:50.
Hi, Thanks for your article.

I am successfully converting an html page with multiple tables to PDF using tips from this article and the one at UsWareTech.

However, the rendered PDF does not show any table styling. Not even simple table borders, only text. My html tables are styled with CSS, with lots of different styling for <th> <tr> etc. Is it possible to have that styling show up in the generated PDF?

Posted by S Kujur 1 year, 6 months after the blog entry was published, on 19 April 2010, 07:11.
To answer my own question, see this link

http://groups.google.com/group/xhtml2pdf/browse_thread/thread/44c0619a60c6b69/

Posted by S Kujur 1 year, 6 months after the blog entry was published, on 19 April 2010, 07:23.
Hi,

When I add <pdf:pagenumber/> to my file I get double page numbers e.g. 11 instead of 1 on the first page. Any thoughts?

Posted by Mike Gauthier 1 year, 8 months after the blog entry was published, on 17 June 2010, 01:39.
For <pdf:pagenumber /> generating 11 instead of 1, please see http://groups.google.com/group/xhtml2pdf/browse_thread/thread/46dfce4237310c15/88ee497ec03961b9?lnk=gst&q=pagenumber+11#88ee497ec03961b9

Posted by S Kujur 1 year, 8 months after the blog entry was published, on 8 July 2010, 10:02.
It is bad form to use spaces in URLs. I was working on a site today where the admin had uploaded some images with spaces in the names. While most browsers tolerate such bad URLs, the HTML-to-PDF program did not accept them. In Django I created a template filter to replace the spaces with %20, and now everything is working fine.

Posted by Paul Egges 1 year, 8 months after the blog entry was published, on 13 July 2010, 06:51.