
XML Processing and the GAE

The problem with some articles is that I get the idea for them while working on a specific topic, but I end up writing the article itself weeks or months later. This article has the same issue: I thought about it in the middle of May, and now I’ve forgotten why I wanted the Google App Engine (GAE) to be a part of the topic.

Let’s see what I get at the end.

Because I do not know what my intention was when I added this title to a completely empty draft article, I’ll stick to the title and write about XML transformations on the GAE.

Writing about development for a GAE environment is always kind of “fun” because you have your solutions, and in the end you get a punch in the face from GAE: some classes you want to use are not permitted. Then you start looking for a solution inside the feasible area.

This was the case when we (a co-worker and I) had the task of rendering an XML document (provided from somewhere somehow) in various formats: PDF and RTF. And as a bonus (because rendering those documents was not the easiest thing) I implemented a web-based display too, to check that we get the right data and that it is displayed as it should be.

XML to HTML

As I mentioned, this was not a requirement, but I wanted to see results as soon as possible, so I added an HTML display of the XML input.

Converting XML to HTML is easy: you only need to do an XSL Transformation (XSLT) and you are done. As the result you get an HTML file (or XML or text, depending on your configuration). But writing such a result file is a no-go on GAE because you are not allowed to create files dynamically from your application.

Nevertheless, you can still display your XML data as an HTML page: you only have to add the stylesheet reference to your data and most browsers will apply it and display the result.

How to add the stylesheet? This one is easy: you add a processing instruction that references the stylesheet to your XML data, before the root element. For example:

<?xml-stylesheet type="text/xsl" href="stylesheets/detailHtml.xsl"?>

This instruction tells the browser to transform the XML to HTML with XSLT (detailHtml.xsl contains the transformation rules).

If you get your data from an interface (for example from a SOAP service) you have to be a bit tricky to get the stylesheet reference into your XML, because you get all of the data as one XML dataset. The solution you end up with: replace the start tag of the root node with the stylesheet instruction followed by the start tag itself. With this workaround you can alter the XML dataset and display it with XSLT. And this works on GAE too.
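
The root-node trick can be sketched in a few lines of plain Java. This is only a minimal sketch, not the code from our project: the class name StylesheetInjector and the method addStylesheet are made up for this example, and it assumes that no comment in front of the root element contains something that looks like a tag and that the href contains no quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for this example only.
final class StylesheetInjector {

    // Matches the first start tag: '<' followed by a letter or underscore.
    // The XML declaration (<?xml ...?>), comments (<!-- -->) and a DOCTYPE
    // start with "<?" or "<!" and are therefore skipped.
    private static final Pattern ROOT_START = Pattern.compile("<[A-Za-z_]");

    // Returns the XML with an xml-stylesheet instruction placed directly
    // in front of the root element.
    static String addStylesheet(String xml, String href) {
        String instruction =
            "<?xml-stylesheet type=\"text/xsl\" href=\"" + href + "\"?>\n";
        Matcher matcher = ROOT_START.matcher(xml);
        if (!matcher.find()) {
            throw new IllegalArgumentException("no root element found");
        }
        int rootStart = matcher.start();
        return xml.substring(0, rootStart) + instruction + xml.substring(rootStart);
    }
}
```

You would call it with the XML string you received from the service, for example StylesheetInjector.addStylesheet(xml, "stylesheets/detailHtml.xsl"), and return the result to the browser as XML.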

XML to PDF

Converting XML to PDF is, in principle, simple too: with XSLT you create an XSL-FO (FO for Formatting Objects) document from your XML. An FO document is an XML document using element names (node names) from the FO namespace. After this you send the resulting FO document to a render engine (for example Apache FOP) and you get your PDF.
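
To make the first step a bit more concrete, here is a minimal sketch of an XML to XSL-FO transformation with the standard javax.xml.transform API. The stylesheet, the input structure (/person/name) and the class name FoTransformSketch are assumptions for this example only. The result is kept in memory as a String, so nothing is written to the file system. Whether your GAE runtime permits these classes is something you have to check yourself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch, named for this example only.
final class FoTransformSketch {

    // Made-up stylesheet: puts the text of /person/name into a single fo:block.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/\">"
      + "<fo:root>"
      + "<fo:layout-master-set>"
      + "<fo:simple-page-master master-name=\"page\">"
      + "<fo:region-body/>"
      + "</fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference=\"page\">"
      + "<fo:flow flow-name=\"xsl-region-body\">"
      + "<fo:block><xsl:value-of select=\"/person/name\"/></fo:block>"
      + "</fo:flow>"
      + "</fo:page-sequence>"
      + "</fo:root>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms the given XML into an XSL-FO document held in memory.
    static String toFo(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("transformation failed", e);
        }
    }
}
```

The returned String is the FO document that a render engine would take as its input.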

Sounds simple, however GAE does not allow some of the classes which are used by Apache FOP (for example AWT graphics). So another workaround is needed.

iText is a good alternative to FOP, however it does not handle FO documents. Nevertheless, iText has an XML Worker project which is meant for rendering XML (XHTML) documents. This sounded very good, so I gave it a try.

Unfortunately I had some problems with applying the required CSS to the XHTML output (some rules worked, some did not) and as far as I can remember the XML Worker had some problems with displaying the required images too. And besides the images there is a requirement to use specific fonts when displaying the texts, and this is hardly manageable either when it comes to XHTML to PDF conversion (or at least I did not find a good-enough solution).

So I ended up creating the PDF manually with iText, adding each element on its own programmatically. To achieve this I created a custom XML extractor which splits the provided XML result document into some classes (grouped by coherence) and adds display information to these classes.
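
To give an idea of what such an extractor can look like, here is a minimal sketch with the DOM parser of the JDK. It is not our project code: the element and attribute names (section, item, title, highlight) and the class names are invented for this example, and the only display information it carries is a single flag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical extractor, named for this example only.
final class XmlExtractorSketch {

    // One coherent group of data plus a display hint for the renderer.
    static final class Section {
        final String title;
        final boolean highlighted; // display information
        final List<String> lines = new ArrayList<>();

        Section(String title, boolean highlighted) {
            this.title = title;
            this.highlighted = highlighted;
        }
    }

    // Splits the XML document into Section objects, one per <section> element.
    static List<Section> extract(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<Section> sections = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("section");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                // getAttribute returns "" for a missing attribute
                Section section = new Section(
                    element.getAttribute("title"),
                    "true".equals(element.getAttribute("highlight")));
                NodeList items = element.getElementsByTagName("item");
                for (int j = 0; j < items.getLength(); j++) {
                    section.lines.add(items.item(j).getTextContent().trim());
                }
                sections.add(section);
            }
            return sections;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }
}
```

The PDF code then only has to loop over the sections and add one element per section and line, using the display flag to pick the font or style.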

This was the least time-consuming solution. I could have taken a look at Flying Saucer (which has the same purpose as the XML Worker: to create PDF from XHTML) but as I mentioned we needed the data as quickly as possible. However, if I get some free time between my projects I’ll take a look at Flying Saucer and try out how well it generates the required PDF from XHTML.

XML to RTF

The second requirement was to create an RTF document. Why RTF? Because it can be displayed on various platforms (Windows, Mac OS, Linux), there exist some open source tools to create RTF documents, and it has been used in another project successfully.

Well. What you have to know about RTF: it was a Microsoft standard until 2008. Since then Microsoft has given it up and does not improve it any further. They work on their new standard instead (the “.*X” in my terminology for .docX, .xlsX, .pptX). Besides this, RTF was never supported 100% by any document editor other than MS Word. And the tool which had been used (namely jRTF) does not implement all of the features, and creating documents with it can be a pain in the neck.

iText had an RTF generator too (alongside the PDF generator) but it isn’t developed anymore. So we did not even try to use it for our purposes.

But we did it, as far as we could. Some features (like embedding fonts) do not work, so we had to loosen the requirements for the RTF documents.

You could ask why RTF documents are needed if you have a PDF. Well, the PDF can contain too much information, or the display order of the data might not be the best (a good example of this is a CV or some management reports), and the user wants to alter the document.

XML to “.*X”

Yes. Finally here it comes. I mentioned Microsoft’s new standard for documents above. And there is a possibility that the management will decide to create a Word document from the provided XML data.

Currently we’re evaluating the possibilities and features of frameworks such as Apache POI and docx4j, because we already have the data extracted from the XML to create the docx programmatically. In parallel we are evaluating a solution to transform the XML data into a docx with XSLT. If we get any results I’ll write an article about it.

Exporting the files in GAE

As mentioned before, GAE does not let you write files to the file system. So if you want to create a document (PDF, RTF, docX or anything else) you have to return it from your web application instead of saving it to the file system of the server. How to do it?

If you have a class extending javax.servlet.http.HttpServlet in GAE this is easy.

Most of the libraries let you write the document (PDF or RTF, let’s stick to these two) to a java.io.OutputStream. If you use a java.io.ByteArrayOutputStream you can convert the result to a byte array (byte[]), and you write this byte array to the response to enable downloading the created document. Let me show some example code with the overridden doGet method and a PDF document:

@Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    // set the headers before any content is written to the response
    resp.setContentType("application/pdf"); // set the content type of the result
    resp.setHeader("Content-Disposition", "inline; filename=Result.pdf"); // set the name of the resulting PDF file

    OutputStream os = resp.getOutputStream();
    os.write(createPDF().toByteArray()); // createPDF() returns a ByteArrayOutputStream
    os.flush();
}

If you want to export an RTF you have to alter the content type as follows:

resp.setContentType("application/rtf"); // set the content type of the result
resp.setHeader("Content-Disposition", "inline; filename=Result.rtf"); // set the name of the resulting RTF file

I included the setHeader call only to show that it is a good idea to change the file extension from .pdf to .rtf as well.

With these settings RTF files will be downloaded, while PDF files are only displayed in the browser (if you have this option enabled and your browser is capable of rendering the PDF file without downloading it). However, if you do not want the browser to display the PDF file but just download it, you can extend the content disposition as follows:

resp.setHeader("Content-Disposition", "attachment; filename=Result.pdf"); // set the name of the resulting PDF file
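
If you export more than one format, the two header values can be computed in one place. The following is only a sketch under my own assumptions: the class DownloadHeaders and its methods are made up for this example, it knows only the two formats from this article, and it simply replaces every character outside of letters, digits, dot, underscore and hyphen in the file name so that the quoted value stays valid.

```java
import java.util.Locale;

// Hypothetical helper, named for this example only.
final class DownloadHeaders {

    // Picks the content type from the file extension.
    static String contentType(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return "application/pdf";
        }
        if (lower.endsWith(".rtf")) {
            return "application/rtf";
        }
        return "application/octet-stream"; // generic fallback for unknown types
    }

    // Builds the Content-Disposition value: "attachment" forces a download,
    // "inline" lets the browser display the document if it can.
    static String contentDisposition(String fileName, boolean download) {
        String safeName = fileName.replaceAll("[^A-Za-z0-9._-]", "_");
        String type = download ? "attachment" : "inline";
        return type + "; filename=\"" + safeName + "\"";
    }
}
```

In the servlet you would then pass DownloadHeaders.contentType("Result.pdf") to resp.setContentType and DownloadHeaders.contentDisposition("Result.pdf", true) to resp.setHeader("Content-Disposition", ...).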

That was it. For this article there’s no GitHub repository. If you’d like to see a sample application just mail me or write a comment here and I’ll create the application on GAE as soon as possible (I’d say within a week).
