Website scraping with JSoup and XMLBeam — Part 2

In the last article I covered using XMLBeam to scrape a not-so-well-formed HTML site, which caused me a lot of pain. Now I’ll look at the same task implemented with JSoup. I can say up front that the ill-formed markup caused no pain with this tool.

JSoup

JSoup is a Java library for working with real-world HTML. You can read, parse and manipulate a website, and the markup does not have to be well-formed (as it must be for XML documents). It is also quite simple to use because you can query HTML elements with CSS queries, which serve much the same purpose as XPath expressions over tags and their classes.

JSoup’s ancestor is the Python library BeautifulSoup, which does the same as its Java child: it lets you work with real-world HTML.

Timeout for reading the website

The first problem I encountered was the read timeout.

You can parse an HTML string with JSoup, or you can hand it a URL and let the tool load and parse the site itself. Loading has a read timeout, and if your target responds slowly (for example because your query makes it gather all the results first) you can get an exception. Fortunately you can adjust the timeout as needed. In my case 5 seconds were enough, although I set the application’s global timeout to 30 seconds, and you can override it with a startup parameter. I recommend keeping the value small, because it is annoying to wait a minute without a response only to get an exception saying the read timed out. It is a pity, isn’t it?

But I think you want to know how to set the timeout for the reads. It is simple:

Connection conn = Jsoup.connect("your URL as a String");
conn.timeout(timeoutInMillisecondsAsInteger);
Document site = conn.get();

As you can see, it is simple. You only have to create a connection with JSoup first and do the parsing afterwards. The resulting document is the same as what you’d get if you parsed the HTML from a string.
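The startup-parameter override mentioned above could be wired up roughly like this. It is only a minimal sketch: the class name `TimeoutConfig` and the property name `scraper.timeoutMillis` are made up for illustration, and only the 30-second default comes from the text.

```java
class TimeoutConfig {
    // Hypothetical property name; set it with -Dscraper.timeoutMillis=5000
    static final String PROPERTY = "scraper.timeoutMillis";
    // Application-wide default of 30 seconds, as described in the article
    static final int DEFAULT_MILLIS = 30_000;

    // Returns the configured timeout, falling back to the default when the
    // property is missing, not a number, or not positive.
    static int resolveTimeoutMillis() {
        String raw = System.getProperty(PROPERTY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MILLIS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_MILLIS;
        } catch (NumberFormatException e) {
            return DEFAULT_MILLIS;
        }
    }

    public static void main(String[] args) {
        System.out.println("Read timeout: " + resolveTimeoutMillis() + " ms");
    }
}
```

The resolved value can then be handed to conn.timeout(...) from the snippet above.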

Adding a proxy

The solution I mentioned for XMLBeam works with JSoup too. There is an alternative way of going through a proxy, but not in JSoup itself. I guess that is because the previously mentioned solution fits JSoup and other developers perfectly well, so there has been no need to add proxy capabilities to JSoup.

To add a proxy you have to download the website manually through a proxy connection and then let JSoup parse the result.

URL website = new URL("URL of the website");
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("address of the proxy server", proxyPortAsInt));
// Open the connection through the proxy
HttpURLConnection uc = (HttpURLConnection) website.openConnection(proxy);
uc.connect();

Once the HTTP connection has been created, you can read the website with a BufferedReader, line by line:

String line = null;
StringBuilder tmp = new StringBuilder();
BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream()));
while ((line = in.readLine()) != null) {
    tmp.append(line);
}

Document doc = Jsoup.parse(tmp.toString());

or you can use IOUtils from Apache Commons IO and read the InputStream into a String more simply:

Document doc = Jsoup.parse(IOUtils.toString(uc.getInputStream()));

I prefer the second version 🙂 It is a one-liner, and I do not have to mess around with closing the streams (did you notice that the example above never closes them? 🙂) or with creating two readers and a StringBuilder (or a StringBuffer if you want something thread-safe).
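If you would rather stay with the plain JDK, the stream closing can be handled by try-with-resources. The following is a minimal sketch, not the code from the project: the class `StreamText` and the method `readAll` are names invented here, and UTF-8 is an assumption (a real page may declare a different charset).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamText {
    // Reads the whole stream as text. The try-with-resources block closes the
    // reader (and with it the underlying stream) even if reading fails.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so put one back
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for uc.getInputStream() so the sketch runs without a network
        InputStream fake = new ByteArrayInputStream(
                "<html>\n<body>hi</body>\n</html>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(fake, StandardCharsets.UTF_8));
    }
}
```

The returned String can be passed to Jsoup.parse(...) exactly like the result of the two variants above.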

The solution

With JSoup I had less legwork to do because this tool is more tolerant of ill-formed markup than XMLBeam (or any XML parser), and I could use the CSS query syntax to make my work easier.

However, in some parts of the solution I needed a list of values (for example the “href” attributes of anchors), so I had to loop through a set of nodes and collect the results in a List of Strings.
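Such a loop might look like the following sketch. The markup, the class `HrefCollector` and the method `collectHrefs` are invented for illustration and are not taken from the customer project; only the JSoup calls (Jsoup.parse, select, attr) are the library’s own.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HrefCollector {
    // Collects the href attribute of every element matched by the CSS query.
    static List<String> collectHrefs(Document doc, String cssQuery) {
        List<String> hrefs = new ArrayList<>();
        for (Element anchor : doc.select(cssQuery)) {
            hrefs.add(anchor.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Made-up markup; parsed from a string so no network access is needed
        String html = "<ul class=\"links\">"
                + "<li><a href=\"/a.html\">A</a></li>"
                + "<li><a href=\"/b.html\">B</a></li>"
                + "<li><a>no href</a></li>"
                + "</ul>";
        Document doc = Jsoup.parse(html);
        // a[href] only matches anchors that actually carry an href attribute
        System.out.println(collectHrefs(doc, "ul.links a[href]"));
    }
}
```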

In the end the JSoup extractor came to around 450 LoC including the CSV printing. Around 50 of those lines are the overhead of a class holding the CSV data, with its getters and setters. XMLBeam let me eliminate these methods.

As I mentioned previously, the code was written for one of our customers, so I cannot share it. However, I have extracted a part to show you how I iterated over the HTML body. It is not the part I showed you with XMLBeam, because here I took the registration number from the URL instead of parsing it from the webpage.

// Find the staff boxes whose text contains "Coordination"
Elements coordination = site.select("div.staff-box:contains(Coordination)");
if (coordination.first() != null) {
    Elements children = coordination.first().children();
    // Drop the first child element of the box from the list
    children.remove(0);
    // If there is a link wrapped in a span, drop that span from the list as well
    if (children.select("span > a").first() != null) {
        Element anchor = children.select("span > a").first().parent();
        children.remove(anchor);
    }
    // Join the text of the remaining non-blank children, separated by <br/>
    StringBuilder contactDetails = new StringBuilder();
    for (Element e : children) {
        if (StringUtils.isNotBlank(e.text())) {
            contactDetails.append(e.text());
            contactDetails.append("<br/>");
        }
    }
    line.setCoordination(contactDetails.toString());
}

As you can see, with a CSS query you can search not just for elements (tags) with certain attributes and/or children; you can also search for text contained in the element’s text nodes. With this you can possibly skip positional selections such as “//div[1]” in XPath statements and make your code more readable for someone who wants to follow the steps you’ve taken. That someone could well be you some weeks, months or years later.
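To try the text-based selection in isolation, here is a small sketch against made-up markup. The class `ContainsQuery`, the method `boxText` and the sample names are assumptions for illustration; the label is pasted into the query as-is, so it is assumed to contain no parentheses.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ContainsQuery {
    // Returns the text of the first div.staff-box whose text contains the
    // label, or null when no such box exists.
    static String boxText(Document doc, String label) {
        Element box = doc.select("div.staff-box:contains(" + label + ")").first();
        return box == null ? null : box.text();
    }

    public static void main(String[] args) {
        // Two boxes with the same class; only their text tells them apart
        String html = "<div class=\"staff-box\"><h3>Administration</h3><p>Alice</p></div>"
                + "<div class=\"staff-box\"><h3>Coordination</h3><p>Bob</p></div>";
        Document doc = Jsoup.parse(html);
        System.out.println(boxText(doc, "Coordination"));
        System.out.println(boxText(doc, "Finance")); // no such box, prints null
    }
}
```

Selecting by text keeps the query independent of the order of the boxes, which a positional XPath expression would depend on.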

Next time…

One thing is left in this article series: a comparison of the two frameworks and a final conclusion, if any can be drawn.

Stay tuned.

GHajba
 

Senior developer, consultant, author, mentor, apprentice.

I love to share the knowledge and insights I gain through my daily work, which is not trivial (at least not for me).
