Website scraping with JSoup and XMLBeam — Part 3
In this article I’ll do a final comparison of the two libraries. Do not expect anything professional, though: it is just a simple run-and-measure-time analysis of the two libraries on my dataset, on two machines.
Comparison of runtime
Let me state it first: do not expect anything complicated in this section. I did not set up a rigorous benchmark with a prepared workload or anything like that. I just let the code run on my machine at work and on the one at home.
I prepared 4 scenarios for the run:
- Execute the application only with JSoup extraction (so read all the results and write the CSV file).
- Execute the application only with XMLBeam extraction (the same as above).
- Execute the application first with JSoup, then with XMLBeam (in the same run, calling both from the main method, the second one after the first has finished).
- Execute the application first with XMLBeam then with JSoup (the same as above).
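A minimal sketch of how such scenarios can be timed. The two `Runnable`s are hypothetical stand-ins, not my real extractors, and the class and method names are made up for illustration; swapping the order of the two entries switches between scenario 3 and scenario 4.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ScenarioTimer {

    /** Runs one task and returns its wall-clock duration in milliseconds. */
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Runs the named tasks in insertion order and records each duration. */
    public static Map<String, Long> runInOrder(Map<String, Runnable> tasks) {
        Map<String, Long> durations = new LinkedHashMap<>();
        for (Map.Entry<String, Runnable> entry : tasks.entrySet()) {
            durations.put(entry.getKey(), timeMillis(entry.getValue()));
        }
        return durations;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: replace with the real JSoup and XMLBeam extraction calls.
        Runnable jsoupExtractor = () -> pause(50);
        Runnable xmlBeamExtractor = () -> pause(50);

        // Scenario 3: JSoup first, then XMLBeam. Swap the two puts for scenario 4.
        Map<String, Runnable> scenario = new LinkedHashMap<>();
        scenario.put("JSoup", jsoupExtractor);
        scenario.put("XMLBeam", xmlBeamExtractor);

        runInOrder(scenario).forEach(
                (name, millis) -> System.out.println(name + ": " + millis + " ms"));
    }

    /** Simulates work by sleeping; only used by the stand-in extractors. */
    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }
}
```

`System.nanoTime()` is used instead of `System.currentTimeMillis()` because it is meant for measuring elapsed time and is not affected by clock adjustments.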
If I have time I’ll repeat the runs multiple times, because the runtime can vary a lot. I also tried not to do anything else on the computers, because that would affect the measurements. Currently I’m working on a migration project where the processor runs at 100% for 30-50 hours (file comparison and other nasty things with the data in the files), so if I started the extraction application in parallel, the normal running time of 10-15 minutes per extractor could rise to 1 hour.
If I had more time I could run scenarios 1 and 2 in a loop, but that takes too long: 500 seconds is almost ten minutes, and 5 loops of a scenario take almost one hour. That is too much time to spend not working, in the office or at home.
Before I start digging into the test results, I should tell you the configuration of the machines I ran my tests on. Here it comes:
- Home: MacBookPro @ 2.30 GHz (x 8), 16 GB 1600 MHz DDR3
- Work: MacBookPro @ 2.66 GHz (x 4), 8 GB 1067 MHz DDR3
So here are the results.
First, the last two scenarios compared per extractor on my notebook at home:
As you can see above, whichever extractor runs first finishes in less time. This may be caused by Java garbage collection running while the second extractor is working.
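The garbage collection guess can be checked rather than assumed. Here is a rough sketch, not something I ran for the numbers in this article, that reads the JVM's collector counters before and after a task using the standard `java.lang.management` API. The class name `GcProbe` and the allocating stand-in task are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcProbe {

    /** Sum of collection counts over all collectors of this JVM. */
    public static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Sum of accumulated collection times (ms) over all collectors of this JVM. */
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 if the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    /** Returns { elapsed ms, collections during the task, GC ms during the task }. */
    public static long[] measure(Runnable task) {
        long countBefore = totalGcCount();
        long timeBefore = totalGcTimeMillis();
        long start = System.nanoTime();
        task.run();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new long[] {
            elapsedMillis,
            totalGcCount() - countBefore,
            totalGcTimeMillis() - timeBefore
        };
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for an extractor: builds many short-lived strings.
        Runnable allocating = () -> {
            StringBuilder sink = new StringBuilder();
            for (int i = 0; i < 200_000; i++) {
                sink.setLength(0);
                sink.append("row-").append(i);
            }
        };
        long[] result = measure(allocating);
        System.out.println("elapsed ms: " + result[0]
                + ", collections: " + result[1]
                + ", GC ms: " + result[2]);
    }
}
```

If the second extractor in a run shows a clearly higher GC time than the first, that would support the guess; if not, something else is responsible for the difference.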
I got some free CPU time at work (during the night), so I added a loop to the execution. Below you can see the results of the tests:
As you can see, in the first iteration the extractor that runs first takes slightly longer (remember: in scenario 4 XMLBeam runs first, followed by JSoup).
In the meantime I altered the first two scenarios: for each tool I ran the whole extraction 5 times in a loop within a single run, to measure performance under workload and with garbage collection running (if needed; that always depends on the VM). This gives a better picture than running an extractor only once.
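The looped variant could look roughly like the following sketch. The extractor is again a hypothetical stand-in and the names are only for illustration; the point is to keep every single duration and then look at minimum, maximum and average instead of one number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.LongSummaryStatistics;

public class LoopedRun {

    /** Runs the task the given number of times and returns each duration in milliseconds. */
    public static List<Long> timeRepeated(Runnable task, int iterations) {
        List<Long> durations = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            task.run();
            durations.add((System.nanoTime() - start) / 1_000_000);
        }
        return durations;
    }

    /** Collects min, max, average and count of the measured durations. */
    public static LongSummaryStatistics summarize(List<Long> durations) {
        return durations.stream().mapToLong(Long::longValue).summaryStatistics();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in: replace with one complete extraction (fetch and write CSV).
        Runnable extractor = () -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        };

        List<Long> durations = timeRepeated(extractor, 5);
        LongSummaryStatistics stats = summarize(durations);
        System.out.println("runs (ms): " + durations);
        System.out.println("min=" + stats.getMin()
                + " max=" + stats.getMax()
                + " avg=" + stats.getAverage());
    }
}
```

Keeping the individual durations makes it easy to spot a single outlier run, which an average alone would hide.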
| At work | At home |
As you can see, there is one peak at work with XMLBeam of about 900 seconds for the 900-plus records to fetch and write. I guess something else ran on my machine in the meantime, or the network connection was bad, as it sometimes is.
Maybe I could implement such a looping variation for scenarios 3 and 4 too, although each would take about 2 hours to finish. At work I do not have 4 hours of free time (two scenarios at 2 hours each) in which I can monitor the application and the runtimes.
Eventually I’ll do this one evening: start the application with a loop for each scenario and collect the results the next day. That would be it. If I finish it before this article is published, I’ll let you know.
Another option would be to set up my old Raspberry Pi with a new image and Java and let the processes run there. I guess it would take a bit more time, because it is much less powerful than my two Macs, but it would give me another benchmark. If I do this I’ll tell you the results. But first I need to find my RasPi somewhere in the vault 🙂
There is no golden rule I can give you about which tool to use from here on. I only have some advice:
- If you have ill-formed HTML: use JSoup. It handles this problem for you, and you do not have to do any extra chores (tidying the HTML, disabling validation and so on).
- If you have well-formed HTML: use JSoup. You have more freedom to play with CSS queries than with XPath 1.0. Naturally you sometimes need a trick to get all the text you want, but it is not so bad. And JSoup is a little bit faster.
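To illustrate why ill-formed HTML is a problem for XML-based tooling, here is a small sketch using only the JDK's own XML parser and XPath 1.0 engine. It is not XMLBeam's API, the class and method names are made up, and the two markup strings are invented examples: the strict parser accepts the well-formed snippet and rejects the one with unclosed tags.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class StrictParseDemo {

    /** Parses markup with the JDK's XML parser; returns null if it is not well-formed. */
    public static Document parseStrict(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException notWellFormed) {
            return null; // the parser may also log the error to stderr
        } catch (ParserConfigurationException | IOException ex) {
            throw new IllegalStateException(ex);
        }
    }

    /** Evaluates an XPath 1.0 expression against the document and returns the string result. */
    public static String xpath(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (XPathExpressionException ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        // Invented example markup: the second string leaves <p> and <br> unclosed.
        String wellFormed = "<html><body><p class=\"title\">Hello</p><br/></body></html>";
        String illFormed = "<html><body><p class=\"title\">Hello<br></body></html>";

        Document ok = parseStrict(wellFormed);
        System.out.println("well-formed title: " + xpath(ok, "//p[@class='title']"));
        System.out.println("ill-formed parsed: " + (parseStrict(illFormed) != null));
    }
}
```

With JSoup the second string would not need any special handling, which is the point of the first piece of advice.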
Personally I prefer XMLBeam because it needs less code: you can only get lost in XPath, and you do not need to loop through child nodes, although sometimes looping is the best solution.