Validating XML against XSD

Recently one of our customers asked how we validate XML files against their XSD definition. We offer an XML-based interface to our system: customers provide XML data, we read it into the system, and naturally we throw an exception if the provided XML does not match our expectations, which are defined in the XSD.

We suggested some online resources where you can paste the XML and the XSD and validate them, and we mentioned that open source tools can do this too. A few days later the customer replied that they were unable to validate our example XMLs against the provided schema, so I should give a better hint about how we do it. This is why I created a simple Java application which takes some parameters and validates the XML against the XSD. Afterwards I looked at how the same thing could be done with Python.

The Java solution

There’s nothing fancy in this application: if you search the web for XML XSD validation you will find tons of code and tools. But I thought it was a great opportunity to provide our customers with a free, self-developed mini-application they can use. And for me it’s a chance to enhance this version later, build a GUI around the validator and so on…

Well, I’ll be lazy with this post: I will not include the Java source code in the text, nor any example XMLs or XSDs. If you want to try it out, you’ll find plenty of resources online.

To start the application (you can download the sources from GitHub), enter the following command:

java XSDValidator <xml files> <xsd files>

I made a fault-tolerant version, so you can enter the file names in any order. The only thing you have to take care of: the file names have to end with either .xml or .xsd.

Looking at Python

But I’m currently curious about Python, so I looked up how I could solve the same problem in that language…

I did what every good, seasoned developer does in this case: enter a search term into a favourite search engine to look up the available resources. Well, there is not that much. Most of the links refer to lxml, so this is the way I will go too.

First of all I had to switch from Windows to Linux, because lxml is only available for some Python versions and there is no distribution for 3.3.1 on Windows. And because Microsoft is so kind as to let everyone install a C compiler on their own when needed, I did not try to set up a C compiler, Cython and the other pieces required to compile the current version of lxml on my computer.

However, on my Linux partition I have Python 2.7.4, which is ideal for this little test. After installing libxml2 and libxslt (and the -dev packages of both), the installation of lxml was very easy. There is also a tutorial on the lxml website on how to use the library to validate XML against an XSD. Yes, you’re right: one XML against one XSD. So if you have to validate against more than one XSD, they are hopefully importing or referencing each other, and then it is best to validate only against the schema which pulls in the others; otherwise you can get errors (more about the errors a bit later)…

So I ported the Java application to Python, with one difference: I validate every XML file against every XSD file provided at startup.

def validate_files():
    """Validates every XML file against every schema file."""
    for schema in xsd_files:
        xmlschema = etree.XMLSchema(file=schema)
        for file in xml_files:
            xml_file = etree.parse(file)
            if xmlschema.validate(xml_file):
                print(file + " is valid against " + schema)
            else:
                log = xmlschema.error_log
                print(file + " is not valid against " + schema)
                for error in log:
                    print("\tReason: " + error.message)

In line 4 you can see a simple way of creating a schema: you can pass a file name (including the path) to the constructor, so you do not have to handle opening the file and converting its content yourself.

The validation (line 7) returns True or False and records the validation errors in an error_log object. This is way nicer than the Java version with exceptions, because you do not have to surround the for loop with a try-except block.

But I have something to complain about in lxml too: it took some work to find the right attribute of the error object which holds the error message. I found it neither in the tutorials nor in the API documentation, so I had to print out the structure of the object with dir() and guess (thankfully the naming was very good).
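
To save others the dir() detour, here is a minimal sketch of what the log entries offer and of the exception-based alternative to validate(). The schema and the two documents are tiny in-memory examples made up for this illustration; they are not our real interface definition. The entry attributes used here (line, column, level_name, message) and assertValid() are part of lxml, but treat the rest as a sketch rather than the code from my repository.

```python
from io import BytesIO

from lxml import etree

# Tiny in-memory schema and documents, made up for this illustration.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""
GOOD = b"<note><to>customer</to></note>"
BAD = b'<note id="1">\n  <from>me</from>\n</note>'

xmlschema = etree.XMLSchema(etree.parse(BytesIO(XSD)))

# validate() only answers True or False; an invalid document raises nothing.
good_result = xmlschema.validate(etree.parse(BytesIO(GOOD)))
bad_result = xmlschema.validate(etree.parse(BytesIO(BAD)))

# Each entry of the error log describes one problem of the last validation.
problems = [(e.line, e.column, e.level_name, e.message)
            for e in xmlschema.error_log]
for line, column, level, message in problems:
    print("line %d, column %d [%s]: %s" % (line, column, level, message))

# The exception-based alternative: assertValid() raises DocumentInvalid.
try:
    xmlschema.assertValid(etree.parse(BytesIO(BAD)))
    raised = False
except etree.DocumentInvalid as exc:
    raised = True
    print("Raised: " + str(exc))
```

One caveat: this only covers validity. etree.parse() still raises etree.XMLSyntaxError for XML that is not well-formed, so a try-except around the parsing may still be worth having.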

And as always, the source is available in my GitHub repository.
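
The validate_files() function shown earlier relies on two lists, xml_files and xsd_files. A minimal sketch of how they might be filled from the command line, assuming the same rule as in the Java version: the extension decides, the order does not matter. The helper name split_by_extension and the case-insensitive matching are my assumptions for this illustration, not necessarily what the script in the repository does.

```python
import sys


def split_by_extension(args):
    """Sort file names into XML and XSD lists by their extension.

    Hypothetical helper: the order of the arguments does not matter,
    and anything that ends with neither .xml nor .xsd is ignored.
    """
    xml_files, xsd_files = [], []
    for name in args:
        lowered = name.lower()  # assumption: treat .XML like .xml
        if lowered.endswith(".xml"):
            xml_files.append(name)
        elif lowered.endswith(".xsd"):
            xsd_files.append(name)
    return xml_files, xsd_files


if __name__ == "__main__":
    xml_files, xsd_files = split_by_extension(sys.argv[1:])
    print("XML files: " + ", ".join(xml_files))
    print("XSD files: " + ", ".join(xsd_files))
```

Silently ignoring unknown extensions keeps the tool forgiving; printing a warning for skipped arguments would be an equally reasonable choice.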

About the errors

Earlier I mentioned possible errors when validating one XML file against multiple XSDs. For example, you have a base XSD (possibly provided by a third party) which you need to extend (in XSD terms: you want to redefine the provided schema). At validation time you start the application with the XML to validate and both XSDs. So what result do you get?

With the Java version you get no errors, because it validates against both definitions at the same time, so your XML should pass the test. If you validate against the base schema only, you can get an error, for example about misplaced or invalid elements and attributes:

Reason: cvc-complex-type.3.2.2: Attribute 'some-attribute' is not allowed to appear in element 'some-element'.

Apart from this, there are no problems with redefined XSDs in the Java version.

In Python, if I run the script with the two XSDs and one XML, I get the following result (or something similar):

my-test.xml is valid against redefined-schema.xsd
my-test.xml is not valid against original-schema.xsd
    Reason: Element '{http://www.hahamo.biz/original-schema/V0.2}some-element', attribute 'some-attribute': The attribute 'some-attribute' is not allowed.
    Reason: Element '{http://www.hahamo.biz/original-schema/V0.2}other-element', attribute 'other-attribute': The attribute 'other-attribute' is not allowed.
    Reason: Element '{http://www.hahamo.biz/original-schema/V0.2}other-element', attribute 'another-attribute': The attribute 'another-attribute' is not allowed.
    Reason: Element '{http://www.hahamo.biz/original-schema/V0.2}other-element', attribute 'third-attribute': The attribute 'third-attribute' is not allowed.
    Reason: Element '{http://www.hahamo.biz/original-schema/V0.2}some-new-element': This element is not expected. Expected is ( {http://www.hahamo.biz/original-schema/V0.2}some-old-element).

Again, I find this solution better than the Java version: here you get all the errors, and the validation does not stop at the first exception, so all your problems are listed.

Conclusion

After implementing the Python version of the validator, I am seriously thinking about sending this version to our customer so they get a list of all failures in their XMLs. The problem is that hardly any ordinary user has Python installed, whereas Java is usually there or can be installed easily. For Python you have to install lxml too, and our customers are using Windows. Maybe I should provide an online interface for this Python validator where our customers can validate their XML against our definition. If I do this, it will be another post.

GHajba
 

Senior developer, consultant, author, mentor, apprentice. I love to share my knowledge and insights what I achieve through my daily work which is not trivial -- at least not for me.
