Monday, 15 February 2010

Loading DTDs using DOM in Python

I use Python's xml.dom.minidom to process XML, but I'm a bit of a neophyte. I find it a really excellent approach for generating Fortran interfaces from our XML interface-specifications, but one thing's pretty inconvenient: entities don't get resolved, and we use a lot of entities.

In fact, it's worse than that. Entities just disappear in the processed DOM tree.

Given times.dtd
<!ENTITY times "&#215;">
and times.xml
<?xml version="1.0"?>
<!DOCTYPE times SYSTEM "times.dtd">
<maths>
  <mn>2</mn>
  <mo>&times;</mo>
  <mn>3</mn>
  <mo>=</mo>
  <mn>6</mn>
</maths>
and then times.py
import sys, xml.dom.minidom
sys.stdout.write(xml.dom.minidom.parse("times.xml").toxml())
you get
> python times.py
<?xml version="1.0" ?><!DOCTYPE times  SYSTEM 'times.dtd'><maths>
  <mn>2</mn>
  <mo/>
  <mn>3</mn>
  <mo>=</mo>
  <mn>6</mn>
</maths>
Belgium! <mo>&times;</mo> has turned into <mo/>.
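As far as I can tell, the entity vanishes because minidom's default parser never reads the external times.dtd, so &times; is still undeclared when the parser reaches it. A minimal sketch to check that diagnosis, assuming the declaration is moved into the document's internal DTD subset (the inline document below is a cut-down stand-in for times.xml, not the real file):

```python
import xml.dom.minidom

# Same markup as times.xml, but with the entity declared in the internal
# DTD subset instead of an external file (an assumption made for this sketch).
DOC = """<?xml version="1.0"?>
<!DOCTYPE maths [
  <!ENTITY times "&#215;">
]>
<maths>
  <mn>2</mn>
  <mo>&times;</mo>
  <mn>3</mn>
</maths>
"""

dom = xml.dom.minidom.parseString(DOC)
mo = dom.getElementsByTagName("mo")[0]
# Join the text children in case the parser splits the text into several nodes.
mo_text = "".join(node.data for node in mo.childNodes)
```

If mo_text comes back as the multiplication sign, the parser expands entities it has seen declared, and the real problem is only that the external DTD is never loaded.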

I find the Python documentation for xml.dom quite daunting. From what I can tell you should be able to configure the whole experience—the parser, the entity resolver, one lump or two...

Until I work out how to do all that, my interim solution is to preprocess with xmllint to expand the entities before calling minidom. Here's times2.py
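import subprocess, sys, xml.dom.minidom
# Expand entities with xmllint, writing the result to times_expanded.xml
with open("times_expanded.xml", "w") as cmd_fo:
    fail = subprocess.call(["xmllint", "--loaddtd", "--noent", "times.xml"],
                           stdout=cmd_fo)
# Stop here if xmllint reported an error, rather than parsing a bad file
if fail:
    sys.exit("xmllint failed with exit status %d" % fail)
# Parse the expanded copy; no entity references are left for minidom to drop
sys.stdout.write(xml.dom.minidom.parse("times_expanded.xml").toxml())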
which results in
> python times2.py
<?xml version="1.0" ?><!DOCTYPE times  SYSTEM 'times.dtd'><maths>
  <mn>2</mn>
  <mo>×</mo>
  <mn>3</mn>
  <mo>=</mo>
  <mn>6</mn>
</maths>
Not very elegant though, is it?
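A possibly tidier variant, assuming xmllint is on the PATH: capture its output in memory and hand it straight to minidom, so no intermediate file is written. The helper name parse_expanded is made up for this sketch; the xmllint flags are the same two used in times2.py.

```python
import subprocess
import xml.dom.minidom


def parse_expanded(path):
    """Parse `path` with minidom after xmllint has expanded its entities."""
    # check_output raises CalledProcessError if xmllint exits non-zero,
    # so a failed expansion cannot be parsed by accident.
    expanded = subprocess.check_output(
        ["xmllint", "--loaddtd", "--noent", path])
    # parseString accepts the raw bytes that xmllint wrote to stdout.
    return xml.dom.minidom.parseString(expanded)
```

Calling parse_expanded("times.xml").toxml() should then give the same document as times2.py, minus the temporary file.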

Good news everyone: there's an alternative solution over at Stack Overflow, which is to use lxml instead of xml.dom.minidom. It's still not perfect, though: lxml doesn't come with the standard Python distribution, so I had to use my package manager to install python-lxml.

This time with a times3.py
import sys
from lxml import etree

parser = etree.XMLParser(load_dtd=True)
doc_DOM = etree.parse("times.xml", parser=parser)
# tostring() returns bytes on Python 3, so decode before appending the newline
sys.stdout.write(etree.tostring(doc_DOM).decode("ascii") + '\n')
we get
> python times3.py
<maths>
  <mn>2</mn>
  <mo>&#215;</mo>
  <mn>3</mn>
  <mo>=</mo>
  <mn>6</mn>
</maths>
I think for the time being I'll stick with xmllint plus xml.dom.minidom, for greater portability.
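A standard-library-only route may also be worth trying: give minidom a SAX parser that has external general entities switched on. This is only a sketch under a few assumptions. The helper name parse_with_external_entities is made up; the explicit setFeature call is needed on Python 3.7.1 and later, where this feature is off by default; and a relative SYSTEM identifier such as "times.dtd" appears to be looked up relative to the current working directory on this code path, so the script should be run from the directory that holds the DTD.

```python
import xml.dom.minidom
import xml.sax
from xml.sax.handler import feature_external_ges


def parse_with_external_entities(path):
    """Parse `path` with minidom, letting the SAX parser read the external DTD."""
    parser = xml.sax.make_parser()
    # Allow the parser to fetch external entities, which includes the
    # external DTD subset where &times; is declared.
    parser.setFeature(feature_external_ges, True)
    # Passing a parser makes minidom build the tree from SAX events.
    return xml.dom.minidom.parse(path, parser)
```

Usage would be sys.stdout.write(parse_with_external_entities("times.xml").toxml()). Switching this feature on lets a document pull in arbitrary external files, so it is probably only sensible for XML you trust.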

2 comments:

  1. Hi Mat,
    I think you can swap in a more conforming XML parser, something like


    import sys, xml.dom.minidom, xml.sax
    parser=xml.sax.make_parser()
    sys.stdout.write(
    xml.dom.minidom.parse("times.xml",parser).toxml())

  2. That looks good. More esoterica from the Python doc though:

    "xml.sax.make_parser([parser_list]): Create and return a SAX XMLReader object. The first parser found will be used."

    Found from where?!?
