Loading DTDs using DOM in Python

I use Python's xml.dom.minidom to process XML, but I'm a bit of a neophyte. It's a really excellent approach for generating Fortran interfaces from our XML interface specifications, but one thing's pretty inconvenient: entities declared in an external DTD don't get resolved, and we use a lot of entities.

In fact, it's worse than that. Entities just disappear in the processed DOM tree.

Given times.dtd
<!ENTITY times "&#215;">
and times.xml
<?xml version="1.0"?>
<!DOCTYPE times SYSTEM "times.dtd">
and then times.py
import sys, xml.dom.minidom
you get
> python times.py
<?xml version="1.0" ?><!DOCTYPE times  SYSTEM 'times.dtd'><maths>
Belgium! <mo>&times;</mo> has turned into <mo/>: the entity reference has been dropped without so much as a warning.
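The disappearing act can be reproduced without any files on disk. What follows is a minimal sketch rather than the original times.py: the inline document and the variable names are assumptions made for illustration, and it relies on minidom's default expat-based builder not fetching the external DTD.

```python
import xml.dom.minidom

# Hypothetical stand-in for times.xml: it names an external DTD (never read
# here) and uses an entity that only that DTD would declare.
SOURCE = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE times SYSTEM "times.dtd">'
    '<maths><mo>&times;</mo></maths>'
)

doc = xml.dom.minidom.parseString(SOURCE)
mo = doc.getElementsByTagName("mo")[0]

# The unresolved reference is skipped silently: no exception is raised,
# and the <mo> element ends up with no child nodes at all.
print(len(mo.childNodes))
print(mo.toxml())
```

Checking the number of child nodes like this is one cheap way to notice that an entity has gone missing before the output is written anywhere.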

I find the Python documentation for xml.dom quite daunting. From what I can tell you should be able to configure the whole experience—the parser, the entity resolver, one lump or two...

Until I work out how to do all that, my interim solution is to preprocess with xmllint to expand the entities before calling minidom. Here's times2.py
import os, subprocess, sys, xml.dom.minidom
cmd_fo = open("times_expanded.xml", "w")
fail = subprocess.call("xmllint --loaddtd --noent " +
which results in
> python times2.py
<?xml version="1.0" ?><!DOCTYPE times  SYSTEM 'times.dtd'><maths>
Not very elegant though, is it?
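A slightly tidier variant of the same idea skips the intermediate file and hands xmllint's standard output straight to minidom. This is only a sketch under assumptions: the function names `xmllint_command` and `parse_with_entities` are made up for illustration, xmllint is assumed to be on the PATH, and `subprocess.run` needs Python 3.5 or later. The `--loaddtd` and `--noent` flags are the ones used above.

```python
import subprocess
import xml.dom.minidom


def xmllint_command(xml_path):
    """Build the xmllint call as a list, avoiding shell string concatenation."""
    # --loaddtd fetches the external DTD; --noent substitutes entity references.
    return ["xmllint", "--loaddtd", "--noent", xml_path]


def parse_with_entities(xml_path):
    """Run xmllint on xml_path and parse its standard output with minidom."""
    # check=True raises CalledProcessError if xmllint exits with a non-zero status.
    result = subprocess.run(
        xmllint_command(xml_path),
        stdout=subprocess.PIPE,
        check=True,
    )
    return xml.dom.minidom.parseString(result.stdout)
```

Calling `parse_with_entities("times.xml")` should then give a DOM tree in which the entities have already been expanded, with no temporary file left behind.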

Good news, everyone: there's an alternative solution over at Stack Overflow, which is to use lxml instead of xml.dom.minidom. It's still not perfect, though: lxml isn't part of the standard Python distribution, so I had to use my package manager to install python-lxml.

This time with a times3.py
import sys
from lxml import etree

parser = etree.XMLParser(load_dtd=True)
doc_DOM = etree.parse("times.xml", parser=parser)
# etree.tostring() returns ASCII bytes by default; decode before appending a str
sys.stdout.write(etree.tostring(doc_DOM).decode("ascii") + '\n')
we get
> python times3.py
I think for the time being I'll stick with xmllint plus xml.dom.minidom, for greater portability.


  1. Hi Mat,
    I think you can swap in a more conforming XML parser, something like

    import sys, xml.dom.minidom, xml.sax

  2. That looks good. More esoterica from the Python doc though:

    "xml.sax.make_parser([parser_list]): Create and return a SAX XMLReader object. The first parser found will be used."

    Found from where?!?
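On the "found from where" question: as far as I can tell, in CPython `make_parser` first tries the module names passed in `parser_list` and then falls back to a built-in default list whose only entry is `xml.sax.expatreader` (the `PY_SAX_PARSER` environment variable appears to be able to override that default). A minimal sketch to check which driver you actually get, with `driver` being a name chosen for illustration:

```python
import xml.sax

# No parser_list is given, so the module falls back to its default list.
parser = xml.sax.make_parser()

# Report which driver module supplied the XMLReader object.
driver = type(parser).__module__
print(driver)
```

On a stock CPython install with no overriding environment variable, this should report the expat-based reader.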

