In fact, it's worse than that. Entities just disappear in the processed
Given times.dtd
<!ENTITY times "×">
and times.xml
<?xml version="1.0"?>
<!DOCTYPE times SYSTEM "times.dtd">
<maths>
 <mn>2</mn>
 <mo>&times;</mo>
 <mn>3</mn>
 <mo>=</mo>
 <mn>6</mn>
</maths>
and then times.py
import sys, xml.dom.minidom
sys.stdout.write(xml.dom.minidom.parse("times.xml").toxml())
you get
> python times.py
<?xml version="1.0" ?><!DOCTYPE times SYSTEM 'times.dtd'><maths> <mn>2</mn> <mo/> <mn>3</mn> <mo>=</mo> <mn>6</mn>
Belgium! <mo>&times;</mo> has turned into <mo/>.
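My best guess at the cause is that minidom's default parser never reads the external times.dtd, so it never sees the declaration of the times entity. Here is a minimal sketch to test that guess. It assumes a variant of times.xml in which the entity is declared in the DOCTYPE's internal subset; the TIMES_XML string and the times_text name are mine, for illustration only.

```python
import xml.dom.minidom

# Hypothetical variant of times.xml: the entity declaration lives in the
# internal subset of the DOCTYPE rather than in the external times.dtd.
TIMES_XML = """<?xml version="1.0"?>
<!DOCTYPE times [
  <!ENTITY times "&#215;">
]>
<maths> <mn>2</mn> <mo>&times;</mo> <mn>3</mn> <mo>=</mo> <mn>6</mn> </maths>"""

doc = xml.dom.minidom.parseString(TIMES_XML)

# Look at what ended up inside the <mo> element that held the entity reference.
mo = doc.getElementsByTagName("mo")[0]
times_text = mo.firstChild.data if mo.firstChild else None
print(repr(times_text))
```

If the <mo> element comes back with a text child here, the entity itself is fine and the problem is only that the external DTD is not being loaded.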
I find the Python documentation for xml.dom quite daunting. From what I can tell you should be able to configure the whole experience—the parser, the entity resolver, one lump or two...
Until I work out how to do all that, my intermediate solution is to preprocess using xmllint to expand entities, before calling minidom: here's times2.py
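import os, subprocess, sys, xml.dom.minidom
# Expand entities with xmllint, writing the result to a scratch file.
cmd_fo = open("times_expanded.xml", "w")
# fail holds xmllint's exit status (nonzero means it failed).
fail = subprocess.call("xmllint --loaddtd --noent " +
                       "times.xml",
                       shell=True,
                       stdout=cmd_fo,
                       stderr=sys.stderr,
                       close_fds=(os.name=="posix"),
                       universal_newlines=True)
cmd_fo.close()
# Parse the expanded copy, which no longer contains entity references.
sys.stdout.write(xml.dom.minidom.parse("times_expanded.xml").toxml())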
which results in
> python times2.py
<?xml version="1.0" ?><!DOCTYPE times SYSTEM 'times.dtd'><maths> <mn>2</mn> <mo>×</mo> <mn>3</mn> <mo>=</mo> <mn>6</mn>
Not very elegant though, is it?
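A slightly tidier variant of the same idea, I think, is to skip the scratch file and the shell and hand xmllint's output straight to minidom. This is only a sketch: parse_with_entities is a name I made up, and it assumes xmllint is on the PATH and that the DTD sits next to the XML file.

```python
import subprocess
import xml.dom.minidom


def parse_with_entities(path):
    """Expand entities with xmllint, then parse the result with minidom.

    Hypothetical helper; raises CalledProcessError if xmllint fails.
    """
    # --loaddtd fetches the external DTD, --noent substitutes entity references.
    expanded = subprocess.check_output(
        ["xmllint", "--loaddtd", "--noent", path])
    # check_output returns the expanded document as bytes.
    return xml.dom.minidom.parseString(expanded)
```

Calling parse_with_entities("times.xml").toxml() should then stand in for the last line of times2.py. It still depends on xmllint being installed, so it is no more portable than before.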
Good news everyone: there's an alternate solution over at Stack Overflow, but it's still not perfect: use lxml instead of xml.dom.minidom. Unfortunately lxml doesn't come with the standard Python distribution, so I had to use my package manager to install python-lxml.
This time with a times3.py
import sys
from lxml import etree
parser = etree.XMLParser(load_dtd=True)
doc_DOM = etree.parse("times.xml", parser=parser)
sys.stdout.write(etree.tostring(doc_DOM) + '\n')
we get
> python times3.py
<maths> <mn>2</mn> <mo>×</mo> <mn>3</mn> <mo>=</mo> <mn>6</mn> </maths>
I think for the time being I'll stick with xmllint plus xml.dom.minidom, for greater portability.
Hi Mat,
I think you can swap in a more conforming XML parser, something like
import sys, xml.dom.minidom, xml.sax
parser=xml.sax.make_parser()
sys.stdout.write(
xml.dom.minidom.parse("times.xml",parser).toxml())
That looks good. More esoterica from the Python doc though:
"xml.sax.make_parser([parser_list]): Create and return a SAX XMLReader object. The first parser found will be used."
Found from where?!?
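As far as I can tell from CPython's xml/sax/__init__.py, "found" means found by import. make_parser tries each module name in parser_list, then each name in a module-level default_parser_list, and uses the first module it can import and get a parser from. A small sketch to check which one you actually get; default_parser_list is an undocumented attribute, so treat it as an implementation detail that may differ elsewhere.

```python
import xml.sax

# With no arguments, make_parser() falls back to the module's default list.
parser = xml.sax.make_parser()

# The module that defines the parser's class tells us which driver was picked.
parser_module = type(parser).__module__
print(parser_module)

# Undocumented CPython detail: the fallback list of driver module names.
print(xml.sax.default_parser_list)
```

On a stock CPython install I would expect both lines to name the expat-based driver that ships with the standard library.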