Tuesday, 23 February 2010

Documentation Formats

NAG documentation is available in several formats. For the Fortran Library, the following formats are available:

  • XHTML: XHTML + MathML files, one per routine, with XHTML tables of contents.
  • PDF: individual PDF files, one for each document, with additional HTML navigation pages.
  • Single File PDF: a single PDF file, essentially a repackaging of the individual PDF files.
  • Windows Help File: Windows HTML Help, essentially a compressed archive of HTML+MathML files derived from the XHTML+MathML documentation, together with additional index and navigation controls.

The C Library documentation, which is largely derived from the same source, is available in similar formats.

In addition, documentation for other interfaces is available, also usually generated from the same (XML) source. For example, the NAG Toolbox for MATLAB documentation is available as a MATLAB help file (essentially a jar archive of customised HTML files) and as a collection of PDF files with HTML tables of contents for ease of navigation. The NAG Library for .NET, currently in beta test, provides its documentation in two forms: a “classic” Windows HTML Help file and an MSHelp2 file, which integrates as a help collection into Visual Studio when the product is installed.

Historically, the documentation has been provided in other formats, for example DynaText, and looking forward we hope to make the documentation available using whatever technology provides the right level of stability and typesetting quality for the devices in use. For the future we hope that (especially) the ability to generate richly structured (X)HTML will provide the necessary flexibility. Many of the currently available formats, such as Windows help and formats for e-books, are essentially packaged, compressed HTML files. Using an HTML base rather than, say, PDF sacrifices some typesetting quality, but gains a lot of flexibility, enabling the documents to be read in novel ways: an iPhone being used to browse the (XHTML) NAG documentation, for example.

Even without changing the format used, there are issues in keeping up with technology. Currently, for example, we generate PDF versions of our documents using a relatively old version of Acrobat Distiller. More recent versions of Acrobat use some new language features to make much smaller PDF files, at the price of losing compatibility with older readers. It is hard to know whether support for older PDF readers matters or not (does it?), but the bandwidth savings, and the space saved on restricted media such as CDs, probably mean that we will switch at some point. Another potential change is the upcoming HTML5 revision of HTML. Currently, in order to get the best possible mathematical rendering, we use MathML in our documentation, but this necessitates the use of XHTML rather than HTML, which unfortunately makes things slightly harder to set up in the browser. HTML5 promises to specify the parsing of MathML within HTML documents, so hopefully in future this complication will be gone; the prominence given to MathML by its reference from HTML5 may also encourage more browser makers to add native MathML support.
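To give a flavour of the markup in question, here is a minimal MathML fragment (invented for this post, not lifted from the actual Library documents). At present a page containing it has to be served as XHTML for browsers such as Firefox to render the mathematics natively:
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mn>2</mn><mo>&#215;</mo><mn>3</mn><mo>=</mo><mn>6</mn>
</math>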

So …

  • Do you prefer the single file versions (single file PDF, Windows help) or the multiple file versions (individual PDF or HTML files)?
  • At what point is it acceptable to move to newer PDF or HTML standards?
  • How important is it that documentation for alternative interfaces takes on the style of the documentation in the host environment? (For example, the NAG Toolbox for MATLAB documentation using a similar style to MATLAB, or the NAG Library for .NET documentation using a .NET/MSDN style.)
  • What other features would you like in the product documentation?

Thursday, 18 February 2010

Exascale or personal HPC?

Which is more interesting for HPC watchers - the ambition of exaflops or personal supercomputing? Anyone who answers "personal supercomputing" is probably not being honest (I welcome challenges!). How many people find watching cars on the local road more interesting than F1 racing? Or think local delivery vans more fascinating than the space shuttle? Of course, everyday cars and local delivery vans are more important for most people than F1 and the space shuttle. And so personal supercomputing is more important than exaflops for most people.

High performance computing at an individual or small group scale directly impacts a far broader set of researchers and business users than exaflops will (at least for the next decade or two). Of course, in the same way that F1 and the shuttle pioneer technologies that improve cars and other everyday products, so the exaflops ambition (and the petaflops race before it) will pioneer technologies that make individual scale HPC better.

One potential benefit to widespread technical computing that some are hoping for is an evolution in programming. It is almost certain that the software challenges of an exaflops supercomputer, with a complex distributed processing and memory hierarchy demanding billion-way concurrency, will be the critical factor in success, and thus that tools and language evolutions will be developed to help with the task.

Languages might be extended (more likely than new languages) to help express parallelism better. Better may mean easier or with assured correctness rather than higher performance. Language implementations might evolve to better support robustness in the face of potential errors. Successful exascale applications might expect to make much greater use of solver and utility libraries optimized for specific supercomputers. Indeed one outlying idea is that libraries might evolve to become part of the computer system rather than part of the application. Developments like these should also help to make the task of programming personal scale high performance computing much easier, reducing the expertise required to get acceptable performance from a system using tens of cores or GPUs.

Of course, while we wait for the exascale benefits to trickle down, getting applications to achieve reasonable performance across many cores still requires specialist skills.

Monday, 15 February 2010

Loading DTDs using DOM in Python

I use Python's xml.dom.minidom to process XML, but I'm a bit of a neophyte. I find it a really excellent approach for generating Fortran interfaces from our XML interface-specifications, but one thing's pretty inconvenient: entities don't get resolved, and we use a lot of entities.

In fact, it's worse than that. Entities just disappear in the processed DOM tree.

Given times.dtd
<!ENTITY times "&#215;">
and times.xml
<?xml version="1.0"?>
<!DOCTYPE times SYSTEM "times.dtd">
<maths>
  <mn>2</mn>
  <mo>&times;</mo>
  <mn>3</mn>
  <mo>=</mo>
  <mn>6</mn>
</maths>
and then times.py
import sys, xml.dom.minidom
sys.stdout.write(xml.dom.minidom.parse("times.xml").toxml())
you get
> python times.py
<?xml version="1.0" ?><!DOCTYPE times  SYSTEM 'times.dtd'><maths>
  <mn>2</mn>
  <mo/>
  <mn>3</mn>
  <mo>=</mo>
  <mn>6</mn>
</maths>
Belgium! <mo>&times;</mo> has turned into <mo/>.

I find the Python documentation for xml.dom quite daunting. From what I can tell, you should be able to configure the whole experience—the parser, the entity resolver, one lump or two...
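From squinting at those docs, the most promising lead is that xml.dom.minidom.parse accepts a SAX parser as its second argument, and that the expat-based SAX reader has a feature switch for processing external general entities. A sketch of what I mean (completely untested by me, so treat it with suspicion) might look like this:
import sys, xml.sax, xml.sax.handler, xml.dom.minidom

# Ask the expat-based SAX reader to process external general
# entities, which should make it load times.dtd and expand &times;.
parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_external_ges, True)

# minidom.parse accepts an optional SAX parser to drive the build.
doc = xml.dom.minidom.parse("times.xml", parser)
sys.stdout.write(doc.toxml("utf-8"))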

Until I work out how to do all that, my interim solution is to preprocess using xmllint to expand entities before calling minidom: here's times2.py
import os, subprocess, sys, xml.dom.minidom

# Use xmllint to load the DTD (--loaddtd) and substitute entity
# references with their values (--noent), writing the expanded
# document to times_expanded.xml.
cmd_fo = open("times_expanded.xml", "w")
fail = subprocess.call("xmllint --loaddtd --noent " +
                       "times.xml",
                       shell=True,
                       stdout=cmd_fo,
                       stderr=sys.stderr,
                       close_fds=(os.name=="posix"),
                       universal_newlines=True)
cmd_fo.close()

# The expanded document contains no entity references for
# minidom to lose.
sys.stdout.write(xml.dom.minidom.parse("times_expanded.xml").toxml())
which results in
> python times2.py
<?xml version="1.0" ?><!DOCTYPE times  SYSTEM 'times.dtd'><maths>
  <mn>2</mn>
  <mo>×</mo>
  <mn>3</mn>
  <mo>=</mo>
  <mn>6</mn>
</maths>
Not very elegant though, is it?

Good news everyone: there's an alternative solution over at Stack Overflow, though it's still not perfect. Use lxml instead of xml.dom.minidom. Unfortunately lxml doesn't come with the standard Python distribution, so I had to use my package manager to install python-lxml.

This time with a times3.py
import sys
from lxml import etree

# Ask lxml to load the external DTD so that the entity
# definitions are available while parsing.
parser = etree.XMLParser(load_dtd=True)
doc_DOM = etree.parse("times.xml", parser=parser)
sys.stdout.write(etree.tostring(doc_DOM) + '\n')
we get
> python times3.py
<maths>
  <mn>2</mn>
  <mo>&#215;</mo>
  <mn>3</mn>
  <mo>=</mo>
  <mn>6</mn>
</maths>
I think for the time being I'll stick with xmllint plus xml.dom.minidom, for greater portability.

Wednesday, 10 February 2010

If you could add or change one thing about the NAG Library?

2010 is an important year for NAG. We’ll be celebrating our 40th anniversary, which is quite an achievement in the software world. I’ve been able to find a few other software organisations that have also reached 40 – Software AG, Intergraph and Cincom (if you know of any others, please let me know.)

An anniversary is a great time to reflect on all that’s gone by, honour achievements and remember important milestones, but of equal importance, if not more so, it’s a time to look to the future and plan ahead. The NAG Library – the mainstay of NAG’s product portfolio – reaches the dizzy heights of 40 in October 2011. The Library has changed a lot over the years. It was originally a collection of 90 or so routines, written in both Algol 60 and Fortran, targeted at ICL mainframes. It now features over 1,400 routines and is available on a multitude of platforms, operating systems and programming environments. How will it change over the next 40 years?

So here’s another question for you, assuming you use or have used the NAG Library: “If you could add or change one thing about the Library, what would it be?”

Thursday, 4 February 2010

Don't call it High Performance Computing?

Having just signed up for twitter (HPCnotes), I've realised that the space I previously had to get my point across was nothing short of luxurious (e.g. my ZDNet columns). It's like the traditional challenge of the elevator pitch - can you make your point about High Performance Computing (HPC) in the 140 character limit of a tweet? It might even be a challenge to state what HPC is in 140 characters. Can we sum up our profession that simply? To a non-HPC person?

The inspired John West of InsideHPC fame wrote about the need to explain HPC some time ago in HPCwire. It's not an abstract problem. As multicore processors (whether CPUs or GPUs) become the default for scientific computing, the parallel programming technologies and methods of HPC are becoming important for all numerical computing users - even if they don't identify themselves as HPC users. In turn, of course, HPC benefits in sustainability and usability from the mass market use of parallel programming skills and technologies.

I'll try to put it in 140 characters (less space for a link): Multicore CPUs promise extra performance but software must be optimised to take advantage. HPC methods can help.

It's not good - can you say it better? Add a comment to this blog post to try ...

For those of you finding this blog post from the short catch line above, hoping to find the answer to how HPC methods can help - well that's what my future posts and those of my colleagues here will address.

Wednesday, 3 February 2010

Make considered painful

NAG uses a build system based on make and makefiles. When we started doing things this way not enough years ago, we read and took the advice in Peter Miller's Recursive Make Considered Harmful to heart. As a result you can end up with a huge monolithic makefile. You can of course break this down a bit by using include makefiles, so you could, potentially, have a tree of makefiles included inside one another. We don't do much of that; we do the important thing of separating all the implementation-specific stuff into a single include file which then defines the build, but the rest is still pretty monolithic. There are obvious disadvantages in maintaining such a beast, but there are advantages too:
  • global search and replace;
  • don't have to find out which makefile does what.
I'm aware that make is a fairly old technology now, although we did standardise on GNU make, which has added quite a number of useful features over the years. However, I don't think it handles things like Fortran module dependencies very well (there's a sketch of what I mean at the end of this post), so I occasionally go on the lookout for make replacements. In recent times I had a look at Cons and SCons as possible replacements, but was made nervous by certain things:
  • converting our very large makefiles would take some effort;
  • some reports that SCons runs very slowly for big builds;
  • having to build up expertise in SCons among our developers, replacing existing expertise with makefiles.
I've got a feeling, though, that life with make is going to get more painful as time goes on.
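As for the module grumble: make knows nothing about Fortran USE statements, so the ordering between a module's .mod file and its users has to be spelled out by hand (or generated by an external dependency scanner). A hypothetical fragment, with invented file names, looks something like this:
# solver.f90 USEs module_a, so it cannot compile until
# module_a.mod exists; make has to be told so explicitly.
solver.o: solver.f90 module_a.mod
	$(FC) $(FFLAGS) -c solver.f90

# Compiling module_a.f90 produces both the object and the .mod file.
module_a.o module_a.mod: module_a.f90
	$(FC) $(FFLAGS) -c module_a.f90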

Tuesday, 2 February 2010

How to get more done tomorrow – part 1

My name is Rob Meyer and I've got too much to do and not enough time to do it all. In fact, if I stopped going to meetings today and quit reading my mail, I could spend the next six months on my "to dos" and not be finished. Does this describe you? Wasn't software supposed to fix all this?

Just about everybody I know in the industry - whether at NAG, our customers or others I talk to on a daily basis - struggles to keep up. Is there a tool we can use to be more productive and get a greater sense of accomplishment at the end of each day? The answer is yes, but it's (mostly) not computer software, it's in that other software between your ears.

For all the modern hardware and software we have today, the challenge of getting and staying organized and productive is getting worse, not better. Our daily weather forecast calls for a blizzard of data with occasional intervals of information and insight. Over the years I've read a number of books, tried various software packages and other approaches but one of the more recent ones I've read has proven very useful though not a panacea. It is "Getting Things Done" by David Allen, and no, you can't borrow mine because I pull it out every few weeks to reread a few pages and strengthen my resolve.

I can't begin to tell you everything worthwhile in this book but I can give you a couple of ideas that you can start using tomorrow. One of the greatest impediments to getting and staying organized and productive is a poorly written "to do". For example, if I diligently write on my action list "Get personal finances organized" I will likely be looking at this entry every week for the next six months without any discernible progress, because it's not at all clear what I should actually "do" to "get personal finances organized" or how I will be able to tell when it's done.

Actually this is a useful description of a project so hold onto the concept. A better description of an action might be "balance checkbook" which has a reasonably well defined process associated with it and a definable conclusion (my balance equals the bank's balance and all debits and credits are accounted for).

To get more done tomorrow take a look at your list of actions to do (surely you have one written down?) and rewrite the ones where you can't articulate quickly the measure of being "done". Don't lose the thing you replace because it might be the overall goal of a larger project. Now, when you actually complete your reformulated action "balance checkbook" two things should happen: a sense of accomplishment for getting something done, quickly followed by the critical question, "now that's done, what is the next action?" And on you go until the big project (get personal finances organized) is completed.

The absolute, iron-clad rule to all of this is to write down an "actionable" next step. Well, not quite iron-clad. Next time I'll talk about the 2-minute rule.

Monday, 1 February 2010

Colorised svn diffs

With the help of colordiff my svn diffing just got a whole lot prettier:
% svn diff --diff-cmd colordiff -x -w files |\
    less -RS
(I like to pass -w to the diff-er to ignore whitespace changes.) Piping through less in raw mode (-r/-R) lets the colour escape sequences through to the terminal so that they display correctly.
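To save typing the --diff-cmd switch every time, it should also be possible to make colordiff the default by editing the [helpers] section of ~/.subversion/config (I haven't tried living with this yet):
[helpers]
diff-cmd = colordiff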

Thanks to commandlinefu for the idea.