Tuesday, 15 February 2011

Calling NAG routines from IDL

The collection of numerical algorithms contained in the NAG Library has been a valuable and widely-used resource for many years, and it's something that we're continually updating and maintaining lest the expression "has been" take on an uncomfortably literal meaning. To this end, we're - for example - developing and testing new routines for the next Mark of the Library, have just released a new version of the Library for the .NET environment, and are continuing to try to make our algorithms accessible from as many languages and packages as possible. Besides .NET, these include MATLAB®, Excel, Maple, Java, Python and many others (some of which are listed on this page). In spite of our best efforts in this area, however, there will always be environments that we've yet to cover fully.

Fortunately, our users - besides possessing a commendably high level of discernment in their selection of numerical libraries - are talented and resourceful individuals. For example, we've recently been in contact with a group at Culham Centre for Fusion Energy, which is the UK's national fusion research laboratory. They're users of the popular IDL programming language and visualization system, and were keen to enhance IDL's functionality by calling NAG Library routines from within that environment. They found that they were able to do this by writing a C wrapper around each Fortran routine call, and then passing the definition of the wrapper routine to IDL via its Dynamically Loadable Module (DLM) packaging method. We highlight their work - which they've been kind enough to share with us - using the E02BEF routine as an example; this calculates the cubic spline approximation to a set of data points (and has previously formed the basis of one of the demos of the NAG Toolbox for MATLAB®).

We first give the wrapper function, which checks its input parameters and then calls the NAG routine. The function prototypes for all the routines in the NAG Fortran Library are contained in the header file nagmk22.h; this is part of the Fortran Library distribution, and makes it easier to call any of the routines from within C or C++. We note in passing that the wrapper is only required because we're using the NAG Fortran Library. IDL is written in C, so DLMs must also be written in C; if the NAG C Library were being used instead, this code would be simpler.
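For reference, the declaration that nagmk22.h supplies for E02BEF will look something along these lines. This is only an illustrative sketch - the exact integer types, const qualifiers and the placement of the hidden string-length argument may differ in the real header:

```c
/* Illustrative sketch only -- the real declaration lives in nagmk22.h.
   Fortran passes every argument by reference, and the trailing int
   carries the hidden length of the character argument START. */
extern void e02bef_(char *start, int *np, double *x, double *y,
                    double *w, double *s, int *nest, int *n,
                    double *lambda, double *c, double *fp,
                    double *wrk, int *lwrk, int *iwrk,
                    int *ifail, int length_start);
```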

void IDL_CDECL idl_e02bef(int argc, IDL_VPTR argv[], char *argk){
// IDL wrapper function for NAG routine E02BEF.
// Arguments: 
//  0 start  char*   input  "C" or "W": cold or warm start
//  1 NP     int     input  number of data points
//  2 x      double* input  independent (abscissa) values      
//  3 y      double* input  dependent (ordinate) values
//  4 w      double* input  weights
//  5 s      double  input  smoothing factor 
//  6 nest   int     input  over-estimate of the no. of knots
//  7 n      int     input  prior no. of knots for a warm start
//                   output total number of spline knots      
//  8 lambda double* input  prior knot positions
//                   output knot positions
//  9 c      double* output spline coefficients
// 10 fp     double  output sum of square weighted residuals 
// 11 wrk    double* input  prior work array for a warm start
//                   output work array
// 12 lwrk   int     input  size of work array: 4*NP+16*nest+41
// 13 iwrk   int*    input  prior integer work array (warm start)
//                   output integer work array
// 14 ifail  int     input  must be set to 0, -1 or 1
//                   output 0 unless an error occurred
//      call E02BEF(start, NP, x, y, w, s, nest, n, 
//                  lambda, c, fp, wrk, lwrk, iwrk, ifail) 
   char start[2];
   int i, NP, nest, lwrk, n, *iwrk, ifail, length_start;
   double s, fp, *x, *y, *w, *lambda, *c, *wrk;
// Import all arguments, and check their values.
// First, the input parameters only.      
   strncpy(start, IDL_STRING_STR(&argv[0]->value.str), 1);
   start[1] = '\0';
   if(strlen(IDL_STRING_STR(&argv[0]->value.str)) != 1) 
      fprintf(stderr,"Start String Length Incorrect\n");

   NP = (int)IDL_LongScalar(argv[1]); 
   x = (double *)argv[2]->value.arr->data;   
   if(argv[2]->value.arr->n_elts < NP) 
      fprintf(stderr,"X Array Too Small\n");
   y = (double *)argv[3]->value.arr->data;   
   if(argv[3]->value.arr->n_elts < NP) 
      fprintf(stderr,"Y Array Too Small\n");  
   w = (double *)argv[4]->value.arr->data;   
   if(argv[4]->value.arr->n_elts < NP) 
      fprintf(stderr,"W Array Too Small\n");  
   s = (double)IDL_DoubleScalar(argv[5]); 
   nest = (int)IDL_LongScalar(argv[6]); 
   lwrk = (int)IDL_LongScalar(argv[12]); 
// Second, the input & output parameters.  
   n = (int)IDL_LongScalar(argv[7]);
   lambda = (double *)argv[8]->value.arr->data;   
   if(argv[8]->value.arr->n_elts < nest) 
      fprintf(stderr,"Lambda Array Too Small\n");
   c = (double *)argv[9]->value.arr->data;
   if(argv[9]->value.arr->n_elts < nest) 
      fprintf(stderr,"C Array Too Small\n");
   fp = (double)IDL_DoubleScalar(argv[10]);
   wrk = (double *)argv[11]->value.arr->data; 
   if(argv[11]->value.arr->n_elts < lwrk) 
      fprintf(stderr,"Work Array Too Small\n");
   iwrk = (int *)argv[13]->value.arr->data;
   if(argv[13]->value.arr->n_elts < nest) 
      fprintf(stderr,"IWork Array Too Small\n");
   ifail = (int)IDL_LongScalar(argv[14]);        

// Call the NAG routine.  It was written in Fortran; here, we 
// assume we're running under Linux or Unix, where most 
// compilers convert the name to lower case and prepend an 
// underscore.  The function prototype is defined in nagmk22.h 
// (above).
// Because one of the parameters is a character string (and 
// we're calling Fortran from C), we also have to pass its 
// length.  Under this calling convention, this goes at the end 
// of the argument list.
   length_start = strlen(start); 
   e02bef_(start, &NP, x, y, w, &s, &nest, &n, lambda, c, &fp, 
      wrk, &lwrk, iwrk, &ifail, length_start); 

// Check for unsuccessful call of NAG routine (see documentation)
// and output warning if necessary.
   if(ifail != 0) 
      fprintf(stderr, 
         "Warning: e02bef returned with ifail = %d\n", ifail);
// Return the scalar data to IDL (we don't have to explicitly 
// return the arrays, as they're automatically passed by 
// pointer/reference).
   IDL_StoreScalar(argv[7],  IDL_TYP_LONG,   (IDL_ALLTYPES*)&n);
   IDL_StoreScalar(argv[10], IDL_TYP_DOUBLE, (IDL_ALLTYPES*)&fp);
   IDL_StoreScalar(argv[14], IDL_TYP_LONG, 
      (IDL_ALLTYPES*)&ifail);
}

Each NAG routine that we want to use would require a separate wrapper, constructed along similar lines to those illustrated in this example.

Here's the rest of the code (which should precede the specification of the wrapper function in our file - which could be called, say, idlnag.c). Along with the NAG function prototypes, we also include idl_export.h, which defines the IDL API. Next, after defining a useful macro and the prototype for the wrapper function, we set up the DLM by first defining a structure idl_nagprocedures that will be used to pass information about our routine to IDL, and then using that in the specification of a function called IDL_Load. Every DLM must have this function, which is called when the DLM is first loaded; it calls IDL_SysRtnAdd to load the routine definition, and then defines the exit handler (which does nothing in our case).

#include <stdio.h>
#include <string.h>
#include "idl_export.h" // IDL API Header
#include "nagmk22.h"    // NAG C/Fortran Prototypes

// Useful macro.  
#define ARRLEN(arr) (sizeof(arr)/sizeof(arr[0]))

// Procedure prototypes - one for each routine in this module.
extern void IDL_CDECL idl_e02bef(int argc, 
   IDL_VPTR argv[], char *argk);

// Structure used to pass information about our routine to IDL.  
// Other routines would require similar entries in this table.
static IDL_SYSFUN_DEF2 idlnag_procedures[] = {
   {(IDL_FUN_RET) idl_e02bef, // Pointer to the wrapper routine.
   "IDL_E02BEF",              // Name of the IDL procedure.
   15,            // Min no. of arguments for routine.
   15,            // Max no. of arguments for routine.
   0,             // Further options which we don't use.
   0}             // Reserved for IDL use - always set to 0.
};

// This function is called when IDL is shut down.
void idlnag_exit_handler(void) {
   // Nothing to do.
}

int IDL_Load(void) {
// This function is called when IDL loads this module.  

// Call IDL_SysRtnAdd to register the routine contained in this 
// module.  Note that this routine is what IDL calls a 
// procedure - i.e. it doesn't return a value (cf. what IDL 
// calls a function, which does).  This is specified by setting 
// the second parameter in the call to FALSE (it's TRUE for 
// functions). 
   if (!IDL_SysRtnAdd(idlnag_procedures, FALSE, 
      ARRLEN(idlnag_procedures))) return IDL_FALSE;

// Register the exit handler, and report success.
   IDL_ExitRegister(idlnag_exit_handler);
   return IDL_TRUE;
}


idlnag.c must then be compiled and linked into a shared library (called idlnag.so on Unix) which should be placed in a location on the directory path specified by the IDL_DLM_PATH environment variable. The other file that must be in that location is the so-called DLM module description file. In our example, its contents would be:

MODULE idlnag                    # Name of the DLM
DESCRIPTION IDL Interface to NAG # Optional one-line description
VERSION 1.0                      # Optional version number
PROCEDURE IDL_E02BEF 15 15       # Routine name, min, max no. of args

Here, the name of the DLM must be the same as the name of the shared library we've created, whilst the routine name and the minimum and maximum numbers of arguments must be the same as those specified in the IDL_SYSFUN_DEF2 structure - i.e. idlnag_procedures in our case.

Once the new procedure has been defined in this fashion, it will be loaded the next time that IDL is started, and it can be invoked within IDL as

IDL_E02BEF, start, NP, x, y, w, s, nest, n, lambda, c, fp, $
            wrk, lwrk, iwrk, ifail

where start is an IDL variable of type string, NP is a long, x, y and w are double arrays of size NP, s is a double, nest and n are longs, lambda and c are double arrays of size nest, fp is a double, lwrk is a long, wrk is a double array of size lwrk, iwrk is a long array of size nest and ifail is a long. The next step would be to call E02BBF, which evaluates the spline curve using the parameters calculated by E02BEF - specifically, the knot positions lambda(nest), and the spline coefficients c(nest) - see this demo for an illustration (albeit using a different package).

A more complete description of how to call external routines from within IDL is available here. That account also illustrates the use of IDL keywords to modify the behaviour of the function; their use has been eschewed in this example in order to retain the exact correspondence between the argument list for the NAG Fortran Library routine and the way in which it's called in IDL. We've also opted for a simple error-handling mechanism in the wrapper; alternatives which are better-integrated with IDL are possible. Finally, we note that more information on IDL DLMs is available from the IDL documentation, which can be found in various locations (such as here, for example).

Wednesday, 9 February 2011

Wandering Precision

If you call the same numerical library routine twice with exactly the same input data, you expect to get identical results - don't you?

Well, it all depends. If you're calling a routine that somehow uses random numbers, for example a solver based on a Monte Carlo method, then the results might be different - but then you'd probably accept that (assuming that you'd been warned in the documentation).

But what if you're calling a numerical routine that in theory contains no randomness, but you still get slightly different results? And what if those slightly different results cause you problems - for example if you depend on repeatability for testing purposes? Then you might waste a lot of time hunting for the cause.

Well, this problem has turned up more than once recently for us at NAG, and in the cases we've looked at it has turned out to be a problem with the way the machine does its arithmetic - or rather, in the way that the compiler tells the machine to do its arithmetic.

Let me give you an example. Suppose you want to compute the inner product of two vectors x and y, each of length n. Simple - you just multiply each element of x by the corresponding element of y, then add all the products together to get the answer. No apparent randomness there. What could go wrong?

Well, if the hardware you are running on is equipped with Streaming SIMD Extension (SSE) instructions - and almost all modern hardware is - here's what.

SSE instructions enable low-level parallelism of floating-point arithmetic operations. For example, you can hold four single precision numbers at the same time in a 128-bit register, and operate on them all at the same time. This leads to massive time savings when working on large amounts of data.

But this may come at a price. Efficient use of SSE instructions can sometimes depend on exactly how the memory used to store vectors x and y is aligned. If it's aligned nicely - by which I mean, in the inner product example, that the addresses of the first elements of the arrays x and y are multiples of 16 bytes - then that's good. The hardware can efficiently move numbers from memory into registers to work on them, using instructions that depend on that alignment. So for our inner product, with a good optimizing compiler, we'd load numbers four at a time, multiply them together four at a time, and accumulate the results as we go along into our final result.

But if the memory is not nicely aligned - and there's a good chance it may not be - the compiler needs to generate a different code path to deal with the situation. Here the result will take longer to get because the numbers have to be accumulated one at a time. At run time, the code checks whether it can take the fast path or not, and works appropriately.

The problem is that by messing with the order of the accumulations, you are quite possibly changing the final result, simply due to rounding differences when working with finite precision computer arithmetic.

Instead of getting

  s = x1*y1 + x2*y2 + x3*y3 + ...
you get
  s = (x1*y1 + x5*y5 + x9*y9 + x13*y13) +
      (x2*y2 + x6*y6 + x10*y10 + x14*y14) + ...

Chances are that the result will be just as accurate either way - but it's different by a tiny amount. And if that tiny difference leads to a different decision made by the code that called the inner product routine, the difference can be magnified.

In some cases, for example when solving a mathematical optimization problem using a local minimizer, the final result can be completely different - and yet still valid - because a local minimizer might converge to any one of several different minimum points, if the function does not have a unique minimum.

And all because of how a piece of memory happens to be aligned. If you allocated the memory yourself you might be able to do something about it, but on the other hand the alignment might be completely outside your control. NAG users have reported cases where running a program from the command line would give consistently different results from running the same program by clicking on its name in Windows Explorer.

Is there anything we can do about problems like this? The fact is that NAG users want their code to run fast, so using SSE instructions makes sense. We can lobby the compiler writers to ask them to be careful how they optimize the code they produce, but they won't necessarily take notice. Compiler writers could also help by making their memory allocation routines only return 16-byte aligned memory, specifically to help with the use of SSE instructions. In the past, though, I had no success trying to convince the gfortran compiler folks to do that. In any case, even if they did, it wouldn't always help - if the first element of a single or double precision array is aligned on a 16 byte boundary, the second element will definitely not be, so if you want to operate on a vector starting at that location you've no chance.

We could choose not to use SSE instructions at all. But, apart from efficiency reasons, the "legacy" extended precision 80-bit register arithmetic which started in the 1980s with the 8087 numeric co-processor had its own "wobbling precision" problems for numerical code.

As new hardware with AVX instructions and 256 bit registers comes along, even more numerical work can be done in parallel. So - it seems that for the foreseeable future we're just going to have to live with this, and try to minimize the effects on NAG users by means of documentation.