## Wednesday, 23 June 2010

### My first experiences calling the NAG Library for SMP and Multicore

A quick introduction: I work for NAG as a Technical Support Engineer in Germany. I look after clients in Germany, Switzerland and Austria and am interested in approximation and interpolation methods as well as .NET languages. Naturally, I’m a big fan of the NAG Library for .NET as well as the NAG Library for SMP and Multicore.

As a non-Fortran programmer who wanted to run some tests with the new NAG Library for SMP and Multicore, I decided to choose a few simple tests that I could run using both cores of my laptop. I created some .NET console applications: wrappers around the Fortran routines that pass all the arguments to the routines in a .NET fashion, together with a timing function. Many parts of the Library are parallelized, including quadrature, partial differential equations, interpolation, curve and surface fitting, linear algebra, correlation and regression analysis, multivariate methods, random number generators, time series analysis, sorting and special functions. I chose routines from the correlation, curve and surface fitting, and random number generation chapters.

The results were calculated on an Intel® Core™2 Duo Processor P8700 (2.53 GHz, 1066 MHz, 3 MB) machine with Windows Vista as the operating system and Microsoft Visual Studio 2008 (32‐bit project) as the compiler. Each function was used to solve a problem large enough to allow for parallelism. Each test case was run first with one thread and then with two. The number of threads was set using the environment variable ‘OMP_NUM_THREADS’ in the Visual Studio console window.

A timer based on the Windows ‘QueryPerformanceCounter’ function was implemented in all the examples to measure the time each routine needed to calculate its result. Below is the class used in each of the examples. To use it, first construct a HiPerfTimer, then call the methods Start and Stop and read the Duration property:

```csharp
HiPerfTimer pt = new HiPerfTimer();
pt.Start();
// call the NAG routine here
pt.Stop();
Console.WriteLine("Duration: {0:F2} sec", pt.Duration);
```

```csharp
using System.ComponentModel;           // Win32Exception
using System.Runtime.InteropServices;  // DllImport

internal class HiPerfTimer
{
    // Current value of the high-resolution performance counter (in ticks)
    [DllImport("Kernel32.dll")]
    private static extern bool QueryPerformanceCounter(
        out long lpPerformanceCount);

    // Frequency of the performance counter (in ticks per second)
    [DllImport("Kernel32.dll")]
    private static extern bool QueryPerformanceFrequency(
        out long lpFrequency);

    private long startTime, stopTime;
    private long freq;

    // Constructor
    public HiPerfTimer()
    {
        startTime = 0;
        stopTime = 0;

        // Fails if no high-resolution counter is available
        if (!QueryPerformanceFrequency(out freq))
        {
            throw new Win32Exception();
        }
    }

    // Start the timer
    public void Start()
    {
        QueryPerformanceCounter(out startTime);
    }

    // Stop the timer
    public void Stop()
    {
        QueryPerformanceCounter(out stopTime);
    }

    // Returns the duration of the timer (in seconds)
    public double Duration
    {
        get
        {
            return (double)(stopTime - startTime) / (double)freq;
        }
    }
}
```
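As an aside, the .NET Framework also ships a `System.Diagnostics.Stopwatch` class that provides the same kind of high-resolution timing without any P/Invoke declarations. The following is only a minimal sketch of how the timing could be done that way; the class name `StopwatchTiming` and the method `TimeSeconds` are hypothetical names chosen for illustration and are not part of the examples described in this post.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not from the original examples): times one call
// of an arbitrary routine using the built-in Stopwatch class.
internal static class StopwatchTiming
{
    // Runs the given routine once and returns the elapsed wall-clock time in seconds.
    public static double TimeSeconds(Action routine)
    {
        Stopwatch sw = Stopwatch.StartNew();  // create and start in one step
        routine();                            // e.g. the wrapped NAG routine call
        sw.Stop();
        return sw.Elapsed.TotalSeconds;
    }
}
```

The property `Stopwatch.IsHighResolution` reports whether a high-resolution counter is backing the measurement on the current machine.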

After the examples are compiled with the C# command-line compiler ‘csc’, the number of threads is first set to 1 and then to 2, and the program is run once for each setting. The only output printed is the time the routine needs to calculate the result.

#### Correlation

NAG routine C06PKF calculates the circular convolution or correlation of two complex vectors of period n. In this example, two complex vectors of period n (n = 5000000) are built, and both the correlation and the circular convolution are calculated.

```
C:\C06PKF\C06PKF>csc c06pkf.cs

C:\C06PKF\C06PKF>c06pkf
Duration: 5.74 sec

C:\C06PKF\C06PKF>c06pkf
Duration: 2.98 sec
```


#### Random Number Generation

NAG routine G05YKF generates a quasi-random sequence from a log-normal distribution. It must be preceded by a call to one of the initialization routines G05YLF or G05YNF. The routine is passed the number N (N = 10000000) of quasi-random points to generate, their dimension IDIM (IDIM = 4), and, for each dimension, the mean and the standard deviation of the underlying Normal distribution (two arrays of length IDIM).

```
C:\G05YKF\G05YKF>csc g05ykf.cs

C:\G05YKF\G05YKF>g05ykf
Duration: 3.52 sec

C:\G05YKF\G05YKF>g05ykf
Duration: 2.06 sec
```


#### Approximation

NAG routine E02CAF forms an approximation to the weighted, least-squares Chebyshev series surface fit to data arbitrarily distributed on lines parallel to one independent coordinate axis. It determines a bivariate polynomial approximation of degree k in x and l in y to the set of data points $(x_{r,s}, y_s, f_{r,s})$, with weights $w_{r,s}$, for $s = 1, 2, \ldots, n$ and $r = 1, 2, \ldots, m_s$. That is, the data points are on lines $y = y_s$, but the x values may be different on each line. The polynomial is represented in double Chebyshev series form.

N, the number of lines on which data points are given, was set to 1000; K, the required degree in x, was set to 100; and L, the required degree in y, was also set to 100. The fitting polynomial was also evaluated at the data points using the routine E02CBF.

```
C:\E02CAF\E02CAF>csc e02caf.cs

C:\E02CAF\E02CAF>e02caf
Duration: 4.77 sec

C:\E02CAF\E02CAF>e02caf
Duration: 2.48 sec
```


Fig 1: Timings for the SMP enabled NAG routines with one or two threads
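
To put the timings in perspective, here is a rough worked calculation. It assumes that in each pair of runs the first timing is the one-thread run ($T_1$) and the second is the two-thread run ($T_2$), which is the order in which the thread count was set. The speedup is $S = T_1 / T_2$ and the parallel efficiency on two threads is $E = S / 2$:

$$
\begin{aligned}
\text{C06PKF:}\quad & S = \frac{5.74}{2.98} \approx 1.93, & E &\approx 0.96 \\
\text{G05YKF:}\quad & S = \frac{3.52}{2.06} \approx 1.71, & E &\approx 0.85 \\
\text{E02CAF:}\quad & S = \frac{4.77}{2.48} \approx 1.92, & E &\approx 0.96
\end{aligned}
$$

These are single measurements on one laptop, so the figures should be read as indicative only.
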

My experience with presenting these examples at different clients has been very positive, but also interesting! One reaction was: “Warum verwenden Sie diese Bibliothek in .NET? Das ergibt doch keinen Sinn, da wir nur in Fortran programmieren!” (“Why are you using the Library in .NET and not directly in Fortran? This doesn’t make any sense, we are coding only in Fortran!”)

Others were excited to learn that they could put their second processor to work!

My conclusions are as follows:

• Never show a C# program to HPC programmers who love Fortran and think languages such as C are radical!

• The NAG Library for SMP and Multicore really is easy to use, and for those who are current NAG Fortran Library users the transition really is painless. Users just need to take care to set the number of threads used.

I should admit a bias as a NAG employee, but perhaps more importantly I’d like to use this blog to talk to users and prospective users and to offer help. As I said in my introduction, I am interested in this topic, but also in hearing from those who have interests in approximation and interpolation. I plan to work with some of my developer colleagues to expand some of NAG’s interpolation solvers.

## Tuesday, 22 June 2010

### Technical computing futures part 2: GPU and manycore success

In my previous blog post, I suggested that the HPC revolution towards GPUs (or similar many-core technologies) as the primary processor has a lot in common with the move from RISC to commodity x86 processors a few years ago. A new technology appears to offer cheaper (or better) performance than the incumbent, for some porting and tuning pain. Of course, I’m not the first HPC blogger to have made this observation, but I hope to follow it a little further.

## Wednesday, 16 June 2010

### The meaning of recognition

A few years ago, NAG decided to brush up its public image and issued its staff with The Company Shirt (actually, it turned out that there was so much money in the marketing budget that each of us was able to have our own shirt, as opposed to being obliged to take it in turns to sport a single item of clothing). It's a rather splendid garment (you can see my colleague Mike Dewar elegantly modelling his below) that proudly but discreetly displays the company logo and, for good measure, the full name of the company (lest the abbreviation be misinterpreted as an exhortation to complain endlessly). The idea - which, I'd imagine, is common to just about every organization in the world - is that the shirt can be worn on exhibition stands, when giving commercial presentations or making customer visits so that a (somewhat loosely) unified image of the company is presented to the outside world. Members of staff have acceded to this idea with varying degrees of alacrity; speaking for myself - following an initial period of uneasiness where I suspected (quite without foundation) that the next step on the road to an improved image would be The Company Song - I've been happy to wear The Shirt on every appropriate occasion.

And on some less appropriate ones as well. For example, last night I was taking part in a choir rehearsal as part of the preparation for a well-known religious leader's visit to the UK later this year. Not having had time to get changed after work, I was wearing The Shirt as we collectively negotiated the joys of counting bars, leaping fourths, subdividing triplets and other more applied forms of numerical analysis. Approaching the conductor - who'd been brought in from another parish in order to adeptly marshal our enthusiastic but slightly unfocussed efforts - with a technical question at the end of the rehearsal, I was a little surprised when he asked if I worked for NAG. Wondering if he was about to quiz me about - say - our optimization routines, I replied - somewhat cautiously - in the affirmative. "Great stuff," he responded. "I used the NAG Library all the time at university when I was programming in Fortran - it was really, really good." Owing to the context, the generous and unlooked-for compliment was so surprising that I forgot to say that these days, the Library wasn't only available to Fortran programmers (on reflection, perhaps that was just as well on this occasion) but I was also reminded that this kind of encounter isn't at all uncommon. Given the remarkable age of the company, perhaps it's only to be expected that you frequently bump into users - or ex-users - of your products, but it's still gratifying when they're able to share positive experiences - or happy memories - of it. Maybe I should start work on that Company Song after all.

## Tuesday, 8 June 2010

### Revealing the future of technical computing: part 1

I recall some years ago porting an application code I worked with, which was developed and used almost exclusively on a high end supercomputer, to my PC. Naively (I was young), I was shocked to find that, per-processor, the code ran (much) faster on my PC than on the supercomputer. With very little optimization effort.

How could this be – this desktop machine costing only a few hundred pounds was matching the performance of a four processor HPC node costing many times that? Since I was also starting to get involved in HPC procurements, I naturally asked why we spend millions on special supercomputers, when for a twentieth of the price, we’d get the same throughput from a bunch of high-spec PCs?

## Monday, 7 June 2010

### Why is writing good numerical software so hard?

People sometimes say to us "Why does NAG continue to exist? Surely all the good numerical algorithms have already been devised and implemented?". Well, this question ignores the fact that people keep finding new kinds of problem to solve. It also forgets that we need to make our current software continue to work robustly and efficiently as new hardware and software infrastructures develop - we need our code to run on all the platforms and in all the environments that our customers use.

Making the NAG library work with a new compiler is always a non-trivial task (and sometimes it can seem more like a nightmare - just ask some of my colleagues!) It may be because we've got some code that's not yet been tested in a wide enough variety of environments. Or it may be because compiler optimization switches can reorder our code in ways that make the floating-point arithmetic behave in a way that we hadn't anticipated (which could be the compiler's fault or could be our fault, depending). But whoever is to blame, we have to fix it.

When you are dealing with complicated numerical codes, tracking down a problem might take a lot of hard work with a debugger (or with print statements if you're unlucky enough to be in an environment where a debugger just won't work). And even what look like the simplest operations might turn out to be a lot more complicated than you'd expect.

As an example, take the method of dividing one complex number by another. Languages like Fortran have in-built support for complex numbers and complex arithmetic, but not all do, and so the NAG Library does contain some routines to help.

Performing a complex division is simple in principle. If the numbers are called X and Y, in general they are composed of real and imaginary parts, let's call them X.re and X.im, Y.re and Y.im. Then to compute the value Z = X / Y, just multiply both the top and bottom halves of the quotient by the complex conjugate of Y (i.e. the complex number you get when you negate the imaginary part of Y). Since we're multiplying the top and bottom by the same thing, there's no net effect on the answer, but the clever part is that when you multiply a complex number by its own conjugate, the answer is real. That means that Y * conjugate(Y) is real, and the complex division operation is reduced to an (easier) complex multiplication followed by a few real divisions. The complex multiplication is done by cross multiplying (in real arithmetic) the real and imaginary parts of X and Y, then adding and subtracting appropriate pieces to form the real and imaginary parts of the result. Counting all the floating-point operations up, this turns out to be 6 multiplies, 3 additions or subtractions, and 2 divides - a total of 11 operations - quite significant compared to a single real division, but still not a lot of work.
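
Written out in the notation used above, the conjugate trick looks like this (a worked restatement of the paragraph, nothing more):

$$
Z = \frac{X}{Y} = \frac{X\,\overline{Y}}{Y\,\overline{Y}}
  = \frac{(X_{re} Y_{re} + X_{im} Y_{im}) + i\,(X_{im} Y_{re} - X_{re} Y_{im})}{Y_{re}^2 + Y_{im}^2}
$$

The numerator costs 4 multiplies and 2 additions or subtractions, the denominator 2 multiplies and 1 addition, and the real and imaginary parts each need 1 divide, which gives the 6 + 3 + 2 = 11 operations counted above.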

So - it can't be all that hard to get it right can it?

Wrong. We need to take some care here. For example, if the complex numbers X and Y are very large (and I'm talking about larger than you might ordinarily meet - say about the size of or bigger than the square root of the largest number you have on your computer), there's a chance that although the number Z = X / Y is a perfectly reasonable number, some of those intermediate 11 computations could cause arithmetic overflow. You'll get garbage as a result - not good. People are not happy if they divide a large number by itself and get something not equal to 1!
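
To make the overflow risk concrete, here is a minimal sketch of the straightforward 11-operation division described above. The class name `NaiveComplexDivision` and its `Divide` method are hypothetical names for illustration; this is not NAG code.

```csharp
using System;

// Hypothetical illustration: textbook conjugate-formula complex division.
internal static class NaiveComplexDivision
{
    // Computes Z = X / Y using 6 multiplies, 3 additions or subtractions
    // and 2 divides, with no protection against overflow or underflow.
    public static void Divide(double xRe, double xIm, double yRe, double yIm,
                              out double zRe, out double zIm)
    {
        double denom = yRe * yRe + yIm * yIm;   // Y * conjugate(Y), a real number
        zRe = (xRe * yRe + xIm * yIm) / denom;  // real part of X * conjugate(Y), scaled
        zIm = (xIm * yRe - xRe * yIm) / denom;  // imaginary part, scaled
    }
}
```

For moderate inputs this behaves as expected: (1 + 2i) / (3 + 4i) comes out as 0.44 + 0.08i up to rounding. But with X = Y = 1e200 + 1e200i the intermediate products are of order 1e400, well beyond the largest double (about 1.8e308). When the arithmetic is carried out in strict IEEE double precision they overflow to infinity, and the quotient comes back as NaN instead of 1.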

The implementation of complex division that had been in the NAG Library since the early 1970s took care to avoid unnecessary overflow and underflow by doing some cunning scaling and rearranging into 4 multiplies, 4 adds or subtracts, and 3 divides. The algorithm is documented in the Mark 22 NAG Library Manual, and I'd never given it much thought - I always assumed this algorithm was fine.

Well, we recently discovered that the method used by routine A02AC was not as perfect as I'd thought! The reason? Old age, really. The algorithm was based on a method by Robert Smith, published in Communications of the ACM way back in August 1962, and this had been developed in the dark days before the IEEE standard for floating-point arithmetic was designed in the 1980s (Standard for Binary Floating-Point Arithmetic (IEEE 754-1985)).
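
For readers who have not met it, the following is a minimal sketch of the textbook form of Smith's method. It is an illustration only and is not the NAG A02AC source: the class and method names are hypothetical, and this form uses 3 multiplies, 3 additions or subtractions and 3 divides per call, so it differs in detail from the NAG implementation described above.

```csharp
using System;

// Hypothetical illustration of the textbook form of Smith's (1962) method.
internal static class SmithComplexDivision
{
    // Computes Z = X / Y, scaling by the larger component of Y so that
    // the squares Y.re^2 and Y.im^2 are never formed explicitly.
    public static void Divide(double xRe, double xIm, double yRe, double yIm,
                              out double zRe, out double zIm)
    {
        if (Math.Abs(yRe) >= Math.Abs(yIm))
        {
            double r = yIm / yRe;            // |r| <= 1
            double denom = yRe + yIm * r;    // equals (Y.re^2 + Y.im^2) / Y.re
            zRe = (xRe + xIm * r) / denom;
            zIm = (xIm - xRe * r) / denom;
        }
        else
        {
            double r = yRe / yIm;            // |r| < 1
            double denom = yRe * r + yIm;    // equals (Y.re^2 + Y.im^2) / Y.im
            zRe = (xRe * r + xIm) / denom;
            zIm = (xIm * r - xRe) / denom;
        }
    }
}
```

With this arrangement, dividing 1e200 + 1e200i by itself gives 1 + 0i, because no intermediate quantity grows much beyond the size of the inputs. The sketch makes no attempt to handle Y = 0 or the denormalized inputs discussed next.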

The IEEE standard introduced the concept of denormalized numbers - tiny but non-zero numbers which have less precision than regular numbers. It turned out that if you fed these into the NAG routine, you might still end up with garbage. For example, dividing a number X, whose real and imaginary parts are both equal to the same denormalized number, by itself would return not the expected value 1, but a complex number composed of real part infinity and imaginary part NaN (NaN stands for "Not a Number"). Yuk.

Does it matter? Probably not much - not many people would call a NAG routine for something as "simple" as complex division (they'd rely on their compiler) and even if they did the chances are they would not be using denormalized numbers. But, a curse of working at NAG is that as soon as you find out about a problem like this you feel obliged to do something about it, and that's what my colleague Mat did. He located newer published methods which did take account of denormalized numbers, by introducing extra scaling operations, and based a revised version of the NAG routine on one of those. I won't bore you with the details of it here - suffice it to say that the newer method is a bit more complicated, but deals with the problem nicely, and it will debut in the next version of the NAG Library.

This simple case illustrates that even for the most basic numerical software things can go wrong if you don't take care. Imagine what it's like for something complicated like the quadratic formula! (I'm only half joking - safely finding the roots of a quadratic equation takes a frighteningly large number of lines of code).
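
To give a flavour of why the quadratic formula is harder than it looks, here is a minimal sketch that addresses just one of its pitfalls: the loss of accuracy from subtracting two nearly equal numbers when b*b is much larger than 4*a*c. The names are hypothetical, the code assumes a is non-zero, and it does nothing about overflow or rounding in the discriminant itself, which is part of what makes a fully safe version so long.

```csharp
using System;

// Hypothetical illustration: real roots of a*x^2 + b*x + c = 0 (a != 0),
// avoiding cancellation between -b and the square root of the discriminant.
internal static class StableQuadratic
{
    // Returns false (and NaN roots) when the discriminant is negative.
    public static bool RealRoots(double a, double b, double c,
                                 out double x1, out double x2)
    {
        x1 = double.NaN;
        x2 = double.NaN;
        double disc = b * b - 4.0 * a * c;
        if (disc < 0.0) return false;            // complex roots: not handled here

        double sqrtDisc = Math.Sqrt(disc);
        // Give the square root the same sign as b, so the two terms add
        // in magnitude instead of cancelling.
        double q = -0.5 * (b + (b >= 0.0 ? sqrtDisc : -sqrtDisc));
        if (q == 0.0)                            // only when b == 0 and c == 0
        {
            x1 = 0.0;
            x2 = 0.0;
            return true;
        }
        x1 = q / a;                              // root of larger magnitude
        x2 = c / q;                              // other root, from x1 * x2 = c / a
        return true;
    }
}
```

For x^2 - 1e8 x + 1 = 0 the roots are close to 1e8 and 1e-8. The usual formula (-b - sqrt(disc)) / (2a) loses most of the significant digits of the small root to cancellation, whereas the rearrangement above recovers it from the product of the roots.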

## Thursday, 3 June 2010

### To people who provide technical support

After feedback from very satisfied NAG users, it seems to be high time to publicly appreciate the folks who work in front line technical support teams, and to point out what great value they provide to customers, not just for NAG but across many software organisations worldwide. It is a very important job that involves much more than answering questions. It is about asking the right return questions and about careful communications.

The idea of a TV detective springs to mind; working the crime scene at an upmarket golf club near Hollywood, with a smart party in the background… imagine Detective Lt Columbo, in the clubhouse, picking his words with oh-so-much care, and asking questions in the most un-intimidating way. At the same time, he has four suspects lined up at the back of his mind and is waiting to discount each one as the facts of the case (or support problem) are drawn out into the light.

Who at NAG? What! Not a doctor?

The support job sounds easy to start with: When someone gets in touch with a support desk it is because they need help. But it needs special skills to uncover all the important details that provide the clues as to what is going wrong. The support role needs an ability to understand what the user is trying to achieve and the mechanisms he or she wants to use. It needs enough knowledge to be able to challenge the user’s ideas of how their own system or application should behave and it needs the balance of authority and diplomacy to work through the possible solutions.

A little like the down-at-heel TV detective investigating the crime at the exclusive golf club, it is common for the person handling a support request to have no prior knowledge of the user’s application or environment. So it can be a steep learning curve in support. But rapid self-education in the user’s field can be critical for clear technical conversations. This is particularly so when communicating with NAG Library users, since they work in very many different industries and subject specialties.

The final twist in the detective plot is often the most important. Here, unlike the TV detective who has to expose the evil perpetrator, the very best support people don’t really say that the user may have made any mistakes. Instead they leave the user with new and valuable knowledge and often a favoured contact that they are happy to turn to if they need real help again in the years to come.

To end, I need to be clear that this is not meant to suggest that support people are dishevelled, or that they dress badly (I am in no position to comment on these subjects). And support teams actually do have an in-depth knowledge of many fields - at NAG these are the same type of people who develop the NAG Library and provide all the detailed support information http://www.nag.co.uk/Forms/support.asp and the diverse technical papers http://www.nag.co.uk/doc/techrep/index.asp that help users make the best use of the NAG Library every day.

The analogy is intended to show how, like good detective work, the job is skilled, subtle and critical. Many organisations are built on great support teams who provide astoundingly good help. Just listen to the users of NAG software saying: 'keep up the good work'.

## Tuesday, 1 June 2010

### NAG at ISC 2010

NAG is attending International Supercomputing 2010 in Hamburg, Europe's largest supercomputing conference. Attendance is apparently at its highest ever and many of the sessions have been standing-room only. During Thomas Sterling's talk yesterday, on "Parallel Computing in the Years to Come", there were people sitting in the aisles and listening from outside in the corridor.

The conference started with the announcement of the latest Top 500 List, and the big news was that a Chinese system, Nebulae, is now number 2, behind Jaguar at Oak Ridge National Laboratory in the US. This is the second Chinese system to get into the top 10 and, like Tianhe-1, currently at number 7 in the list, it gets its performance from GPUs. In fact its theoretical peak performance is higher than Jaguar's, but it is hard to achieve that performance with existing software development tools.

GPUs are very much the flavour of the moment. There are some interesting new products for GPU programming; in particular, both Allinea and TotalView have versions of their parallel debuggers for CUDA code. The NAG booth has been very busy and it seems that at least half the people who drop by are interested in our own software for GPUs, currently available for beta testing. The feedback and suggestions for additional functionality will be a great help to us in bringing this product to market. The other interesting development in this area is that we're seeing increasing interest in software for OpenCL, largely due to the adoption of ATI cards. We'd love to hear from people who are using OpenCL and would like to make use of numerical components such as ours.