### My first experiences calling the NAG Library for SMP and Multicore

A quick introduction: I work for NAG as a Technical Support Engineer in Germany. I look after clients in Germany, Switzerland and Austria and am interested in approximation and interpolation methods as well as .NET languages. Naturally, I’m a big fan of the NAG Library for .NET as well as the NAG Library for SMP and Multicore.

I am not a Fortran programmer, so to try out the new NAG Library for SMP and Multicore I chose a few simple tests that I could run on both cores of my laptop's processor. I wrote .NET console applications that wrap the Fortran routines, pass all the arguments to them in a .NET fashion, and time each call. Many parts of the Library are parallelized, including quadrature, partial differential equations, interpolation, curve and surface fitting, linear algebra, correlation and regression analysis, multivariate methods, random number generators, time series analysis, sorting and special functions. I chose routines from the correlation, curve and surface fitting, and random number generation chapters.

The results were obtained on an Intel® Core™2 Duo Processor P8700 (2.53GHz, 1066MHz, 3MB) machine running Windows Vista, with Microsoft Visual Studio 2008 (32-bit project) as the development environment. Each routine was given a problem large enough to benefit from parallelism, and each test case was run with one thread and then with two. The number of threads was set with the environment variable ‘OMP_NUM_THREADS’ in the Visual Studio console window.

All the examples use the Windows ‘QueryPerformanceCounter’ function to measure the time each routine needs to compute its result. Below is the class used in each example: create a HiPerfTimer object, call its Start and Stop methods around the routine, and then read its Duration property:

```csharp
HiPerfTimer pt = new HiPerfTimer();
pt.Start();
// routine
pt.Stop();
double seconds = pt.Duration; // Duration is a property, not a method
```

```csharp
using System.ComponentModel;          // Win32Exception
using System.Runtime.InteropServices; // DllImport
using System.Threading;               // Thread

internal class HiPerfTimer
{
    [DllImport("Kernel32.dll")]
    private static extern bool QueryPerformanceCounter(out long lpPerformanceCount);

    [DllImport("Kernel32.dll")]
    private static extern bool QueryPerformanceFrequency(out long lpFrequency);

    private long startTime, stopTime;
    private long freq;

    // Constructor: reads the counter frequency (ticks per second)
    public HiPerfTimer()
    {
        startTime = 0;
        stopTime = 0;
        if (QueryPerformanceFrequency(out freq) == false)
        {
            throw new Win32Exception();
        }
    }

    // Start the timer
    public void Start()
    {
        Thread.Sleep(0); // yield to any waiting thread before measuring
        QueryPerformanceCounter(out startTime);
    }

    // Stop the timer
    public void Stop()
    {
        QueryPerformanceCounter(out stopTime);
    }

    // Returns the duration of the timer (in seconds)
    public double Duration
    {
        get { return (double)(stopTime - startTime) / (double)freq; }
    }
}
```

After compiling the examples with the C# compiler ‘csc’, each one is run first with the number of threads set to 1 and then with it set to 2. The only output printed is the time the routine needs to compute its result.
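As an aside, the same kind of measurement can be made without P/Invoke. The sketch below is not the class used for the timings in this post; it assumes the standard `System.Diagnostics.Stopwatch` class, which uses the high-resolution performance counter when the system provides one (see `Stopwatch.IsHighResolution`). The names `StopwatchTiming` and `TimeSeconds` are made up for illustration.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of the original examples).
internal static class StopwatchTiming
{
    // Runs the given action once and returns the elapsed wall-clock time in seconds.
    public static double TimeSeconds(Action work)
    {
        Stopwatch sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        return sw.Elapsed.TotalSeconds;
    }
}
```

A call would then look like `double secs = StopwatchTiming.TimeSeconds(() => { /* call the wrapped routine here */ });` followed by `Console.WriteLine("Duration: {0:F2} sec", secs);`.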

**Correlation**

NAG routine C06PKF calculates the circular convolution or correlation of two complex vectors of period n.

In this example two complex vectors with a period of n (n = 5000000) are built, and their correlation and circular convolution are calculated.

```
C:\C06PKF\C06PKF>csc c06pkf.cs
C:\C06PKF\C06PKF>set OMP_NUM_THREADS=1
C:\C06PKF\C06PKF>c06pkf
Duration: 5.74 sec
C:\C06PKF\C06PKF>set OMP_NUM_THREADS=2
C:\C06PKF\C06PKF>c06pkf
Duration: 2.98 sec
```
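To make clear what is being computed, here is a minimal sketch of circular convolution and correlation written directly from the usual definitions. It is a plain O(n²) illustration for small vectors, not the algorithm NAG uses and not the wrapper code from my example; the class name `CircularOps` is made up, and it assumes `System.Numerics.Complex` is available (.NET 4 or later).

```csharp
using System;
using System.Numerics;

// Hypothetical illustration class (not part of the NAG Library).
internal static class CircularOps
{
    // Circular convolution: z[k] = sum over j of x[j] * y[(k - j) mod n].
    public static Complex[] Convolve(Complex[] x, Complex[] y)
    {
        int n = x.Length;
        if (y.Length != n) throw new ArgumentException("x and y must have the same length");
        Complex[] z = new Complex[n];
        for (int k = 0; k < n; k++)
        {
            Complex sum = Complex.Zero;
            for (int j = 0; j < n; j++)
            {
                // Add n before the final % so the index is never negative.
                sum += x[j] * y[((k - j) % n + n) % n];
            }
            z[k] = sum;
        }
        return z;
    }

    // Circular correlation: w[k] = sum over j of conj(x[j]) * y[(k + j) mod n].
    public static Complex[] Correlate(Complex[] x, Complex[] y)
    {
        int n = x.Length;
        if (y.Length != n) throw new ArgumentException("x and y must have the same length");
        Complex[] w = new Complex[n];
        for (int k = 0; k < n; k++)
        {
            Complex sum = Complex.Zero;
            for (int j = 0; j < n; j++)
            {
                sum += Complex.Conjugate(x[j]) * y[(k + j) % n];
            }
            w[k] = sum;
        }
        return w;
    }
}
```

With n = 5000000 a direct double loop like this would be hopeless, which is why a tuned library routine is the sensible choice for problems of that size.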

**Random Number Generation**

NAG routine G05YKF generates a quasi-random sequence from a log-normal distribution. It must be preceded by a call to one of the initialization routines G05YLF or G05YNF. In this example N = 10000000 quasi-random points of dimension IDIM = 4 are requested. The routine is also passed the arrays mean(IDIM) and std(IDIM), which hold the mean and the standard deviation of the underlying Normal distribution for each dimension.

```
C:\G05YKF\G05YKF>csc g05ykf.cs
C:\G05YKF\G05YKF>set OMP_NUM_THREADS=1
C:\G05YKF\G05YKF>g05ykf
Duration: 3.52 sec
C:\G05YKF\G05YKF>set OMP_NUM_THREADS=2
C:\G05YKF\G05YKF>g05ykf
Duration: 2.06 sec
```

**Approximation**

NAG routine E02CAF forms an approximation to the weighted, least-squares Chebyshev series surface fit to data arbitrarily distributed on lines parallel to one independent coordinate axis. It determines a bivariate polynomial approximation of degree k in x and l in y to the set of data points $(x_{r,s}, y_s, f_{r,s})$, with weights $w_{r,s}$, for $s = 1, 2, \ldots, n$ and $r = 1, 2, \ldots, m_s$. That is, the data points are on lines $y = y_s$, but the x values may be different on each line. The polynomial is represented in double Chebyshev series form.

The parameters were set as follows:

- N, the number of lines on which data points are given, was set to 1000.
- K, the required degree in x, was set to 100.
- L, the required degree in y, was also set to 100.

The fitting polynomial was also evaluated at the data points using the routine E02CBF.

```
C:\E02CAF\E02CAF>csc e02caf.cs
C:\E02CAF\E02CAF>set OMP_NUM_THREADS=1
C:\E02CAF\E02CAF>e02caf
Duration: 4.77 sec
C:\E02CAF\E02CAF>set OMP_NUM_THREADS=2
C:\E02CAF\E02CAF>e02caf
Duration: 2.48 sec
```

Fig 1: Timings for the SMP enabled NAG routines with one or two threads
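To put these timings in perspective, the usual measures are speedup (one-thread time divided by two-thread time) and parallel efficiency (speedup divided by the number of threads). The small helper below is only a sketch of that arithmetic; the class and method names are made up for illustration.

```csharp
using System;

// Hypothetical helper for the arithmetic only.
internal static class ParallelMetrics
{
    // Speedup: how many times faster the parallel run is than the serial run.
    public static double Speedup(double serialSeconds, double parallelSeconds)
    {
        return serialSeconds / parallelSeconds;
    }

    // Efficiency: speedup as a fraction of the ideal speedup (the thread count).
    public static double Efficiency(double serialSeconds, double parallelSeconds, int threads)
    {
        return Speedup(serialSeconds, parallelSeconds) / threads;
    }
}
```

Applied to the durations above, C06PKF gives a speedup of about 1.93 (5.74 / 2.98, roughly 96% efficiency on two threads), G05YKF about 1.71 (3.52 / 2.06, roughly 85%) and E02CAF about 1.92 (4.77 / 2.48, roughly 96%).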

My experience of presenting these examples to different clients has been very positive, but also interesting! One reaction was: “Warum verwenden Sie diese Bibliothek in .NET? Das ergibt doch keinen Sinn, da wir nur in Fortran programmieren!” (“Why are you using the Library in .NET and not directly in Fortran? This doesn’t make any sense, we are coding only in Fortran!”)

Others were excited to learn that they could put their second processor core to work!

My conclusions are as follows:

- Never show a C# program to HPC Computer Programmers who love Fortran and think languages such as C are radical!
- The NAG Library for SMP and Multicore really is easy to use, and for current NAG Fortran Library users the transition really is painless. Users just need to take care of the number of threads used.

My article shows how the NAG SMP library can be used with .NET in exactly the same manner as the NAG serial library. The performance of the SMP library is achieved by careful tuning of the NAG source code, using OpenMP to parallelise it. For the PC environment, the SMP library on offer is built with the Intel Fortran compiler and hence uses Intel OpenMP threads.

Because of this, the SMP library works best in this threading environment or when called from a single thread. Use from a different threading environment can cause difficulties. It is worth noting that my examples were timed and issued from a single .NET thread.

If you are planning to use the NAG SMP library from a multi-threaded .NET program then we recommend that the environment variable OMP_NUM_THREADS be set to one. This switches off the threading in the SMP library and avoids thread conflict with the .NET threads of the main program.
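One way to guard against forgetting this is to check the variable at start-up, before any .NET worker threads are created. The sketch below only reads the variable as the .NET process sees it; it does not call the NAG Library or change any setting, and the class and method names are made up for illustration.

```csharp
using System;

// Hypothetical start-up check (not part of the NAG Library).
internal static class OmpThreadGuard
{
    // Returns OMP_NUM_THREADS as an integer, or null when it is unset
    // or is not a plain integer.
    public static int? ReadOmpNumThreads()
    {
        string raw = Environment.GetEnvironmentVariable("OMP_NUM_THREADS");
        int value;
        if (raw != null && int.TryParse(raw.Trim(), out value))
        {
            return value;
        }
        return null;
    }

    // True only when OMP_NUM_THREADS is explicitly set to 1.
    public static bool IsSetToOneThread()
    {
        return ReadOmpNumThreads() == 1;
    }
}
```

A multi-threaded program could call `OmpThreadGuard.IsSetToOneThread()` first and print a warning, or refuse to start its worker threads, when it returns false.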