Thursday, 9 July 2015

NAG Linear Regression on Apache Spark

This is a brief summary of a talk I gave recently at the Chicago Apache Spark Users Group Meetup. During the talk, I present many of the problems and successes when using the NAG Library distributed on Apache Spark worker nodes. You can find the slides available here.

The Linear Regression Problem


In this post we test the scalability and performance of using NAG Library for Java to solve a large-scale multi-linear regression problem on Spark. The example data ranges from 2 gigabytes up to 64 gigabytes in the form of

label x1 x2 x3 x4
68050.9 42.3 12.1 2.32 1
87565 47.3 19.5 7.19 2
65151.5 47.3 7.4 1.68 0
78564.1 53.2 11.4 1.74 1
56556.7 34.9 10.7 6.84 0

We solve this problem using the normal equations. This method allows us to map the sum-of-squares matrix computation across worker nodes. The reduce phase of Spark aggregates two of these matrices together. In the final step, a NAG linear regression routine is called on the master node to calculate the regression coefficients. All of this happens in one pass over the data - no iterative methods needed!

Example output from the linear regression is the following:

Time taken for NAG analysis: 236.964 seconds
Number of Independent Variables: 5
Total Number of Points: 334800000
R-Squared = 0.699
Var Coef SE t-value
Intcp 12723.3 2.2 5783.3
0 989.7 0.02 47392.5
1 503.4 0.04 11866.5
2 491.1 0.1 4911.9
3 7859.1 0.9 8732.3

*************************************************
Predicting 4 points
Prediction: 57634.9 Actual: 60293.6
Prediction: 32746.6 Actual: 35155.5
Prediction: 49917.5 Actual: 52085.3
Prediction: 82413.2 Actual: 82900.3

Timings


Using the above method, the NAG linear regression algorithm is able to compute the exact values for the regression coefficients, t-values, and errors. Below is a log-log plot of the Runtimes vs. Size of Input data for the NAG routines on an 8 slave Amazon EC2 cluster.





Not only is it important the NAG routines run fast, but that they scale efficiently as you increase the number of worker nodes. Let's look at how the 16 GB data set scales as we vary the number of slaves.







Comparison To MLlib


We tried running the MLlib Linear Regression algorithm on the same data, but were unable to get meaningful results. The MLlib algorithm using the stochastic gradient descent to find the optimal coefficients, but the last stochastic steps always seemed to return NaNs (we would be happy to share sample data ...let us know if you can solve the problem!).

More Information


For more information including examples of the NAG Library on Spark, contact us at support(at)nag.com

Thursday, 25 June 2015

Helping primary school children achieve the best in their SATs

My wife Caroline is a teacher in a primary school near where we live in Buckinghamshire. She teaches year 6, that is to say, ten to eleven year old children.

Now, year 6 is a rather important year in British schools. Children following the National Curriculum in state-funded schools are subject to SATs (standard assessment tests) before they move on to secondary school, and the results of these tests are used to create the dreaded league tables which are supposed to help parents make an informed choice about which school to send their children to. Importantly, the number of pupils can affect the amount of funding a school gets.

Year 6 teachers are therefore under a lot of pressure to help their pupils achieve good SATs results, so extra sessions and booster groups are common. This year Caroline asked if I could go into school to spend a few hours with her top maths set - ones that it had been decided were good enough to be entered for the harder (Level 6) papers.

I'm used to teaching adults on NAG training courses, but this was the first time I'd been into a primary school to teach. It turned out to be good fun! I had four students, all of whom had already passed their "eleven plus" and would be going to grammar school. Topics we covered included straight line graphs, probability and algebra. The kids were enthusiastic, and not too shy. Many of the problems tackled turned out to involve rearrangement of equations, so drilling into them that if you do the same thing to both sides of an equation (with some caveats of course) then the equality still holds seemed quite important.

Whether the short time I spent with them actually made any difference is hard to know, but I really enjoyed the interaction and the children seemed to also. The SATs results are not yet out, and scoring a Level 6 in Mathematics or English is quite difficult (the average 11 year old is expected to achieve Level 4) - so there will be no dishonour in not achieving it, but I'm looking forward to hearing how they get on. And I'm hoping that next year, as part of NAG's Community Outreach Programme, I'll be able to go into school again, and maybe for more hours this time around.

Thursday, 18 June 2015

Wilkinson Prize for Numerical Software 2015

by Mike Dewar, Chief Technical Officer, Numerical Algorithms Group (NAG)


James Hardy Wilkinson
NAG, along with Argonne National Laboratory and the National Physical Laboratory (NPL), awards the Wilkinson Prize every four years to the authors of an outstanding piece of numerical software. The prize honours the pioneering work of Dr James Hardy Wilkinson in the field of numerical algorithms, work which had a fundamental impact on NAG in our early years and which still resonates today. The Wilkinson Prize is unique in that it is awarded on the basis of the quality of a piece of software, not just for the mathematics behind it, and the judges consider the documentation and test suites as well as the software itself.  

This year the winners are P.E. Farrell (University of Oxford), S.W. Funke (Simula Research Laboratory), D.A. Ham (Imperial College London), and M.E. Rognes (Simula Research laboratory) for the development of dolfin-adjoint, a package which automatically derives and solves adjoint and tangent linear equations from high-level mathematical specifications of finite element discretisations of partial differential equations. The prize will be presented at ICIAM in Beijing in August, during the SIAM Awards Lunch. 


More details about the Wilkinson Prize and past winners can be found here.   

Thursday, 21 May 2015

Introducing the team: Gail Austin, Customer Support Coordinator, Team Leader

Gail, describe your role at NAG?

I manage the Technical Support Service, the Sales Operations Team, NAG Oxford Reception and I am website editor for the marketing area of the website.

Can you give examples of the customer problems you help solve via NAG Technical Support?

Rather than solve users technical issues my role is connected to the direction of incoming support queries and the issuing and renewal of software licences. Over time I have come to understand the installation of our software and so can help users with some of their licensing and software applicability issues.

Tell us something special/unusual about NAG?

I believe that NAG truly cares about giving our customers the best possible service.  This comes across again and again in the positive feedback we receive following the closure of technical support queries.

Tell us something special/unusual about yourself? 

I’m just a little bit proud of the fact that when helping Mick Pont with some tests he was carrying out for a presentation of NAG’s Random Number Generator, I was found to be more random!

Are you attending any exhibitions or conferences this year?

I’m not, I’m too busy in the office, helping our customers to get the best from NAG.

What’s been the highlight of your time working at NAG so far?

That’s a difficult one, I don’t think I can put my finger on any one thing.  I think it’s enough to say that I hope to stay at NAG for the remainder of my career, I love my job here and I consider myself really lucky to have finally found my niche, I’m as at home here as I am ……… at home :-)

Wednesday, 29 April 2015

Professor Mike Powell F.R.S

We are pained to pass on the sad news of the death on 19th April of Prof Mike Powell F.R.S, a brilliant numerical analyst specialising in numerical optimisation and approximation theory.  

Mike touched many lives with his work. NAG and NAG's community benefitted greatly because he gave freely his code to NAG from the very beginning. The more mature may remember the routine E04DCF, an implementation of his hybrid method for unconstrained optimisation. More recently his work on derivative-free optimisation led to E04JCF/e04jcc entering our Libraries as implementations of BOBYQA.

As a founder member of the IMA, Mike took an interest in the IMANA Newsletter and encouraged NAG to contribute regularly to that until the Newsletter's role was subsumed by the internet.  

The Wikipedia page http://en.wikipedia.org/wiki/Michael_J._D._Powell gives more detail of Mike's many achievements.
 
He will be greatly missed by NAG, especially past and present colleagues who had the pleasure of knowing him personally. We pass on our love, condolences and very best wishes to his family.

 

Wednesday, 15 April 2015

Introducing the team: Craig Lucas, Senior Technical Consultant


Craig, describe your role at NAG?

I am a Senior Technical Consultant based in NAG’s Manchester Office. Here I manage part of NAG’s High Performance Computing group. I come from a numerical linear algebra background. For my MSc dissertation I looked at nearest correlation matrix problems, something I still work on today, and this year I have an MSc student looking at more new algorithms.  Another big interest of mine is shared memory parallelism with OpenMP, and in particular helping users to better exploit it.

Can you give examples of the customer problems you help solve via NAG Technical Support?

Technical Support is a great way to interact with our users. Queries can range from something very simple, helping people link their program to the NAG Library for the first time, to an in-depth discussion about the accuracy of an optimization routine. It is often challenging and always an opportunity to learn something new about the mathematics of the Library. Sometimes I need to read our documentation to help me get standard!

Tell us something special/unusual about NAG?

Working at NAG is very collaborative, whether that is working with universities or with other NAG colleagues. One day I might be listening to a Manchester student talking about their new research, another I might be at the whiteboard in the office discussing the nuances of a new NAG algorithm.

Tell us something special/unusual about yourself?

I came to Mathematics relatively late and was 27 when I went to university. I had somehow drifted into Civil Engineering as a teenager, working in the area of land reclamation. Much of my work was dealing with closed coal mines; this was Staffordshire in the late 1980’s. Perhaps I had some sort of epiphany, I certainly realised that a childhood interest needed some proper exploration.  I will always be grateful to Graham Bowtell at City University who decided to take a risk on a mature student without any proper mathematics qualifications.

Are you attending any exhibitions or conferences this year?

My next conference is The University of Manchester SIAM Student Chapter Conference 2015. This annual event is another great opportunity to hear what young researchers are working on. I co-judge the Best talk and Best Poster prizes. It is a nice way to give back to a community that NAG really relies on.

Thursday, 26 March 2015

Introducing the team: Mick Pont, NAG Principal Technical Consultant


Over the next few months, scattered in between our technical blog posts, we are going to publish interviews with NAG colleagues. Our first interview is with Mick Pont.

Mick, what is your role at NAG? 

I'm a Principal Technical Consultant, and Deputy Manager of the Development Division.

I'm involved in the development and peer review of new NAG Library software and documentation, and in the scheduling of software production in line with company targets.

In "project management" speak, I'm named as "Executive" for the NAG Mark 26 Library (the NAG Library, Mark 25 is due for launch in April 2015) project, which means that I'm supposed to "provide support and advice to the project manager, and ensure that the project outcome is good and provides NAG with increased value".

I also produce the NAG Technical Support rotas (most of the NAG development team are on one or other of these rotas regularly). Handling support calls is a good way to get to know our customers, understand the issues they face and help solve them.

I build and test most of the Windows-based libraries that NAG produces, but I'm an operating system agnostic - I like to use Linux and other systems too.

I often go on customer site visits with my sales colleagues, as technical backup.

I also teach various training courses for NAG software, including the NAG Toolbox for MATLAB® and the NAG Library; these training days usually take place at customer sites.

Can you give examples of the customer problems you help solve when you’re in your support role?

Helping users get their NAG software and licence keys installed correctly; helping them migrate from one version of the NAG Library to another; advising on the best NAG routine to solve a particular problem.

Tell us something special/unusual about life at NAG?

We try to celebrate all of the world's festivals where appropriate. Icelandic cream bun day is a big favourite of mine.

Tell us something special/unusual about yourself?

I'm a solipsist, and I choose not to believe in this question.

What industry events are you going to be attending this year?

Lots of training days at customer sites around the UK, and later in the year I will be attending SC15 in Austin, Texas.

Thursday, 5 March 2015

Optimization, NAG and Me - 5 Years and Counting

Authored by Jan Fiala, NAG Numerical Software Developer

It feels almost like yesterday since I joined NAG so it is hard to believe that it has been 5 years already. Looking back, I see a long path I would like to tell you a bit about, just in case you were curious about one of the people answering your support queries.

I was always a blend of a computer scientist and a mathematician. Computers and programming were my hobby but it did not feel quite right to choose either as the main subject at university so I picked mathematics. Thereafter began my hunt to get as close to computers as possible. Numerical analysis and mathematical programming/optimization were my lucky answer.

I really enjoyed my university years, however an academic career is not for everyone. The necessity to write papers ("publish or perish"), going through a couple of postdocs before getting a tenure, fund raising, teaching and going ever deeper and deeper into a very narrow research field are some of the disadvantages. I love to learn, play and to hack things. To make them work. So, when I got the opportunity, I joined NAG.

NAG is a great place to work if you find scientific computing interesting. There is a very long history, in fact, the first NAG Library was distributed to the users even before I was born! NAG today is not confined to numerical software (NAG Libraries). The organisation has expanded to numerical and HPC services, compilers and more. However my job still consists mainly of work on the Libraries, developing new optimization solvers and maintaining the existing ones.

The best thing is the challenge -- you need get your hands dirty in various fields: from specialized solvers for derivative free optimization, through classical convex optimization to nonlinear semidefinite programming. I went through small pet projects (have you read my AMPL tutorial) through bespoke versions of our solvers tailored to special customers' data to a 50000 lines beast, a new semidefinite programming solver which will be released soon.

In addition, you mustn't forget to help the customers via our friendly support desk, write documentation and examples and keep an eye on the up-to-date algorithmic developments. I can still attend conferences as before, only the focus has changed from an academic to a more pragmatic approach. The relative freedom of the focus is almost limitless, as my ideas contribute to the development roadmap and thence to the future development. I won't lie to you. The time was not all green and great -- writing test programs or bug fixes can be tedious but they have to be done. At least it makes you look forward to the next new challenge which is waiting just around the corner.

If my story sounds familiar, you might be pleased to know that NAG is expanding our optimization team. So ladies and gentlemen, why are you waiting? Apply now and help us make the next big project happen.

Wednesday, 25 February 2015

Advanced Analytics on Apache Spark

Developed in AMPLab at UC Berkeley, Apache Spark has become an increasingly popular platform to perform large scale analysis on Big Data. With run-times up to 100x faster than MapReduce, Spark is well suited for machine learning applications.

Spark is written in Scala but has APIs for Java and Python. As the NAG Library is accessible from both Java and Python, this allows Spark users access to over 1600 high quality mathematical routines. The NAG Library covers areas such as:
  • Machine Learning including
    • Linear regression (with constraints)
    • Logistic regression (with constraints)
    • Principal Component Analysis (A good article relating Machine Learning and PCA can be found here)
    • Hierarchical cluster analysis
    • K-means
  • Statistics including
    • Summary information (mean, variance, etc)
    • Correlation
    • Probabilities and deviates for normal, student-t, chi-squared, beta, and many more distributions
    • Random number generation
    • Quantiles
  • Optimization including
    • Linear, nonlinear, quadratic, and sum of squares for the objective function
    • Constraints can be simple bounds, linear, or even nonlinear
  • Matrix functions
    • Inversion
    • Nearest correlation
    • Eigenvalues + eigenvectors

Calling the NAG Library on Spark

The fundamental datatype used in Spark is the Resilient Distributed Dataset (RDD). A RDD acts as a pointer to your distributed data on the filesystem. This object has intuitive methods (count, sample, filter, map/reduce, etc) and lazy evaluation that allow for fast and easy manipulation of distributed data.

Below is a simple Python example of using the NAG Library in Spark to calculate the cumulative Normal distribution function on a set of numbers (the message passing output from Spark has been omitted):

SparkContext available as sc
>>>  from nag4py.s import nag_cumul_normal
>>>  myNumbers = sc.parallelize( [-2.0, -1.0, 0.0, 1.0, 2.0] )
>>>  myNumbers.takeSample(False, 5, 0)
[ 0.0, -2.0, -1.0, 1.0, 2.0] 

>>>  myNumbers.map( nag_cumul_normal ).takeSample(False, 5, 0)
[0.5, 0.02275, 0.15865, 0.84134, .97725]


It should be noted that the vast majority of the algorithms employed in the NAG library require all relevant data to be held in memory. This may seem to deviate from the Spark ecosystem, however when working with large datasets, two usage scenarios are commonly seen:
  1. The full dataset is split into subsets, for example a dataset covering the whole world may be split by country, county and city and an independent analysis carried out on each subset. In such cases all the relevant data for a given analysis may be held on a single node and therefore can be processed directly by NAG library routines.
  2. A single analysis is required that utilizes the full dataset. In this case it is sometimes possible to reformulate the problem. For example many statistical techniques can be reformulated as a maximum likelihood optimization problem. The objective function of such an optimization (the likelihood) can then be evaluated using the standard Spark map/reduce functions and the results fed back to one of the robust optimization routines available in the NAG library.

For more information on using the NAG Library in Spark or any of the other topics touched upon in this article please contact NAG at support@nag.com.

Thursday, 22 January 2015

Adding a Slider Widget to Implied Volatility

In the last post on Implied Volatility, we downloaded real options data from the CBOE and calculated the volatility curves/surface. We saw the calculations of 30,000 implied volatilities in roughly 10 seconds. 

In this post we concentrate on the speed of calculating implied volatility via a variety of different methods. We look at the volatility curve/surface using Python's Scipy, the NAG Library for Python, and the NAG C Library. In addition, we've added a slider widget to the Python graphs from before to see the real-time effects of changing the interest and dividend rates (see the video below). All the code can be downloaded to produce the graphs, and a NAG license is not required for the case using scipy.optimize.fsolve.

The script and utility methods can be downloaded from here. The script begins by generating sample option prices. These are fed through different root finding methods (chosen by the user) to back out the implied volatilities. 

The methods tested include:
1) scipy (scipy.optimize.fsolve) - A wrapper around the hybrd and hybrj algorithms in the MINPACK Fortran library, followed by a Python Black-Scholes formula.
2) nag4py (The NAG Library for Python) - A wrapper (nag_zero_cont_funct_brent) around Brent's method for the root-finding, followed by nag_bsm_price for the Black-Scholes formula.
3) ctypes (The NAG C Library, Mark 23) - The same NAG functions as (2), but the looping and calculations are done directly in C, rather than through the nag4py wrapper layer. This requires building a shared library on your own machine, which is then loaded from the main script.

Running the Script

You can run the script using the following command, where a_method is one of {scipy, nag4py, ctypes}:
$ python implied_volatility.py --method a_method

Note that option (3) (ctypes) requires you to build a shared library. The build command for Windows, Linux and Mac can be found below. NAGCDIR should be the directory of your NAG C Library installation. The C code (nag_imp_vol.c) is included in the download.

Linux
    gcc -Wl,--no-undefined -fPIC -shared nag_imp_vol.c -INAGCDIR/include NAGCDIR/lib/libnagc_nag.so -o nag_imp_vol.so

Mac
    gcc -fPIC -shared nag_imp_vol.c -INAGCDIR/include NAGCDIR/lib/libnagc_nag.dylib -o nag_imp_vol.dylib

Windows
    cl /LD /DLL /MD -I"NAGCDIR\include" nag_imp_vol.c /link /LIBPATH:"NAGCDIR\lib" "NAGCDIR\lib\nagc_nag_MD.lib"

Running the script with --method ctypes produces the following result:



Timing the methods

We've added a timer around the first call when calculating the implied volatilities. You can also change n_strikes in implied_volatility.py to alter the total number of calculations. The base case uses 50 strikes, 5 expirations, and 2 option types (Call and Put) for a total of N = 50 * 5 * 2 = 500 implied volatilities.

The approximate times in seconds of each method are displayed below. (N is the total number of implied volatilities calculated.)






N scipy nag4py NAG C Library
500 1.40 0.26 0.008
1000 2.84 0.50 0.016
2000 5.74 1.00 0.020
4000 11.40 1.99 0.033
Some notes:
  • While scipy.fsolve looks considerable slower than nag4py, this is not the case. The differences in speed are a result of calling a pure python Black-Scholes formula vs. using nag4py's nag_bsm_price function.
  • The times for fsolve and nag4py scale somewhat linearly, while the NAG C shared library doesn't. I suspect this is due to the overhead of preparing the data into numpy arrays before calling the shared library.
  • If you encounter dependencies issues with Scipy or Matplotlib, we recommend switching to a Python distribution such as Anaconda or Canopy.
  • The code uses serial implementations in NAG and Python. You could look at a more advanced version that uses multi-threading or call a function that solves a system of equations (scipy.optimize.root or nag_zero_nonlin_eqns_easy).
  • The above analysis is a very oversimplified model of implied volatility. In practice, one should use a more complex model and look at other root-finding methods (such as the rational approximation).
  • You can also increase the number of calculations past 4000, but Python seems to have trouble plotting and updating the graphs.

Thursday, 2 October 2014

Problem needed for research on Bermudan Option Pricing Algorithms

Introduction

NAG together with Prof. Oosterlee and an MSc student from TU Delft are investigating the recent Stochastic Grid Bundling Method (SGBM) [1,2]. The objective is to compare the performance of SGBM to the well-known Longstaff-Schwartz (least squares method or LSM) in a non-academic setting, i.e. on the pricing of a Bermudan option, with underlying asset(s) driven by a realistic process such as Heston or LMM. We are looking for an interesting case to test these two methods. This includes the type of option, the underlying processes and any other important features or details.

Outline

The well known LSM by Longstaff-Schwartz[3] is the industry standard for pricing multi-dimensional Bermudan options by simulation and regression. LSM is based on the regression now principle, whereas the Stochastic Grid Bundling Method (SGBM) by Jain and Oosterlee applies regression later in order to get more accurate approximations. However, this limits us to apply SGBM to processes where an analytical or approximate expression of the discounted moments are available.

Another advantage of SGBM is its regression on bundles instead of the whole data set in a further attempt to decrease the regression error, and SGBM allows the computation of the Greeks at almost no extra cost.

Numerical results show that, compared to LSM, a higher accuracy can be obtained at comparable computational time. However, these tests were performed on academic problems, using geometric Brownian motion for the underlying assets.

Very recent research (Feng and Oosterlee, Sept 2014) extended SGBM for stochastic volatility and stochastic interest rate dynamics (Heston-Hull-White). This led to the possibility of comparing the performance of SGBM in a non-academic setting and is the cause for our research. We therefore wish to compare LSM and SGBM on a problem which is interesting and relevant to industry. Realistically, handling all the complexities of traded products will probably require more time than we have (around two months), so ideally we seek a problem which captures all the salient features and allowing us to see whether SGBM outperforms LSM.

Industry Involvement

We would like some advice in defining the product to be valued, especially regarding dimensionality, process, correlation structure, payoff and exercise features. Any additional industry involvement will be light: perhaps a few emails to clarify details, a conf call or face to face meeting.

For correspondence or further details please email support@nag.co.uk

References

[1] S. Jain and C. W. Oosterlee, "The Stochastic Grid Bundling Method: Efficient Pricing of Bermudan Options and their Greeks,'' papers, SSRN, Sept. 2013.
[2] S. Jain and C. W. Oosterlee, "Pricing high-dimensional Bermudan options using the stochastic grid method,'' International Journal of Computer Mathematics, 89(9):1186-1211, 2012.
[3] F. Longstaff and E. Schwartz,  "Valuing American Options by Simulation: a Simple Least-squares Approach,''  Review of Financial Studies, vol. 14, no. 1, pp. 113-147, 2001.