Showing posts from July, 2015

NAG Linear Regression on Apache Spark

This is a brief summary of a talk I gave recently at the Chicago Apache Spark Users Group Meetup. In the talk, I presented many of the problems and successes encountered when using the NAG Library distributed across Apache Spark worker nodes. The slides are available here.

The Linear Regression Problem
In this post we test the scalability and performance of the NAG Library for Java when solving a large-scale multiple linear regression problem on Spark. The example data ranges from 2 gigabytes up to 64 gigabytes in the form of

We solve this problem using the normal equations. This method lets us map the sum-of-squares matrix computation across the worker nodes, so that each worker builds a matrix from its own portion of the data. Spark's reduce phase then combines these matrices two at a time until a single matrix remains. In the final step, a NAG linear regression routine is called on the master node to calculate the regression coefficients. All of this happens in one pass over th…
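To make the map/reduce structure of the normal equations concrete, here is a minimal sketch in plain Java. It is not the code from the talk and it makes several assumptions: Java streams stand in for a Spark RDD, each "partition" is a small non-empty in-memory array whose rows hold the predictors followed by the response, the partial matrices are raw (uncentered) sums that can simply be added, and a basic Gaussian elimination stands in for the NAG regression routine. The class and method names (`NormalEquationsSketch`, `partialSums`, `add`, `solve`, `fit`) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class NormalEquationsSketch {

    // Map step (assumed form): turn one non-empty partition into the
    // augmented matrix [X^T X | X^T y]. Each input row is (x_1, ..., x_k, y);
    // a leading 1 is prepended to x so the model includes an intercept.
    static double[][] partialSums(double[][] rows) {
        int p = rows[0].length;               // k predictors + intercept
        double[][] m = new double[p][p + 1];
        for (double[] row : rows) {
            double[] x = new double[p];
            x[0] = 1.0;
            System.arraycopy(row, 0, x, 1, p - 1);
            double y = row[p - 1];
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    m[i][j] += x[i] * x[j];   // entry of X^T X
                }
                m[i][p] += x[i] * y;          // entry of X^T y
            }
        }
        return m;
    }

    // Reduce step: combine two partial matrices by element-wise addition.
    static double[][] add(double[][] a, double[][] b) {
        double[][] s = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                s[i][j] = a[i][j] + b[i][j];
            }
        }
        return s;
    }

    // Final step (stand-in for the NAG routine): solve (X^T X) beta = X^T y
    // by Gaussian elimination with partial pivoting on the augmented matrix.
    static double[] solve(double[][] m) {
        int n = m.length;
        double[][] a = new double[n][];
        for (int i = 0; i < n; i++) {
            a[i] = m[i].clone();
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= n; c++) {
                    a[r][c] -= f * a[col][c];
                }
            }
        }
        double[] beta = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double s = a[i][n];
            for (int j = i + 1; j < n; j++) {
                s -= a[i][j] * beta[j];
            }
            beta[i] = s / a[i][i];
        }
        return beta;
    }

    // One pass over the data: map each partition, reduce pairwise, then solve.
    static double[] fit(List<double[][]> partitions) {
        double[][] total = partitions.stream()
                .map(NormalEquationsSketch::partialSums)
                .reduce(NormalEquationsSketch::add)
                .orElseThrow(() -> new IllegalArgumentException("no data"));
        return solve(total);
    }

    public static void main(String[] args) {
        // Toy data generated from y = 1 + 2*x1 + 3*x2, split into two partitions.
        List<double[][]> partitions = Arrays.asList(
                new double[][] {{1, 0, 3}, {0, 1, 4}},
                new double[][] {{1, 1, 6}, {2, 1, 8}});
        System.out.println(Arrays.toString(fit(partitions)));
    }
}
```

On the toy data the fitted coefficients should come out close to 1, 2 and 3, up to floating-point rounding. The point of the design is that the matrix passed to the final solve has a size that depends only on the number of predictors, not on the number of rows, which is why only small matrices need to travel between nodes. Solving the normal equations directly can be numerically fragile for ill-conditioned data, so a library routine is the better choice in practice.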