The Linear Regression Problem
label | x1 | x2 | x3 | x4 | |
68050.9 | 42.3 | 12.1 | 2.32 | 1 | |
87565 | 47.3 | 19.5 | 7.19 | 2 | |
65151.5 | 47.3 | 7.4 | 1.68 | 0 | |
78564.1 | 53.2 | 11.4 | 1.74 | 1 | |
56556.7 | 34.9 | 10.7 | 6.84 | 0 |
We solve this problem using the normal equations. This method allows us to map the sum-of-squares matrix computation across worker nodes. The reduce phase of Spark aggregates two of these matrices together. In the final step, a NAG linear regression routine is called on the master node to calculate the regression coefficients. All of this happens in one pass over the data - no iterative methods needed!
Example output from the linear regression is the following:
Example output from the linear regression is the following:
Time taken for NAG analysis: 236.964 seconds
Number of Independent Variables: 5
Total Number of Points: 334800000
R-Squared = 0.699
Var | Coef | SE | t-value |
Intcp | 12723.3 | 2.2 | 5783.3 |
0 | 989.7 | 0.02 | 47392.5 |
1 | 503.4 | 0.04 | 11866.5 |
2 | 491.1 | 0.1 | 4911.9 |
3 | 7859.1 | 0.9 | 8732.3 |
*************************************************
Predicting 4 points
Prediction: 57634.9 Actual: 60293.6
Prediction: 32746.6 Actual: 35155.5
Prediction: 49917.5 Actual: 52085.3
Prediction: 82413.2 Actual: 82900.3
Timings
Using the above method, the NAG linear regression algorithm is able to compute the exact values for the regression coefficients, t-values, and errors. Below is a log-log plot of the Runtimes vs. Size of Input data for the NAG routines on an 8 slave Amazon EC2 cluster.
Not only is it important the NAG routines run fast, but that they scale efficiently as you increase the number of worker nodes. Let's look at how the 16 GB data set scales as we vary the number of slaves.
Comparison To MLlib
We tried running the MLlib Linear Regression algorithm on the same data, but were unable to get meaningful results. The MLlib algorithm using the stochastic gradient descent to find the optimal coefficients, but the last stochastic steps always seemed to return NaNs (we would be happy to share sample data ...let us know if you can solve the problem!).
More Information
For more information including examples of the NAG Library on Spark, contact us at support(at)nag.com