Wednesday, 25 February 2015

Advanced Analytics on Apache Spark

Developed in the AMPLab at UC Berkeley, Apache Spark has become an increasingly popular platform for performing large-scale analysis of Big Data. With run times up to 100x faster than MapReduce, Spark is well suited to machine learning applications.

Spark is written in Scala but provides APIs for Java and Python. Because the NAG Library is accessible from both Java and Python, Spark users have access to over 1,600 high-quality mathematical routines. The NAG Library covers areas such as:
  • Machine Learning including
    • Linear regression (with constraints)
    • Logistic regression (with constraints)
    • Principal Component Analysis (PCA)
    • Hierarchical cluster analysis
    • K-means
  • Statistics including
    • Summary information (mean, variance, etc)
    • Correlation
    • Probabilities and deviates for the Normal, Student's t, chi-squared, beta, and many other distributions
    • Random number generation
    • Quantiles
  • Optimization including
    • Linear, quadratic, nonlinear, and sum-of-squares objective functions
    • Simple bound, linear, and nonlinear constraints
  • Matrix functions
    • Inversion
    • Nearest correlation
    • Eigenvalues and eigenvectors

Calling the NAG Library on Spark

The fundamental datatype used in Spark is the Resilient Distributed Dataset (RDD). An RDD acts as a pointer to your distributed data on the filesystem. This object provides intuitive methods (count, sample, filter, map/reduce, etc.) and is evaluated lazily, which allows for fast and easy manipulation of distributed data.
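
As a quick illustration of this behaviour (a toy example of ours, run in the pyspark shell where the SparkContext is available as sc), the transformations below are built up lazily and nothing is computed until an action such as count or collect is called:

>>> numbers = sc.parallelize(range(10))           # create an RDD from a local collection
>>> evens = numbers.filter(lambda x: x % 2 == 0)  # a transformation: lazy, nothing runs yet
>>> squares = evens.map(lambda x: x * x)          # still lazy
>>> squares.count()                               # an action: triggers the computation
5
>>> squares.collect()
[0, 4, 16, 36, 64]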

Below is a simple Python example of using the NAG Library in Spark to calculate the cumulative Normal distribution function on a set of numbers (Spark's log messages have been omitted):

SparkContext available as sc
>>> from nag4py.s import nag_cumul_normal
>>> myNumbers = sc.parallelize([-2.0, -1.0, 0.0, 1.0, 2.0])
>>> myNumbers.takeSample(False, 5, 0)
[0.0, -2.0, -1.0, 1.0, 2.0]

>>> myNumbers.map(nag_cumul_normal).takeSample(False, 5, 0)
[0.5, 0.02275, 0.15865, 0.84134, 0.97725]

It should be noted that the vast majority of the algorithms employed in the NAG Library require all relevant data to be held in memory. This may seem at odds with the Spark ecosystem; however, when working with large datasets, two usage scenarios are commonly seen:
  1. The full dataset is split into subsets; for example, a dataset covering the whole world may be split by country, county, and city, with an independent analysis carried out on each subset. In such cases all the relevant data for a given analysis can be held on a single node and therefore processed directly by NAG Library routines (a sketch of this scenario follows the list).
  2. A single analysis is required that uses the full dataset. In this case it is sometimes possible to reformulate the problem; for example, many statistical techniques can be recast as a maximum likelihood optimization. The objective function of such an optimization (the likelihood) can then be evaluated using the standard Spark map/reduce functions and the result fed back to one of the robust optimization routines available in the NAG Library (also sketched below).
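
To illustrate the first scenario, here is a minimal sketch (our own toy example, not NAG code) that uses groupByKey so that each group's data is gathered onto a single worker, then applies an ordinary in-memory analysis to each group with mapValues. In practice the per-group function would call a NAG Library routine; NumPy statistics are used here as a stand-in:

import numpy as np

# sc is the SparkContext, available automatically in the pyspark shell.
# (country, value) pairs; in practice these would be read from the filesystem.
records = sc.parallelize([("UK", 1.2), ("UK", 3.4), ("US", 2.2), ("US", 0.7), ("US", 1.9)])

def analyse(values):
    """Per-group analysis, run on a single node once the group's data is local.
    A NAG Library routine would be called here; NumPy is used as a stand-in."""
    data = np.fromiter(values, dtype=float)
    return float(data.mean()), float(data.std(ddof=1))

results = records.groupByKey().mapValues(analyse).collect()
print(results)  # per-country (mean, standard deviation) pairs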
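
The second scenario can be sketched in a similar way. The toy example below (again ours; the data and starting values are made up) fits the mean and standard deviation of a normally distributed sample by minimizing a negative log-likelihood whose sum over the data is evaluated with Spark's map/reduce. The resulting Python callable is exactly what would be handed to a NAG optimization routine; SciPy's minimize is used here purely as a stand-in:

import numpy as np
from scipy.optimize import minimize

# sc is the SparkContext, available automatically in the pyspark shell.
data = sc.parallelize(np.random.normal(loc=3.0, scale=2.0, size=100000)).cache()
n = data.count()

def neg_log_likelihood(params):
    """Negative log-likelihood of a Normal(mu, sigma) model, up to an additive constant.
    The sum over the distributed data is computed with Spark's map/reduce."""
    mu, sigma = params
    if sigma <= 0.0:
        return np.inf
    sum_sq = data.map(lambda x: (x - mu) ** 2).reduce(lambda a, b: a + b)
    return n * np.log(sigma) + sum_sq / (2.0 * sigma ** 2)

# Hand the Spark-backed objective to an optimizer (a NAG routine in practice).
result = minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead")
print(result.x)  # should recover approximately [3.0, 2.0]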

For more information on using the NAG Library in Spark, or any of the other topics touched upon in this article, please contact NAG.