Showing posts from February, 2015

Advanced Analytics on Apache Spark

Developed at the AMPLab at UC Berkeley, Apache Spark has become an increasingly popular platform for large-scale analysis of Big Data. With reported run times up to 100x faster than Hadoop MapReduce for in-memory workloads, Spark is well suited to machine learning applications.

Spark is written in Scala but also has APIs for Java and Python. Because the NAG Library is accessible from both Java and Python, Spark users can call over 1600 high-quality mathematical routines. The NAG Library covers areas such as:
- Machine Learning including
  - Linear regression (with constraints)
  - Logistic regression (with constraints)
  - Principal Component Analysis (A good article relating Machine Learning and PCA can be found here)
  - Hierarchical cluster analysis
  - K-means
- Statistics including
  - Summary information (mean, variance, etc)
  - Correlation
  - Probabilities and deviates for normal, student-t, chi-squared, beta, and many more distributions
  - Random number generation
  - Quantiles
- Optimization including
  - Linear, nonlinear, quadratic, and sum of squares for the object…