Thursday, 21 August 2014

Gaussian Mixture Model

With the release of Mark 24 of the NAG C Library comes a plethora of new functionality, including matrix functions, pricing of Heston options with term structure, best subset selection, and element-wise weightings for the nearest correlation matrix.

Among the new routines, one I was excited to test was the Gaussian mixture model (g03ga). This routine takes a set of data points and fits a mixture of Gaussians for a given (co)variance structure by maximizing the log-likelihood function. The user inputs the (co)variance structure, the number of groups, and (optionally) the initial membership probabilities.
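To make the idea of fitting by maximizing the log-likelihood concrete, here is a minimal sketch of the standard EM iteration for a full-covariance Gaussian mixture, written in plain NumPy. It is not the g03ga interface or its implementation; the names `log_gaussian` and `fit_gmm`, the small `reg` term added to each covariance, and the stopping rule are all assumptions made for illustration. Like the NAG routine, it starts from a matrix of initial membership probabilities.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    L = np.linalg.cholesky(cov)                 # cov = L @ L.T
    z = np.linalg.solve(L, (X - mean).T)        # whitened residuals, shape (d, n)
    maha2 = np.sum(z ** 2, axis=0)              # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha2)

def fit_gmm(X, P, max_iter=200, tol=1e-8, reg=1e-6):
    """EM for a Gaussian mixture (hypothetical helper, not a NAG call).

    X: (n, d) data. P: (n, k) initial membership probabilities, rows sum to 1.
    Returns weights, means, covariances, final memberships, log-likelihood.
    """
    n, d = X.shape
    k = P.shape[1]
    prev = -np.inf
    for _ in range(max_iter):
        # M-step: update parameters from the current memberships.
        nk = P.sum(axis=0) + 1e-12              # effective group sizes
        weights = nk / nk.sum()
        means = (P.T @ X) / nk[:, None]
        covs = np.empty((k, d, d))
        for j in range(k):
            R = X - means[j]
            covs[j] = (P[:, j, None] * R).T @ R / nk[j] + reg * np.eye(d)
        # E-step: recompute memberships and the log-likelihood.
        log_p = np.column_stack([
            np.log(weights[j]) + log_gaussian(X, means[j], covs[j])
            for j in range(k)
        ])
        m = log_p.max(axis=1, keepdims=True)    # log-sum-exp for stability
        log_norm = m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))
        loglik = float(log_norm.sum())
        P = np.exp(log_p - log_norm)
        if abs(loglik - prev) < tol:
            break
        prev = loglik
    return weights, means, covs, P, loglik
```

Because EM only climbs to the nearest stationary point of the log-likelihood, the result depends on the initial `P` that is passed in.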

I decided to test out this new functionality, which is also in Mark 24 of the NAG Toolbox for MATLAB. Often I will use MATLAB with the NAG Toolbox before switching to C++ and the NAG C Library for my production code. So I generated some data and tried the routine to see if it could find the covariance structure. You can download the script and try it out for yourself here. The example will generate the test data, run the NAG Gaussian mixture routine and plot the results. An example of the output is given below:


The blue points are the generated data, while the red and yellow ovals show the covariance structure output from the NAG Gaussian mixture model (the ovals are contours of ~0.60 density for their respective groups).
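For readers who want to draw similar ovals from a fitted mean and covariance, the sketch below generates points on a constant-density contour. It assumes one particular reading of the "~0.60" figure: that the contour sits where the density is about 0.60 of its peak value, which is the contour at Mahalanobis distance 1, since exp(-1/2) ≈ 0.607. That reading is an assumption, and the helper name `covariance_ellipse` is hypothetical; the plotting code in the original script may differ.

```python
import numpy as np

def covariance_ellipse(mean, cov, r=1.0, n_points=100):
    """Points on the contour at Mahalanobis distance r from mean (2-D case)."""
    L = np.linalg.cholesky(np.asarray(cov, dtype=float))   # cov = L @ L.T
    t = np.linspace(0.0, 2.0 * np.pi, n_points)
    circle = np.vstack([np.cos(t), np.sin(t)])             # unit circle, (2, n_points)
    # Mapping the unit circle through L gives the r = 1 contour of N(mean, cov).
    return np.asarray(mean, dtype=float) + r * (L @ circle).T
```

The returned array has one (x, y) point per row, so it can be passed directly to any 2-D line plotting function.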

While running the example a couple of times and re-sampling the starting values for the initial membership probabilities, I noticed what I thought was unusual behavior for the routine. Namely, the Gaussian mixture model algorithm isn't always able to identify the Gaussian mixtures. The function would occasionally converge to the structure below (run the above script and click 'Resample' 3 times):


It appears the routine has converged to a local extremum of the likelihood function (here, a local maximum rather than the global one). This happens as a result of randomizing the initial membership probabilities.

Since we have the power of the NAG Library at our disposal, I've added a K-means clustering option to the above script. It assigns each data point to a cluster, and that allocation is used to initialize the membership probabilities passed to the Gaussian mixture model.
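The idea of seeding the membership probabilities from a clustering can be sketched as follows. This is a plain NumPy version of Lloyd's k-means, not the NAG K-means routine used in the script, and the names `kmeans_labels` and `labels_to_membership` are hypothetical. Each point receives probability 1 for the cluster it was assigned to and 0 elsewhere, which is one simple way (assumed here) to turn a hard allocation into a membership matrix.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: return a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared distance from every point to every centre, shape (n, k).
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                # allocation is stable
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:                 # keep the old centre if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def labels_to_membership(labels, k):
    """One-hot (n, k) matrix of initial membership probabilities."""
    P = np.zeros((len(labels), k))
    P[np.arange(len(labels)), labels] = 1.0
    return P
```

The random choice of starting centres means this step has its own dependence on initial values, so fixing or varying `seed` changes the allocation it produces.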


My colleagues tell me that k-means can also get stuck in a local minimum and exhibit this 'wrong' behavior as well, so one should always be careful with initial allocations - luckily the NAG Library provides a generally acceptable default allocation as an option! Many thanks to Martyn Byng and Stephen Langdell for comments on this post.