Routines Tested

The EC2 instance used was a High-CPU c1.xlarge instance (7 GB of memory and 8 virtual cores). The SMP Library contains 204 tuned routines and a total of 337 enhanced routines for use on multicore machines, from which I tested f11me (sparse matrix factorization) and g02bn (Kendall/Spearman rank correlation coefficients). While I had a maximum of 8 cores available, I decided to increase the number of threads beyond this, just to see the result. Below you will see how the time taken scales as the number of threads increases:
Both routines scale well as you increase the number of threads, but f11me takes longer with 12 threads than with 8! I suspect the slowdown is a result of dependencies between threads: some threads may have to wait for other parts of the matrix factorization to finish before they can start. The g02bn routine, on the other hand, doesn't require communication between threads, so each one can run independently. This routine benefited slightly from running on 12 threads rather than 8.
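For anyone who wants to repeat this kind of experiment, here is a minimal sketch of how the scaling numbers could be collected and summarised. It is not the script I used. It assumes the benchmark is a standalone executable and that the thread count is controlled through the standard OpenMP `OMP_NUM_THREADS` environment variable. The names `time_command` and `scaling_table` are made up for illustration, and the command in the example is a placeholder rather than a real NAG benchmark.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, threads):
    """Run cmd once with OMP_NUM_THREADS=threads; return wall-clock seconds."""
    # Assumption: the program under test reads OMP_NUM_THREADS at start-up.
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


def scaling_table(timings):
    """Map thread count -> (speedup, efficiency) relative to the 1-thread run.

    speedup    = T(1) / T(n)
    efficiency = speedup / n
    """
    base = timings[1]
    return {n: (base / t, base / t / n) for n, t in sorted(timings.items())}


if __name__ == "__main__":
    # Placeholder command: swap in the executable that calls f11me or g02bn.
    cmd = [sys.executable, "-c", "pass"]
    timings = {n: time_command(cmd, n) for n in (1, 2, 4, 8, 12)}
    for n, (speedup, eff) in scaling_table(timings).items():
        print(f"{n:2d} threads: speedup {speedup:5.2f}, efficiency {eff:5.2f}")
```

A speedup that drops between 8 and 12 threads would match what I saw with f11me. Note that efficiency here is divided by the thread count, not the core count, so it will always look poor once there are more threads than the 8 available cores.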