Matrix Functions in Parallel

Last year I wrote a blog post about NAG’s work on parallelising the computation of the matrix square root. More recently, as part of our Matrix Functions Knowledge Transfer Partnership with the University of Manchester, we’ve been investigating parallel implementations of the Schur-Parlett algorithm [1].
Most algorithms for computing functions of matrices are tailored for a specific function, such as the matrix exponential or the matrix square root. The Schur-Parlett algorithm is much more general; it will work for any “well behaved” function (this general term can be given a more mathematically precise meaning). For a function such as
using the Schur-Parlett algorithm should be quicker than computing the individual components separately.  This algorithm made its first appearance in Mark 23 of the NAG C Library. Our parallel improvements will be found in Mark 24 of the C Library, next year.
At the heart of the algorithm lies a Taylor series (which is truncated when it has converged sufficiently). For example:
Before this however, the matrix A is converted into upper triangular Schur form. This serves two purposes. It reduces the number of arithmetic operations required to evaluate the Taylor series. It also gives us the eigenvalues of A. The Taylor series will converge fastest when the eigenvalues lie close together. The clever bit of the algorithm is to rearrange the Schur form of A, grouping similar eigenvalues together. The diagonal part of the matrix is then split into multiple different sized sub-matrices according to this grouping. A Taylor series is evaluated for each of these diagonal blocks. The off-diagonal blocks are then computed by solving the ‘Parlett recurrence’. This is all very confusing and is best explained by the diagram below:
Here we can see the original matrix (in black), the triangular Schur form (in red), the blocking strategy (in blue) and finally the computation of the off-diagonal blocks (in green). The exact distribution of eigenvalues and therefore the number and size of diagonal blocks is totally dependent on the input matrix.

So, what does this mean for parallelism? Clearly, all BLAS calls e.g. matrix multiplications can be done in parallel. But on a higher level, the diagonal blocks can be computed in parallel and then the Parlett recurrence can also be solved, a block superdiagonal at a time, in parallel.
To investigate this higher level parallelism, I constructed various upper triangular 1000 by 1000 test matrices, choosing their eigenvalue distributions carefully. Here is a very brief selection of results from a machine with 8 available threads:
  • 1 block of size 1000: no speed up available
  • 20 blocks of size 50: 5x speed up on 8 threads compared to 1 thread
  • 1000 blocks of size 1: 5x speed up on 8 threads compared to 1 thread
These results are not surprising. For a matrix with only one diagonal block (all eigenvalues close together), nothing can be done in parallel. For matrices with more separated eigenvalues (and therefore multiple diagonal blocks) the evaluation of the diagonal blocks and subsequently the Parlett recurrence (which typically accounts for well over 90% of the run time) and can be done in parallel.
In the future, nested parallelism is likely to become more and more important, and this is something which still needs investigating. Would there come a point whereby it is useful to turn off the threaded BLAS and divert all resources to higher level parallelism, and vice versa? We don’t know yet.
I concluded my last post with the message that the best algorithm for serial architectures may not be the best algorithm for parallel architectures. I think the general point to take from this post is that the performance gain you get from parallelism can be highly dependent on your input data.
[1] P. I. Davies and N. Higham. A Schur-Parlett algorithm for computing matrix functions. SIAM J. Matrix Anal. Appl., 25:464-485, September 2003.


  1. Can you make any comments on the work efficiency?

  2. Nice post! Do you have any insight as to how this might scale for dozens of cores? I'm thinking of Xeon Phi for instance.

  3. We've not tried it on the Xeon Phi yet but hope to soon. I think the scaling will be heavily dependent on the input data i.e. how many groups of eigenvalues there are. It will be interesting to find out for sure. Cheers.


Post a Comment

NAG moderates all replies and reserves the right to not publish posts that are deemed inappropriate.

Popular posts from this blog

Implied Volatility using Python's Pandas Library

C++ wrappers for the NAG C Library

ParaView, VTK files and endianness