### Matrix Functions in Parallel

Last year I wrote a blog post
about NAG’s work on parallelising the computation of the matrix square root. More
recently, as part of our Matrix Functions Knowledge Transfer Partnership with
the University of Manchester, we’ve been investigating parallel implementations
of the Schur-Parlett algorithm [1].

Most algorithms for computing
functions of matrices are tailored for a specific function, such as the matrix
exponential or the matrix square root. The Schur-Parlett algorithm is much more
general; it will work for any “well behaved” function (this general term can be
given a more mathematically precise meaning). For a function such as

using the Schur-Parlett
algorithm should be quicker than computing the individual components
separately. This algorithm made its
first appearance in Mark 23 of the NAG C Library. Our parallel improvements will
be found in Mark 24 of the C Library, next year.

At the heart of the algorithm lies
a Taylor series (which is truncated when it has converged sufficiently). For
example:

Before this however, the
matrix A is converted into upper triangular Schur form. This serves two
purposes. It reduces the number of arithmetic operations required to evaluate
the Taylor series. It also gives us the eigenvalues of A. The Taylor series
will converge fastest when the eigenvalues lie close together. The clever bit
of the algorithm is to rearrange the Schur form of A, grouping similar
eigenvalues together. The diagonal part of the matrix is then split into
multiple different sized sub-matrices according to this grouping. A Taylor
series is evaluated for each of these diagonal blocks. The off-diagonal blocks
are then computed by solving the ‘Parlett
recurrence’. This is all very confusing and is best explained by the diagram
below:

Here we can see the original
matrix (in black), the triangular Schur form (in red), the blocking strategy
(in blue) and finally the computation of the off-diagonal blocks (in green). The
exact distribution of eigenvalues and therefore the number and size of diagonal
blocks is totally dependent on the input matrix.

So, what does this mean for
parallelism? Clearly, all BLAS calls e.g. matrix multiplications can be done in
parallel. But on a higher level, the diagonal blocks can be computed in parallel and
then the Parlett recurrence can also be solved, a block superdiagonal at a time, in parallel.

To investigate this higher
level parallelism, I constructed various upper triangular 1000 by 1000 test
matrices, choosing their eigenvalue distributions carefully. Here is a very brief selection of
results from a machine with 8 available threads:

- 1 block of size 1000: no speed up available
- 20 blocks of size 50: 5x speed up on 8 threads compared to 1 thread
- 1000 blocks of size 1: 5x speed up on 8 threads compared to 1 thread

These results are not
surprising. For a matrix with only one diagonal block (all eigenvalues close
together), nothing can be done in parallel. For matrices with more separated
eigenvalues (and therefore multiple diagonal blocks) the evaluation of the
diagonal blocks and subsequently the Parlett recurrence (which typically accounts for well over 90% of the run time) and can be done in parallel.

In the future, nested
parallelism is likely to become more and more important, and this is something
which still needs investigating. Would there come a point whereby it is useful
to turn off the threaded BLAS and divert all resources to higher level
parallelism, and vice versa? We don’t know yet.

I concluded my last post with
the message that the best algorithm for serial architectures may not be the
best algorithm for parallel architectures. I think the general point to take
from this post is that the performance gain you get from parallelism can be
highly dependent on your input data.

[1] P. I. Davies and N. Higham. A Schur-Parlett algorithm for computing matrix functions. SIAM J. Matrix Anal. Appl., 25:464-485, September 2003.

Can you make any comments on the work efficiency?

ReplyDeleteNice post! Do you have any insight as to how this might scale for dozens of cores? I'm thinking of Xeon Phi for instance.

ReplyDeleteWe've not tried it on the Xeon Phi yet but hope to soon. I think the scaling will be heavily dependent on the input data i.e. how many groups of eigenvalues there are. It will be interesting to find out for sure. Cheers.

ReplyDelete