Last year I wrote a blog post
about NAG’s work on parallelising the computation of the matrix square root. More
recently, as part of our Matrix Functions Knowledge Transfer Partnership with
the University of Manchester, we’ve been investigating parallel implementations
of the Schur-Parlett algorithm [1].

Most algorithms for computing
functions of matrices are tailored for a specific function, such as the matrix
exponential or the matrix square root. The Schur-Parlett algorithm is much more
general; it works for any “well behaved” function (a term that can be given a precise mathematical meaning). For a function such as

using the Schur-Parlett
algorithm should be quicker than computing the individual components
separately. This algorithm made its first appearance in Mark 23 of the NAG C Library; our parallel improvements will appear in Mark 24 of the C Library next year.

At the heart of the algorithm lies
a Taylor series, which is truncated once it has converged to sufficient accuracy. For
example:
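For the matrix exponential, say, such a truncated series takes the form:

```latex
e^{A} \approx \sum_{k=0}^{m} \frac{A^{k}}{k!}
     = I + A + \frac{A^{2}}{2!} + \cdots + \frac{A^{m}}{m!}
```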

Before this happens, however, the
matrix A is reduced to upper triangular Schur form. This serves two
purposes: it reduces the number of arithmetic operations required to evaluate
the Taylor series, and it gives us the eigenvalues of A. The Taylor series
converges fastest when the eigenvalues lie close together. The clever bit
of the algorithm is to reorder the Schur form of A, grouping similar
eigenvalues together. The diagonal part of the matrix is then split into
sub-matrices of varying sizes according to this grouping. A Taylor
series is evaluated for each of these diagonal blocks, and the off-diagonal blocks
are then obtained by solving the ‘Parlett
recurrence’. This is easier to follow with the diagram
below:

Here we can see the original
matrix (in black), the triangular Schur form (in red), the blocking strategy
(in blue) and finally the computation of the off-diagonal blocks (in green). The
exact distribution of eigenvalues, and therefore the number and size of the diagonal
blocks, depends entirely on the input matrix.
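To make the two key steps concrete, here is a minimal Python sketch — not NAG's implementation — of (a) a simple eigenvalue-grouping pass and (b) the scalar (unblocked) Parlett recurrence applied to an upper triangular matrix. The function names and the tolerance `delta` are illustrative; the production algorithm uses a blocked recurrence precisely so that close eigenvalues, which this scalar version cannot handle, end up inside the same diagonal block.

```python
import numpy as np

def group_eigenvalues(eigs, delta=0.1):
    """Put indices i and j in the same group whenever |eigs[i] - eigs[j]|
    <= delta, transitively (a sketch of the clustering step; delta is an
    assumed tolerance). Returns groups of indices, sorted."""
    n = len(eigs)
    parent = list(range(n))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(eigs[i] - eigs[j]) <= delta:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def parlett_fun(T, f):
    """Compute f(T) for upper triangular T via the scalar Parlett
    recurrence. Requires the diagonal entries (eigenvalues) of T to be
    distinct; blocking exists to avoid dividing by tiny differences."""
    n = T.shape[0]
    F = np.zeros_like(T)
    for i in range(n):
        F[i, i] = f(T[i, i])          # f applied to each eigenvalue
    for p in range(1, n):             # sweep superdiagonals outwards
        for i in range(n - p):
            j = i + p
            s = T[i, j] * (F[i, i] - F[j, j])
            s += sum(F[i, k] * T[k, j] - T[i, k] * F[k, j]
                     for k in range(i + 1, j))
            F[i, j] = s / (T[i, i] - T[j, j])
    return F
```

One can sanity-check `parlett_fun(T, np.exp)` against a truncated Taylor series for the matrix exponential evaluated directly on T; for a triangular matrix with well-separated eigenvalues the two agree to machine-level accuracy.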