Monday 30 June 2014

Week of 6/23 Update

This week I implemented fast polynomial interpolation and evaluation. As I mentioned last week, the plan is to compute $det(A)$ for a polynomial matrix $A$ by evaluating $det(A(p))$ at a bunch of points $p$ and interpolating the polynomial that runs through those values. If the polynomial we want to recover has degree $d$ then we need $d+1$ points. The naive algorithm for evaluating a polynomial takes $O(d)$ time per point, for a total of $O(d^{2})$ field operations. Since the polynomials from Block-Wiedemann have degree of order $\frac{n}{b}$ (here $n$ is the matrix dimension and very big, $b$ is the blocking factor and fairly small) we definitely needed a faster method. Although we can't do an FFT per se, since our fields generally lack roots of unity of the required order, there is a similar algorithm (Modern Computer Algebra, algorithms 10.5 and 10.11) that works for any choice of points. The new PolyInterpolation class can transform a polynomial from the coefficient domain to the point domain and back in $O(d\log^{2}{d})$ time. So far it seems to work pretty well, but it still needs a little polishing up.
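
To give a feel for how the fast evaluation half works, here is a minimal Python sketch of the subproduct-tree idea behind algorithm 10.5. This is illustrative only, not the PolyInterpolation code; the polynomial products and remainders below are the quadratic schoolbook versions, so only the structure of the fast method is shown, and fast multiplication would be needed to actually reach the $O(d\log^{2}{d})$ bound.

    # Polynomials over GF(p) as coefficient lists, lowest degree first.
    def poly_mul(f, g, p):
        """Schoolbook product of f and g with coefficients reduced mod p."""
        out = [0] * (len(f) + len(g) - 1)
        for i, a in enumerate(f):
            for j, b in enumerate(g):
                out[i + j] = (out[i + j] + a * b) % p
        return out

    def poly_rem(f, g, p):
        """Remainder of f modulo g (deg g >= 1, leading coefficient of g invertible mod p)."""
        f = list(f)
        inv = pow(g[-1], p - 2, p)
        while len(f) >= len(g):
            c = (f[-1] * inv) % p
            shift = len(f) - len(g)
            for i, b in enumerate(g):
                f[shift + i] = (f[shift + i] - c * b) % p
            while len(f) > 1 and f[-1] == 0:
                f.pop()
        return f

    def subproduct_tree(points, p):
        """Level 0 holds the factors (x - u_i); each level above holds products of adjacent pairs."""
        tree = [[[(-u) % p, 1] for u in points]]
        while len(tree[-1]) > 1:
            prev, nxt = tree[-1], []
            for i in range(0, len(prev), 2):
                nxt.append(poly_mul(prev[i], prev[i + 1], p) if i + 1 < len(prev) else prev[i])
            tree.append(nxt)
        return tree

    def multipoint_eval(f, points, p):
        """Evaluate f at every point by pushing remainders down the subproduct tree."""
        tree = subproduct_tree(points, p)
        rems = [poly_rem(f, tree[-1][0], p)]
        for level in reversed(tree[:-1]):
            rems = [poly_rem(rems[i // 2], node, p) for i, node in enumerate(level)]
        return [r[0] % p for r in rems]

    # Quick check against naive Horner evaluation over GF(101).
    p, f, pts = 101, [3, 0, 2, 5], [1, 2, 3, 4]          # f = 3 + 2x^2 + 5x^3
    assert multipoint_eval(f, pts, p) == [sum(c * pow(u, i, p) for i, c in enumerate(f)) % p for u in pts]

The interpolation direction (algorithm 10.11) reuses the same tree, combining the point values back up instead of pushing remainders down.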

I also refactored the code I'd been using to compute invariant factors and wrapped it up in its own class called CoppersmithInvariantFactors. It was getting quite complicated and this should make it significantly easier to use for other applications. I also fixed some old Linbox bugs and cleaned up a matrix representation I'd been using for testing called sparse-map-map-matrix. It's now a fully fledged SparseMatrix type.

The plan now is to automatically extend the base field as needed when doing polynomial interpolation/evaluation, so that there are enough distinct evaluation points to recover the original polynomial. Once I've got that working I'll pull it all together into the determinant class and I'll be able to use that for the SNF computation. My original plan was to write some kind of wrapper for the polynomial ring GivaroPoly, but now I think I might be able to use a different SNF class (SmithFormIliopoulos) that already supports doing computations modulo $det(A)$.
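
The extension step itself is easy to state. As a tiny hypothetical sketch (not Linbox code), the rule is just to pick the smallest extension degree $k$ so that $GF(q^{k})$ has at least $d+1$ elements available as evaluation points for a degree-$d$ polynomial:

    # Hypothetical helper: the smallest extension degree k such that GF(q^k)
    # has at least d + 1 distinct elements to use as evaluation points.
    def extension_degree(q, d):
        k = 1
        while q ** k < d + 1:
            k += 1
        return k

    assert extension_degree(3, 100) == 5    # 3**5 = 243 >= 101 points available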

Wednesday 25 June 2014

Python Implementation of The Block-Wiedemann Algorithm

I've created a repository for my Python implementation of the Block-Wiedemann and scalar Wiedemann algorithms. It's available here. I must warn anyone interested in the code that it was never intended for production use and is both buggy and horrendously slow. Nevertheless, it has the advantage of being much simpler than the Linbox implementation, and may be useful to anyone trying to understand how the algorithm works.

Monday 23 June 2014

Week of 6/16 Update

This week I worked on improving parallel performance and on computing the Smith Normal Form (SNF) of matrix min-polys. I've finally got the entire pipeline of code up and running. Given a matrix supplied in Matrix Market Format, I have a single program that will read it, compute the matrix min-poly in parallel, change the representation of the min-poly to a matrix with entries over a polynomial ring, compute the SNF of that matrix, and return the main diagonal. This particular program is in the Linbox repository under benchmarks/invariant-factors-benchmark. So far the results look good, with the last entry being the min-poly and the others looking like the correct invariant factors.
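
As a quick illustration of what that diagonal means (a hand-worked toy example, not output from the benchmark): over a polynomial ring the Smith form collects everything into divisibility-ordered invariant factors, so over a field of odd characteristic

$$\mathrm{SNF}\begin{pmatrix}x(x-1) & 0\\ 0 & x(x+1)\end{pmatrix}=\begin{pmatrix}x & 0\\ 0 & x(x-1)(x+1)\end{pmatrix},$$

with the last diagonal entry being the largest invariant factor (the lcm of the two) and every earlier entry dividing it.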

While this is a big step forward there's still a lot that needs to be done. In particular the algorithm I'm using for computing the SNF (the Kannan-Bachem algorithm) works great for matrices over the integers, but is horrendously slow over polynomial rings. I'm 90% sure that the problem is that it's generating intermediate values that are polynomials of enormous degree, and spending way too much time manipulating those. Fortunately, there's a way around this blow-up in polynomial degree, and that's to use the fact that all the computations of the SNF of $A$ can be done modulo $det(A)$. Unfortunately Linbox doesn't have a good way of finding the determinant of a polynomial matrix right now, so that's my next step. Once I've got $det(A)$, I'll either extend the Kannan-Bachem class or just write a wrapper for the polynomial ring class that does everything modulo a given polynomial.
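
Concretely, the degree-capping step I have in mind looks something like the following (hypothetical Python, not the actual Kannan-Bachem class; poly_mod is a stand-in for polynomial remainder over the ground field): after each elimination step every entry of the work matrix gets reduced modulo $det(A)$, so no intermediate polynomial ever exceeds degree $deg(det(A))-1$.

    # Hypothetical sketch of the degree-capping idea.  poly_mod(f, m, p) is a
    # stand-in for "remainder of f modulo m over GF(p)"; applying it to every
    # entry after each elimination step keeps all intermediate degrees below
    # deg(det(A)).
    def reduce_matrix_mod_det(work_matrix, det_A, p):
        return [[poly_mod(entry, det_A, p) for entry in row] for row in work_matrix]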

My basic strategy for computing $det(A)$ is to evaluate $A$ at enough points (one more than the degree of $det(A)$, working over a field extension if there aren't enough distinct points in the base field), then compute the determinants of these scalar matrices and finally interpolate the polynomial that goes through those values. This shouldn't be too hard, but the interpolation step might be a bit tricky since the usual way to do this (an inverse FFT) requires a root of unity whose order is both greater than the number of points and a power of 2. Such a root isn't always easy to find over an arbitrary field, so I'm looking into alternative algorithms with less strict requirements on the interpolation points.
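
As a sanity check on the idea, here is a small self-contained sketch of the strategy in plain Python over a prime field $GF(p)$. It's illustrative only, not the planned Linbox code: it uses quadratic Lagrange interpolation where the real implementation would need a faster method, and it simply gives up rather than building a field extension when there aren't enough points.

    # Polynomial-matrix determinant by evaluation + interpolation over GF(p).
    # Entries are coefficient lists, lowest degree first.
    def eval_poly(f, u, p):
        """Horner evaluation of a coefficient list at the point u."""
        acc = 0
        for c in reversed(f):
            acc = (acc * u + c) % p
        return acc

    def det_mod_p(M, p):
        """Determinant of a scalar matrix over GF(p) by Gaussian elimination."""
        M = [row[:] for row in M]
        n, det = len(M), 1
        for col in range(n):
            pivot = next((r for r in range(col, n) if M[r][col] % p != 0), None)
            if pivot is None:
                return 0
            if pivot != col:
                M[col], M[pivot] = M[pivot], M[col]
                det = -det
            det = (det * M[col][col]) % p
            inv = pow(M[col][col], p - 2, p)
            for r in range(col + 1, n):
                factor = (M[r][col] * inv) % p
                for c in range(col, n):
                    M[r][c] = (M[r][c] - factor * M[col][c]) % p
        return det % p

    def interpolate(xs, ys, p):
        """Lagrange interpolation: coefficients of the polynomial through (xs, ys) over GF(p)."""
        coeffs = [0] * len(xs)
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis, denom = [1], 1
            for j, xj in enumerate(xs):
                if j == i:
                    continue
                nxt = [0] * (len(basis) + 1)
                for k, b in enumerate(basis):        # multiply basis by (x - xj)
                    nxt[k + 1] = (nxt[k + 1] + b) % p
                    nxt[k] = (nxt[k] - xj * b) % p
                basis = nxt
                denom = (denom * (xi - xj)) % p
            scale = (yi * pow(denom, p - 2, p)) % p
            for k, b in enumerate(basis):
                coeffs[k] = (coeffs[k] + scale * b) % p
        return coeffs

    def det_poly_matrix(A, p):
        """det(A) for a matrix of polynomial entries over GF(p)."""
        n = len(A)
        max_deg = max(len(e) - 1 for row in A for e in row)
        num_pts = n * max_deg + 1                    # deg(det A) <= n * max entry degree
        if num_pts > p:
            raise ValueError("not enough points in GF(p); a field extension would be needed")
        xs = list(range(num_pts))
        ys = [det_mod_p([[eval_poly(e, u, p) for e in row] for row in A], p) for u in xs]
        return interpolate(xs, ys, p)

    # det([[x, 1], [1, x]]) = x^2 - 1 over GF(7): prints [6, 0, 1].
    print(det_poly_matrix([[[0, 1], [1]], [[1], [0, 1]]], 7))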

Monday 16 June 2014

Week of 6/9 Update

This week I worked on generating benchmarking results. The super-linear speed-up I observed went away when I ran tests on other multicore machines so I'm going to chalk up the strange results to some quirk of cache behavior. I'm still seeing very good performance from the parallel code with around 50-70% efficiency for most test cases when using 8 cores.

I've done some tests on the effects of block size b on the performance of Block-Wiedemann. As expected, the degree of the matrix min-poly of a sparse, random, $n\times{}n$ matrix $A$ is about $\frac{n}{b}$. Since the degree is roughly proportional to the number of iterations of the algorithm, this means that as $b$ increases we need to do fewer loops to converge. However, the amount of work done per loop increases with $b$. Benchmarking shows that the optimal choice of $b$ to compute the min-poly is small, around 5 for $n=5000$. The opportunity for parallelism increases with $b$. With enough cores the increased parallelism is just about enough to cancel out the additional work introduced by choosing a large value for $b$. This is important because when computing invariant factors it's necessary to take a much larger $b$. This will likely also be the case for computing the min-poly of certain pathological matrices. All these results look quite promising so far, as they suggest parallel programming is the way to go for finding invariant factors.

Monday 9 June 2014

Week of 6/2 Update

This week I sorted out most of the issues I'd been having when using block-coppersmith-domain (Yuhasz's Block-Wiedemann implementation). The biggest problem was that delta, a parameter which basically determines the maximum possible degree of the candidate min-poly, was being set in something of a kludgey way. I committed a patch which chooses delta large enough that it should always find the min-poly and implements early termination. The idea behind early termination is that if the discrepancy of the candidate min-poly has been zero for some fixed number of consecutive iterations of the algorithm then there's a good chance that it's the true minimal polynomial. For all but the most pathological cases we can persevere for 20 or so iterations and be basically certain that the min-poly has been found. Preliminary testing suggests that this technique works extremely well and finds the min-poly for large matrices quite quickly.
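
A hypothetical sketch of what that check amounts to (plain Python with stand-in functions, not the actual patch):

    # Hypothetical illustration of the early-termination heuristic.
    # compute_discrepancy and update_candidate are stand-ins for the real
    # Block-Wiedemann steps; the point is just the zero-run counter.
    def run_with_early_termination(state, max_iterations, threshold=20):
        zero_run = 0
        for _ in range(max_iterations):
            disc = compute_discrepancy(state)             # stand-in
            zero_run = zero_run + 1 if disc == 0 else 0
            if zero_run >= threshold:
                break                                     # almost certainly converged
            update_candidate(state, disc)                 # stand-in
        return state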

I finished my Python implementation of Block-Wiedemann and confirmed that it gives the same results as the Linbox version. It was educational to write and may be of use later for testing of the parallel code. I'll post the code in a later update in case someone finds it useful. (Update: The code is available here)

I've started parallelizing the Block-Wiedemann code and have some results on which steps are the most time-consuming. Computing the discrepancy, generating blocks and updating the min-poly (by matrix-matrix multiplication) all take significant time. The choice of block size b and the number of non-zeroes nnz determine which is the most time-consuming step. Fortunately these can all be easily parallelized and I've already gotten great results. I'm hesitant to declare victory though, since something fishy is going on. After checking all the obvious mistakes (faulty timers, unrealistically bad sequential code, etc.) I'm still seeing consistent, super-linear speed-up in one particular part of the parallel code. When updating the min-poly I get a speed-up of more than 2 when going from one processor to two. It might just be some funny cache performance thing, but I'll investigate further to see if there isn't something wrong with the sequential code.

Monday 2 June 2014

Week of 5/26 Update

This week mentor and Linbox contributor Brice Boyer from NCSU came to deliver two talks on computational linear algebra. While he was here we discussed how to implement benchmarking in Linbox. There are currently two classes that handle benchmarking: one extensive plotting and data-recording class that Brice wrote, and a simplistic CSV serializer that I wrote. We decided that the best thing to do would be to combine the two classes so that there's one consistent interface for all benchmarking in Linbox. This will be important later once the parallel code is written and tuning can begin.

I've been working on getting a sequential version of the code needed to compute the matrix min-poly up and running. In theory this should be trivial, since there are two sequential implementations of the Block-Wiedemann algorithm already in Linbox. Unfortunately, it seems there are some bugs in George Yuhasz's version (described in his thesis here) and it was failing to work for small fields, such as GF(3). Right now I'm trying to track down the bug that's causing this. There's another implementation of Block-Wiedemann, the sigma-basis code, that I have yet to look at in any great detail. At the same time I've been working on a reference implementation of Block-Wiedemann of my own, written in Python. My hope is that I can use it to test and debug the results of the two efficient versions already in Linbox.

Although my original plan was to focus solely on Block-Wiedemann, Prof. Saunders raised another option for parallelism. In the scalar Wiedemann algorithm computations are done over an extension field GF(q^n) to solve a matrix over the base field GF(q). The computations in GF(q^n) are potentially very easily parallelizable. With this in mind, my hope is that I can improve the existing mechanism for generating extension fields, and then perform those operations in parallel. Ideally I'll be able to compare these results to those from Block-Wiedemann, but it may end up being more of a backup plan if the Block-Wiedemann solvers turn out to be more buggy than expected.