Monday, 11 August 2014

Week of 8/4 Update

This week I've been working on on fleshing out the algorithm I described last week for computing the last invariant factor of a polynomial matrix. I've written a high-level implementation in Sage to demonstrate proof of concept. There are a few subtleties I've uncovered:
  • The degree of $\bar{x}$, $N$, needs to be bigger than I originally thought. A safe bound is $N\geq{}2n*deg(A)+deg(B)+1$ although I'm not yet sure whether this bound is sharp.
  • The degree of $\vec{b}$, $d_{b}$, can be fairly small, about $2\log_{q}n$. This ensures that the components are probably distinct.
  • When computing $\frac{n_{i}}{d}$ such that $\frac{n_{i}}{d}\equiv{}\bar{x_{i}}(mod\;x^{N})$ there's one such for any given choice of $k=deg(d)$. However only one such choice of $k$ will actually solve the system. To figure out which choice is correct I run the extended euclidean algorithm for several rows of $\bar{x}$ simultaneously and mark values of $d$ that occur in more than one row as candidates.
    • Given a candidate $d$ the other $n_{i}$'s can be determined quickly by dividing $\frac{\bar{x}_{i}}{d}$
  • Any particular $\bar{x}_{i}$ has a chance of producing the wrong $d$ due to cancellation in the Cramer's rule formula $\bar{x}_{i}=\frac{|A_{i}|}{|A|}$. This occurs with probability $\frac{1}{q}$. Checking more $\bar{x}_{i}$'s for candidate $d$'s fixes this. To ensure that we get the correct $d$ as a candidate with probability $p$ we need to check $qbinom(p,n,1-\frac{1}{q})$ where qbinom is the binomial quantile function.
  • Sometimes there will be no $\alpha$ such that $|A(-\alpha)|\neq{}0$, even if $A$ is non-singular (over the rational functions). In this case we can still find an answer by finding a random solution to the Block-Toeplitz system: $\begin{pmatrix}A_{d_{b}}&A_{d_{b-1}}&...&A_{0}&0&...\\A_{d_{b+1}}&A_{d_{b}}&...&A_{0}&0&...\\&&\ddots{}&&\\0&...&A_{d_{A}}&A_{d_{A}-1}&...&A_{0}\end{pmatrix}\begin{pmatrix}\bar{x}^{(0)}\\\bar{x}^{(1)}\\\vdots{}\\\bar{x}^{N}\end{pmatrix}=\vec{0}$
 Next week I'll start wrapping things up as the Google Summer of Code draws to a close.

Monday, 4 August 2014

Week of 7/28 Update

This week I've started looking into an alternative way of computing the modulus for the polynomial ring used by SmithFormIliopolus. The problem with PolyDet is that it's painfully slow, mostly due to its heavy use of arithmetic over field extensions. The new approach would avoid the use of extensions and is based on a paper by Dixon (Exact Solutions of Linear Equations Using P-Adic Expansions (1982)). The difference is that here we're working with polynomial, not p-adic representations of integers. The goal is to compute the largest invariant factor of a matrix A. For our purposes, this is actually better than the determinant since it keeps our intermediate polynomials smaller. Here's the algorithm:
  • Choose a random polynomial vector $\vec{b}$ over the ground field $\mathbb{F}$
    • The degree of $\vec{b}$ (defined as the maximum of the degrees of the elements of $\vec{b}$) should be large enough to sample the range of $A$, but how big exactly remains unclear. 
  • Choose $\alpha$ to make $A(-\alpha)$ non-singular then let $A\leftarrow{}A(x-\alpha)$
  • Solve $A\vec{x}=\vec{b}$ for the first $2n$ terms of a power series solution for $\vec{x}$ for $n$ the sum of the degrees of each row of $A$
  • Perform a rational reconstruction to solve $\vec{x}=\frac{\vec{n}}{d}$ for $d$
    • Take a few random linear combinations of the elements of $\vec{x}$
    • For each linear combination find $f$ and $g$ of minimal degree such that $g\vec{x}\equiv{}f$ (mod $x^{2n}$)
    • Take the least common multiple of the various $g$'s
  • Return the polynomial $d$
Asymptotically this algorithm require about the same number of field operations as PolyDet. However, because it only works over the ground field I'm hoping it will be much faster. I'm working on a Sage example of this algorithm now to work out the details. Hopefully I'll then be able to get a Linbox implementation up and working to compare against PolyDet.

Monday, 28 July 2014

Week of 7/21 Update

This week I've been working on making my PolyDet code useful by creating a polynomial ring where all the operations are done modulo another polynomial. I've called this new class GivaroPolyModPoly. It reuses the Givaro Extension class which already does something similar to represent extension field elements modulo an irreducible. All I do is manually set the "irreducible" to some arbitrary polynomial. Now I can supply this ring to SmithFormIliopolus and all the intermediate polynomials will be automagically reduced modulo the polynomial I computed with PolyDet.

I've also been working on implementing parallelism in the Pascal blackbox. Some people on the FFLAS-FFPACK team have written a wrapper for task-based parallelism that we've been thinking of integrating with Linbox. The advantage to task-based parallelism is that it allows for recursive uses of parallelism, whereas things like #omp for do not. In addition the wrapper would allow us to switch to other backends such as Kaapi. I've been looking in to using these wrappers to handle parallel applyLeft() in Pascal.

Monday, 21 July 2014

Week of 7/14 Update

This week I focused on speeding up computations over GF(3). We have a number of matrices over GF(3) for which we want to compute the invariant factors. Fortunately, Bryan Youse has written a very efficient representation of these matrices called Sliced3. It uses 2 bits to represent each matrix element and operates over n/2 elements in parallel when computing a matrix-matrix multiplication where n is the word size. The code has been developed in parallel with the mainline Linbox source so the interface has diverged a bit from the rest of the Linbox matrix classes. Also, in order to speed up the critical applyLeft() method I had to specialize for this representation so that I use iterators directly rather than constructing temporary submatrices.

I further improved what I'm now calling my Pascal blackbox class which represents matrices of the form $A_{i,j}=c_{i+j}{i+j\choose{}j}$ with elements in GF(3). It turns out it's not necessary to be fast when computing ${n\choose{}k}$, instead you can take advantage of the fractal structure of this matrix. A $3^{n}\times{}3^{n}$ matrix $B_{i,j}={i+j\choose{}j}$ consists of 9 $3^{n-1}\times{}3^{n-1}$ submatrices, $B_{i,j}$ where $B_{1,1}=B_{1,2}=B_{1,3}=B_{2,1}=B_{3,1}$, $B_{2,2}=-A_{1,1}$ and $A_{2,3}=A_{3,3}=A_{3,2}=0$. This gives rise to a natural recursive implementation of applyLeft() which proves to be about 10 times fast when used with the Sliced3 package than my old generic sparse matrix implementation. It also only uses $O(n)$ memory instead of $O(n^{\frac{4}{3}})$

Monday, 14 July 2014

Week of 7/7 Update

This week I've been trying to speed up PolyDet, and to get it working with SmithFormIliopolus. Testing has revealed that, while asymptotically fast, the PolyDet code is way slower than any other code in the invariant factor pipeline. The bottleneck is at the evaluation stage, where we take $b^{2}$ polynomials of degree $d$ and evaluate them at about $bn$ points. By contrast the determinant and interpolation steps are relatively cheap. I suspect that the problem is in the GivaroPoly implementation of polynomial arithmetic. GivaroPoly and GivaroExtension extension both represent a polynomial by a std::vector. This means that a polynomial over an extension field is a vector of vectors and I suspect the doubly nested loops involved are killing performance. I'm looking into FLINT which also implements polynomial arithmetic in pure C and may be faster. Some other Linboxers have also been trying out FLINT and it sounds promising. I'm also looking into automatically switching to a Zech-log implementation which the extension field is sufficiently small, but this probably won't work for very large matrices with correspondingly large values of $d$ (roughly proportional to $\frac{n}{b}$).

I've also written a blackbox class which represents a matrix which we've been trying to get the invariant factors of. It has the structure of Pascal's triangle and the $(i,j)$ element is given by $c_{i+j}{i+j\choose{}j}$. It turns out that, with a bit of dynamic programming, ${n\choose{}k}$ can be computed in just 2 field operations using $O(n)$ space for precomputation. I took advantage of this and the regularity of the matrix to come up with a very space efficient representation that should help for the larger cases.

Monday, 7 July 2014

Week of 6/30 Update

This week I finished implementing the PolyDet class which computes the determinant of a matrix over a polynomial ring. The basic steps of the algorithm are:
  • Given: $F=GF(p)$, $A\in{}F^{n*n}[x]$
  • Let $d=\sum_{i=1}^{i=m}\max_{j\in\{1..n\}}\deg(A_{i,j})$
    • (Choose $d$ to be the sum of the maximum row degrees)
  • Let $d\leftarrow{}\min\left\{k|2^{k}>d+1\right\}$
    • (Add 1 to $d$ and round up to the nearest power of 2)
  • Let $EF=GF(p^{e})$ where $p^{e}=\min{\left\{k|\geq{}d\right\}}$
  • Map the elements of $A$ to $EF$ homomorphicaly
  • Choose d distinct points $p_{i}$ in $EF$
  • Compute $\forall{}i\in\left\{1..d\right\}:v_{i}=det(A(p_{i}))$
  • Interpolate the polynomial $P(x)$ of minimal degree that fits the points $v_{i}$
  • Map the coefficients of $P(x)$ back to the ground field and return
This was a real bear to get working primarily because it involved working so much with extension fields. I used the GivaroExtension class for this and it turns out that it's still fairly buggy. A lot of algorithms in Linbox don't work for extension fields because of subtle differences in how the interfaces work. In addition, currently you can't construct an extension of an extension using GivaroExtension, you need a different extension class such as GivaroZpz which has some limitations of its own. I'm hoping to clean up some of these issues in the near future.

Having gotten polynomial determinants working the next step is to adapt the Smith Normal Form code to use this functionality. My plan is to implement a ring class that will automatically perform operations mod the determinant polynomial $P(x)$. The SmithFormIliopolus code was intended to work in a similar manner, albeit over the integers. With luck I'll be able to reuse the field extension code which already does everything mod a polynomial, and just tell it to use an arbitrary polynomial instead.

I'm also looking into using chinese remaindering to compute the invariant factors of integer matrices. With luck this should be as simple as wrapping the existing finite field code with a call to the existing CRADomain code, and performing a quick verification step. It should also provide more opportunities for parallelism since this step is embarrassingly parallel.

Monday, 30 June 2014

Week of 6/23 Update

This week I implemented fast polynomial interpolation and evaluation. As I mentioned last week the plan is to compute $det(A)$ for a polynomial matrix $A$ by evaluating $det(A(p))$ at a bunch of points $p$ and interpolating the polynomial that runs through those values. If the largest polynomial in $A$ has degree $d$ then we need $d+1$ points. The naive algorithm for evaluating a polynomial takes $O(d)$ time per point for a total of $O(d^{2})$ field operations. Since the polynomials from block-wiedemann have degree of order $\frac{n}{b}$ (here $n$ is the matrix dimension and very big, $b$ is the blocking factor and fairly small) we definitely needed a faster method. Although we can't do an FFT per se since our fields lack points with the requisite properties, there is a similar algorithm (Modern Computer Algebra, algorithms 10.5 and 10.11) that works for any choice of points. The new PolyInterpolation class can transform a polynomial from the coefficient domain to the point domain and back in $O(d\log^{2}{d})$ time. So far it seems to work pretty well, but it still needs a little polishing up.

I also refactored the code I'd been using to compute invariant factors and wrapped it up in its own class called CoppersmithInvariantFactors. It was getting quite complicated and this should make it significantly easier to use for other applications. I also fixed some old Linbox bugs and cleaned up a matrix representation I'd been using for testing called sparse-map-map-matrix. It's now a fully fledged SparseMatrix type.

The plan now is to automatically extend the base field as needed when doing polynomial interpolation/evaluation so that there are sufficient points to represent the original polynomial. Once I've got that working I'll pull it all together into the determinant class and I'll be able to use that for the SNF computation. My original plan was to write some kind of a wrapper for the polynomial ring GivaroPoly but now I think I might be able to use a different SNF class (SmithFormIliopoulos) that already supports doing computations modulo $det(A)$.

Wednesday, 25 June 2014

Python Implementation of The Block-Wiedemann Algorithm

I've created a repository for my Python implementation of the Block-Wiedemann and scalar Wiedemann algorithms. It's available here. I must warn anyone interested in the code that it was never intended for production use and is both buggy and horrendously slow. Nevertheless, it has the advantage of being much simpler than the Linbox implementation, and may be useful to anyone trying to understand how the algorithm works.

Monday, 23 June 2014

Week of 6/16 Update

This week I worked on improving parallel performance and on computing the Smith Normal Form (SNF) of matrix min-polys. I've finally got the entire pipeline of code up and running. Given a matrix supplied in Matrix Market Format I have a single program that will read it, compute the matrix min-poly in parallel, change the representation of the min-poly to a matrix entries over a polynomial ring, compute the SNF of that matrix and return the main diagonal. This particular program is in the Linbox repository under benchmarks/invariant-factors-benchmark. So far the results look good, with the last entry being the min-poly and the others looking like the correct invariant factors.

While this is a big step forward there's still a lot that needs to be done. In particular the algorithm I'm using for computing the SNF (The Kannan-Bachem algorithm) works great for matrices over the integers, but is horrendously slow for polynomial rings. I'm 90% sure that the problem is that it's generating intermediate values that are polynomials of enormous degree, and spending way too much time manipulating those. Fortunately, there's a way around this blow-up in polynomial degree, and that's to use the fact that all the computations of the SNF of $A$ can be done modulo $det(A)$. Unfortunately Linbox doesn't have a good way of finding the determinant of a polynomial matrix right now so that's my next step. Once I've got $det(A)$, I'll either extend the Kannan-Bachem class or just write a wrapper for the polynomial ring class that does everything modulo a given polynomial.

My basic strategy for computing $det(A)$ is to evaluate $A$ at $n$ points (over a field extension if there aren't enough distinct points in the base field), then compute $n$ determinant of these scalar matrices and finally interpolate the polynomial that goes through those points. This shouldn't be too hard, but the interpolation step might be a bit tricky since the usual way to do this (Inverse FFT) requires a root of unity with an order that's greater than $n$ and also a power of 2. This isn't always easy to find over an arbitrary field so I'm looking into alternative algorithms with less strict requirements for the interpolation points.

Monday, 16 June 2014

Week of 6/9 Update

This week I worked on generating benchmarking results. The super-linear speed-up I observed went away when I ran tests on other multicore machines so I'm going to chalk up the strange results to some quirk of cache behavior. I'm still seeing very good performance from the parallel code with around 50-70% efficiency for most test cases when using 8 cores.

I've done some tests on the effects of block size b on the performance of Block-Wiedemann. As expected, the degree of the matrix min-poly of a sparse, random, $n\times{}n$ matrix $A$ is about $\frac{n}{b}$. Since the degree is roughly proportional to the number of iterations of the algorithms, this means that as $b$ increases we need to do fewer loops to converge. However the amount of work done per loop increases with $b$. Benchmarking shows that the optimal choice of $b$ to compute the min-poly is small, around 5 for $n=5000$. The opportunity for parallelism increases with $b$. With enough cores the increased parallelism is just about enough to cancel out the additional work introduced by choosing a large value for $b$. This is important because when computing invariant factors it's necessary to take a much larger $b$. This will likely also be the case for computing the min-poly of certain pathological matrices. All these results look quite promising so far as they suggest parallel programming is the way to go for finding invariant factors.

Monday, 9 June 2014

Week of 6/2 Update

This week I sorted out most of the issues I'd been having when using block-coppersmith-domain (Yuhasz's Block-Wiedemann implementation). The biggest problem was that delta, a parameter which basically determines the maximum possible degree of the candidate min-poly, was being set in something of a kludgey way. I committed a patch which chooses delta large enough that it should always find the min-poly and implements early termination. The idea behind early termination is that if the discrepancy of the candidate min-poly has been zero for O(1) iterations of the algorithms then there's a good chance that it's the true minimal polynomial. For all but the most pathological cases we can perservere for 20 or so iterations and be basically certain that the min-poly has been found. Preliminary testing suggests that this technique works extremely well and finds the min-poly for large matrices quite quickly.

I finished my Python implementation of Block-Wiedemann and confirmed that it gives the same results as the Linbox version. It was educational to write and may be of use later for testing of the parallel code. I'll post the code in a later update in case someone finds it useful. (Update: The code is available here)

I've started parallelizing the Block-Wiedemann code and have some results for what steps are the most time consuming. Computing the discrepancy, generating blocks and updating the min-poly (by matrix-matrix multiplication) all take significant time. The choice of block size b and number of non-zeroes nnz determine which is the most time-consuming step. Fortunately these can all be easily parallelized and I've already gotten great results. I'm hesitant to declare victory though since something fishy is going on. After checking all the obvious mistakes (faulty timers, unrealistically bad sequential code, etc.) I'm seeing consistent, super-linear speed-up in one particular part of the parallel code. When updating the min-poly I get a speed-up of more than 2 when going from one processor to 2. It might just be some funny cache performance thing, but I'll investigate further to see if there isn't something wrong with the sequential code.

Monday, 2 June 2014

Week of 5/26 Update

This week mentor and Linbox contributor Brice Boyer from NCSU came to deliver 2 talks on computational linear algebra. While he was here we discussed how to implement benchmarking in Linbox. There are currently two classes that handle benchmarking: one extensive plotting and data-recording class that Brice wrote, and a simplistic CSV serializer that I wrote. We decided that the best thing to do would be to combine the two classes so there's one consistent interface for all benchmarking in Linbox. This will be important later once the parallel code is written and tuning can begin.

I've been working on getting a sequential version of the code needed to compute the matrix min-poly up and working. In theory this should be trivial, since there are two sequential implementation of the Block-Wiedemann algorithm already in Linbox. Unfortunately, it seems there are some bugs in George Yuhasz's version (described in his thesis here) and it was failing to work for small fields, such as GF(3). Right now I'm trying to track down the bug that's causing this. There's another implementation of Block-Wiedemann, the sigma-basis code, that I've still yet to look at in any great detail. At the same time I've been working on a reference implementation of Block-Wiedemann of my own, written in Python. My hope is that I can use it to test and debug the results of the two efficient version already in Linbox.

Although my original plan was to focus solely on Block-Wiedemann, Prof. Saunders raised another option for parallelism. In the scalar Wiedemann algorithm computations are done over an extension field GF(q^n) to solve a matrix over the base field GF(q). The computations in GF(q^n) are potentially very easily parallelizable. With this in mind, my hope is that I can improve the existing mechanism for generating extension fields, and then perform those operations in parallel. Ideally I'll be able to compare these results to those from Block-Wiedemann, but it may end up being more of a backup plan if the Block-Wiedemann solvers turn out to be more buggy than expected.

Saturday, 24 May 2014

Week of 5/19 Update

This week I got a simple Linbox application written that calls the existing Block-Wiedemann implementation block-coppersmith-domain. This implementation is one of two in Linbox, and seems to be more amenable to parallelism. I'm now working on generating some test matrices that I can run through this code to ensure that I continue to generate the right results when I start adding OpenMP #pragmas.

I've also written a reference implementation of (non-blocked) Wiedemann's algorithm in Python along with some simple finite field libraries to accompany it. I hope to add a Block-Wiedemann implementation as well that I can use to try out algorithmic changes before I change the much more complicated Linbox version.

Monday, 31 March 2014

LinBox Code Sample

As part of my project application, the Lmonade developers asked for a
small code sample. I've posted it below. The code is very simple and
not suitable for production use but will compile and run, and helps to
illustrate the LinBox interface and some basic OpenMP use.

The program multiplies two matrices together in parallel using a
single round of Strassens algorithm to split up the work.

Known Bugs:

  - To keep things simple there are no recursive calls, so only a
    maximum of 7 cores will be used.

  - strassen() doesn't check to ensure that the input matrices are
    square and 2n*2n.

  - The locking mechanism could be made significantly more efficient
    with some finer grained locks.


Introductions


This blog will contain status updates on my Google Summer of Code
(GSoC) project "Parallel Computation of Matrix Invariant Factors in
LinBox".

My plan is to implement an algorithm for determining the invariant
factors of a matrix by the Block-Wiedemann algorithm. Some key
details:

  - All of this is being done in LinBox, so a lot of
    the scaffolding is already in place.

  - The input matrix A will be big, sparse and populated by elements
    drawn from a finite field.

  - The real focus of this project is not just to compute invariant
    factors, but to do so fast. To help that along I'm going to use
    shared memory parallelism via OpenMP. Making things work
    efficiently in parallel is almost always harder than you expect,
    so this will most likely be the most time-consuming part of the
    project.

More specifics can be found in my proposal here, and new posts will
appear here as the project progresses.