SeqAlign

Hack your Genome!

This project is a bit of a departure from my previous work in that, *gasp*, it might actually be somewhat useful (to someone, somewhere, maybe). This is SeqAlign, an FPGA-based accelerator for the two most common DNA sequence alignment algorithms: Smith-Waterman (for local alignments) and Needleman-Wunsch (for global alignments). What’s DNA, you ask? The source code of life =)

Now, let’s say we happen to wander across a DNA sequence one day. “Hey there, Mr. ACTTTCGACA! What are you used for?” A DNA sequence by itself isn’t very useful, but what if we could find similar sequences in other animals (or even other places in our own genome!)? Maybe we could get clues to its function. Or maybe even figure out which DNA sequences might have evolved from other sequences (Go Darwin!). This sounds like a job for . . . COMPUTATIONAL GENOMICS!!!

For the last decade or so, scientists have been sequencing the DNA of everything they can get their hands on. Proteins, chromosomes, even the entire human genome! There are huge databases full of A’s, G’s, C’s and T’s out there, just waiting to be searched. The common way of searching is to take our “query” sequence, and figure out where it lines up best with every other sequence in the database. Most of these will be junk, but every once in a while we might find an almost-identical gem of a sequence hidden somewhere in the database. These matches can give us clues to both the query’s role and possibly its evolutionary history. One problem, though – for the most accurate search methods, these searches can be *slow.* Really slow. For those computer scientists out there, we’re talkin’ O(n^2) slow.

The two algorithms used for searching, Smith-Waterman and Needleman-Wunsch, are examples of “dynamic programming” algorithms. Imagine we line up our two DNA sequences, the query sequence and a random entry from our database, along the top and side of a matrix.

[Figure: sw-matrix1]

Using our “dynamic programming” algorithm, we can calculate the “score” of each cell in that matrix based on the scores of the neighboring cells, like so:

[Figure: dynamic-programming]

It gets a bit complicated, but basically, if the letters at that position in each sequence match, the score goes up. If they don’t match, the score goes down. We move through the matrix one cell at a time and progressively calculate an “alignment score” for both sequences. For an n-long query sequence and an m-long database sequence, that means we have to do m*n calculations! With big sequences that can take an awfully long time! How can we speed this up?
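If you like to think in code, here’s a minimal Python sketch of that recurrence with linear gap penalties. The +2/-1/-2 match/mismatch/gap scores are just example values I picked for illustration, not the ones baked into the hardware:

```python
# Minimal Smith-Waterman scorer with linear gap penalties (illustration only).
# Each cell depends on its north, west and northwest neighbors, so filling the
# whole matrix takes n*m cell updates.

def smith_waterman_score(query, db_seq, match=2, mismatch=-1, gap=-2):
    n, m = len(query), len(db_seq)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # row/column 0 stay at zero
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == db_seq[j - 1] else mismatch
            H[i][j] = max(
                0,                      # local alignment: scores never go negative
                H[i - 1][j - 1] + s,    # northwest: align the two letters
                H[i - 1][j] + gap,      # north: gap in the database sequence
                H[i][j - 1] + gap,      # west: gap in the query sequence
            )
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("ACTTTCGACA", "ACTTCGAA"))
```

Dropping the max with 0 (and tracking the score in the bottom-right corner instead of the best cell anywhere) turns this into the global Needleman-Wunsch flavor.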

Parallelism to the rescue! To calculate a cell, we only need to know the three neighbors to the north, west and northwest. As we fill in the matrix, we can calculate more and more of the cells in parallel, calculating them in a “diagonal wave” across the matrix:

[Figure: diagonal_wave]

Now most computers only have one processor, so we can’t really take advantage of this. Fortunately, wonderful little devices called Field-Programmable Gate Arrays exist (see my Non-Von1 for another fun example) that let you make your own computer circuitry. Imagine if we could build an array of processors, connected in a line, where each processor holds one “nucleotide” from the top sequence. We could then “feed” in the sequence from one side and calculate the entire matrix just like in the picture above! This is called a “systolic array”, and is an awesome way to compute a matrix!

[Figure: systolic_array]
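To make the “diagonal wave” idea concrete, here’s a tiny Python sketch that just enumerates the anti-diagonals of the matrix. Every cell in a given group depends only on cells from the two previous groups, which is exactly why a chain of PEs can finish one whole group per clock cycle (in the actual hardware each PE owns one column of the matrix; this is just the software view of the schedule):

```python
# Walk the score matrix in anti-diagonal order: every cell on one diagonal
# depends only on the two previous diagonals, so they can all be computed
# in parallel (one clock tick of the systolic array per diagonal).

def wavefront_order(n, m):
    """Yield the (i, j) cells of each anti-diagonal of an n x m score matrix."""
    for k in range(2, n + m + 1):                       # diagonal index = i + j
        yield [(i, k - i) for i in range(max(1, k - m), min(n, k - 1) + 1)]

for tick, wave in enumerate(wavefront_order(4, 5), start=1):
    print(f"clock {tick}: {wave}")
# The wave starts at the top-left corner, widens to min(n, m) cells, then
# narrows again as it sweeps toward the bottom-right corner.
```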

This project used my trusty Digilent Nexys2 FPGA board (of Non-Von1 fame) to implement a systolic array of processors for computing the final alignment scores between two DNA sequences. It even uses the high-ish speed USB interface (about 5.5 Megabytes/sec)!!

[Figure: nexys2_400]

But what about the speed?!

This is a popular problem for benchmarking and such, since there are obvious ways to speed it up (and it’s actually fairly heavily used in practice). Performance is measured in “Cell-updates-per-second,” or CUPS. For comparison, the Cell processor inside the PS3 can calculate a peak of about 12 GCUPS (that’s billion CUPS!). I think a modern GPU can get up to about 18-20 GCUPS, and a single desktop processor core can probably calculate somewhere from 0.5-2 GCUPS. What about the performance of my implementation?

A systolic array can compute 1 cell-update per “processing element” per clock cycle:

Peak CUPS = (# of PEs) * (processor frequency)

The simplest version of my design (using linear gap penalties) was able to cram 92 PEs into my FPGA and run at a blistering 50 MHz. This has a theoretical peak of (92 * 50*10^6) = 4.6 GCUPS!! Not too shabby for a cheapo FPGA board and two weeks’ worth of effort.
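Here’s that back-of-the-envelope math spelled out, in case you want to plug in your own PE count and clock:

```python
# Peak throughput of a systolic array: one cell update per PE per clock cycle.
num_pes = 92         # processing elements that fit in the Spartan-3 (linear gaps)
clock_hz = 50e6      # 50 MHz

peak_cups = num_pes * clock_hz
print(f"{peak_cups / 1e9:.1f} GCUPS peak")   # -> 4.6 GCUPS peak
```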

“But what about in practice? That USB interface sounds like it sucks!”

The USB interface to this board does indeed suck. It takes less than 5 microseconds to compute the whole matrix, but initiating a driver call from Windows to send a byte to the device takes a whopping 370 microseconds!! Oy! Fortunately, there are tricks you can play! Thanks to the use of FIFOs, buffering and the creative application of backpressure, I was able to sustain about 1.8 GCUPS on this bad boy! All from a board that’s about 5″ x 5″ and is powered entirely off of a USB port (2.5W max!). Hello green computing!
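For the curious, here’s a rough (and very much simplified) model of why the batching helps. The 370 microsecond call overhead and ~5.5 MB/s USB rate are from above, but the sequence lengths, the 2-bits-per-nucleotide packing, and the assumption that the FIFOs let transfer and compute overlap are guesses for illustration only, not a description of the real buffering scheme:

```python
# Very rough model of sustained throughput through the slow USB interface.
# The driver-call overhead and USB rate come from the measurements above;
# everything else below is an illustrative assumption.

CALL_OVERHEAD_S = 370e-6        # one Windows driver call
USB_BYTES_PER_S = 5.5e6         # sustained USB throughput of the board
PEAK_CUPS = 4.6e9               # 92 PEs at 50 MHz
QUERY_LEN = 92                  # one nucleotide per PE
DB_LEN = 92                     # assumed length of each database sequence
BYTES_PER_SEQ = DB_LEN // 4     # assumed 2 bits per nucleotide

for batch in (1, 10, 100, 1000):        # alignments packed into one driver call
    cells = batch * QUERY_LEN * DB_LEN
    compute_s = cells / PEAK_CUPS
    transfer_s = batch * BYTES_PER_SEQ / USB_BYTES_PER_S
    # FIFOs + buffering let transfer and compute overlap, so the slower one wins
    sustained = cells / (CALL_OVERHEAD_S + max(compute_s, transfer_s))
    print(f"batch={batch:5d}: {sustained / 1e9:5.2f} GCUPS sustained")
```

The takeaway: with one tiny alignment per driver call you get crushed by the 370 microsecond overhead, but pack enough work into each call and you climb back up toward the link-limited couple of GCUPS.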

Does it scale?

This is the “will it blend?” question of the computer architecture world. Sure, your fancy design works for small examples, but can you scale it? You better believe it! The “critical path,” or longest delay path between any two parts of the circuit, is the path between two neighboring PEs. If you add more PEs, that path never gets any longer. Remembering our equation for calculating performance, doubling the number of PEs means doubling the performance – we get perfect linear scaling!

We happen to have a couple of spare FPGA boards lying around at work that we used for ASIC emulation. These things have 3 Xilinx Virtex-4 chips, the bad boys of the FPGA world, per board. Each chip can hold about 10 times as much logic as my Spartan-3, and clock at a much higher frequency. Some back-of-the-envelope calculations put performance for a single board, even using the much more flexible “affine gap penalties,” somewhere north of 120 GCUPS. Take that, NVIDIA!

Sounds fun. What about the source?

In the interest of sharing knowledge, I actually started a project for this one at opencores.org.

Check out: The SeqAlign project

It includes the Verilog source code for the processing elements, the top-level file to parameterize your array, and a simple testbench. The PEs support both local (Smith-Waterman) and global (Needleman-Wunsch) alignment scores, along with affine gap penalties.
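If you’re wondering what “affine gap penalties” are: instead of charging a flat fee for every gap character, opening a gap costs more than extending one, which models insertions and deletions in DNA more realistically. Here’s the standard textbook (Gotoh-style) recurrence sketched in Python; the PEs implement an affine-gap recurrence of this flavor in hardware, though the exact scores and implementation details will differ:

```python
# Smith-Waterman with affine gap penalties (standard Gotoh-style recurrence).
# Opening a gap costs gap_open + gap_extend; each extra gap base costs gap_extend.
# The score values are illustrative, not the ones used in the hardware.

def sw_affine_score(query, db_seq, match=2, mismatch=-1,
                    gap_open=3, gap_extend=1):
    n, m = len(query), len(db_seq)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # best alignment ending at (i, j)
    E = [[0] * (m + 1) for _ in range(n + 1)]  # best ending in a gap in the query
    F = [[0] * (m + 1) for _ in range(n + 1)]  # best ending in a gap in the database
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == db_seq[j - 1] else mismatch
            E[i][j] = max(E[i][j - 1] - gap_extend,
                          H[i][j - 1] - gap_open - gap_extend)
            F[i][j] = max(F[i - 1][j] - gap_extend,
                          H[i - 1][j] - gap_open - gap_extend)
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

print(sw_affine_score("ACTTTCGACA", "ACTTCGAA"))
```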

I imagine no one will ever actually use this, but who knows =)

In case anyone has gotten this far and finds themselves with an unquenchable thirst for more knowledge, here is the final_report (pdf) I wrote about this project for the class I was taking.