Monday, August 4, 2008

Dynamic Programming

At an interview I had a few years back, I was given a question on Dynamic Programming. Fortunately, I had recently written a diff program and was familiar with the term. Although the term has a broad meaning, the interviewer was looking for a particular approach to a particular problem. Let me outline that.

Given a sequence of objects, a subsequence is derived by deleting zero or more objects from the sequence. If our sequence is (a b c d e), one subsequence is (a b e), another is (b d). Given two sequences, there is always at least one subsequence common to both (the empty subsequence), and usually several. The longest common subsequence is of interest in programs like diff.

There is an obvious recursive way to find the length of the longest common subsequence:
(define (lcs seq1 seq2)
  ;; Length of the longest common subsequence of seq1 and seq2.
  (cond ((or (null? seq1) (null? seq2)) 0)
        ;; The heads match, so that element is part of the common subsequence.
        ((eq? (car seq1) (car seq2))
         (+ 1 (lcs (cdr seq1) (cdr seq2))))
        ;; The heads differ; drop one or the other and keep the better result.
        (else (let ((t1 (lcs seq1 (cdr seq2)))
                    (t2 (lcs (cdr seq1) seq2)))
                (if (> t1 t2)
                    t1
                    t2)))))
Informally, we look at the head of each sequence. If the heads match, then that element is common. If not, then either we should drop the head of the first or we should drop the head of the second. We try both and pick the longer one.

This is not a very efficient algorithm; there is a lot of repeated computation. To save that computation, we have a trick. We create an N by M array to hold the intermediate results of the recursive calls. Along the top of the array we put sequence 1 in reverse order. Along the left side we put sequence 2 in reverse order. Now we can fill out the cells in the array. In the first row and the first column we put zero; this corresponds to the first clause in our conditional. In each remaining cell we do this: if the object at the top of the column matches the object at the left of the row, we add 1 to the value in the cell diagonally up and to the left of us. Otherwise, we take the maximum of the cell immediately to the left and the cell immediately above. We fill the array from left to right, top to bottom, until we reach the final cell, which contains the answer.
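For concreteness, here is a sketch of that table-filling version in Scheme. This is my own illustration, not the code the interviewer had in mind; it uses one extra row and one extra column to hold the zeros.

(define (lcs-table seq1 seq2)
  ;; Sketch of the table-filling version of lcs.
  (let* ((v1 (list->vector (reverse seq1)))   ; sequence 1, reversed, along the top
         (v2 (list->vector (reverse seq2)))   ; sequence 2, reversed, along the left side
         (n  (vector-length v1))
         (m  (vector-length v2))
         (table (make-vector (+ m 1))))
    ;; Row zero and column zero stay zero, matching the base case of the recursion.
    (do ((i 0 (+ i 1))) ((> i m))
      (vector-set! table i (make-vector (+ n 1) 0)))
    (do ((i 1 (+ i 1))) ((> i m))
      (do ((j 1 (+ j 1))) ((> j n))
        (vector-set!
         (vector-ref table i) j
         (if (eq? (vector-ref v2 (- i 1)) (vector-ref v1 (- j 1)))
             ;; Match: one more than the cell diagonally up and to the left.
             (+ 1 (vector-ref (vector-ref table (- i 1)) (- j 1)))
             ;; No match: the larger of the cell above and the cell to the left.
             (max (vector-ref (vector-ref table (- i 1)) j)
                  (vector-ref (vector-ref table i) (- j 1)))))))
    ;; The last cell filled holds the answer.
    (vector-ref (vector-ref table m) n)))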

This is a cute trick, but it is unnecessarily complicated. If you instead take the simple, obvious recursive solution and just memoize the results as they are computed, each intermediate result is computed exactly once and you get the same performance improvement as with generating an explicit table. In the interview, the interviewer didn't know that and was expecting me to write out a table of intermediate results. He kept hinting at it until I had to tell him that simply adding memoization to the recursive solution would yield the same answer without having to reconfigure the problem to work bottom-up in a table.
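Here is a sketch of what memoizing the recursive solution might look like. The recursion is exactly the one above; the only change is that each subproblem's answer is recorded the first time it is computed. The two-level association list keyed on the tail pairs is just for brevity; a hash table keyed the same way would serve as well.

(define (lcs-memo seq1 seq2)
  ;; Same recursion as lcs above, but each subproblem is computed only once.
  (let ((cache '()))     ; two-level alist: tail of seq1 -> (tail of seq2 . answer)
    (define (cached s1 s2)
      (let ((row (assq s1 cache)))
        (and row (assq s2 (cdr row)))))
    (define (remember! s1 s2 value)
      (let ((row (assq s1 cache)))
        (if row
            (set-cdr! row (cons (cons s2 value) (cdr row)))
            (set! cache (cons (list s1 (cons s2 value)) cache)))
        value))
    (define (recur s1 s2)
      (cond ((or (null? s1) (null? s2)) 0)
            ((cached s1 s2) => cdr)          ; already computed; just fetch it
            ((eq? (car s1) (car s2))
             (remember! s1 s2 (+ 1 (recur (cdr s1) (cdr s2)))))
            (else
             (remember! s1 s2 (max (recur s1 (cdr s2))
                                   (recur (cdr s1) s2))))))
    (recur seq1 seq2)))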

The next speedup to the Bayesian Hierarchical Clustering comes from this exact trick. When we compute the values of ph1-log and px-log for a subtree, we memoize them in the tree structure itself. The next time we need a value we can simply fetch it from the stored location. We can make this even more efficient by computing the value at the time of tree creation, so we don't have to check whether the value is memoized; it will always be present in the tree structure. If we change the code in this way, we'll see something amusing: although conceptually we are computing the values top-down as we build the tree, we are, through memoization, computing the values in bottom-up order.
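As a rough sketch only (the real cluster representation and the ph1-log and px-log calculations are in the earlier posts; the constructor, the compute- helpers, and the vector layout below are hypothetical placeholders), computing the values once at tree-creation time might look like this:

(define (make-cluster left-child right-child)
  ;; Hypothetical constructor: compute the log values once, when the node is
  ;; built, and store them in the node itself.  compute-ph1-log and
  ;; compute-px-log stand in for the real calculations from the earlier posts.
  (let* ((ph1-log (compute-ph1-log left-child right-child))
         (px-log  (compute-px-log  left-child right-child ph1-log)))
    (vector 'cluster left-child right-child ph1-log px-log)))

;; The "memoized" values are now just slot references; there is nothing to check.
(define (cluster-ph1-log cluster) (vector-ref cluster 3))
(define (cluster-px-log  cluster) (vector-ref cluster 4))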

Memoization also helps significantly for log-bernoulli-beta. Because we are building our clusters from the bottom up, a great deal of time is spent examining small clusters and leaf nodes, which means that the two final arguments to log-bernoulli-beta are usually small integers. Rather than compute the values on demand, it is much quicker to pre-compute the values for small integers and simply fetch them from a table.
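A sketch of that pre-computation, assuming, purely for illustration, a call shape of (log-bernoulli-beta alpha beta on-count off-count) with alpha and beta fixed for a run, and an arbitrary cutoff of 64 for "small":

(define *small-count-limit* 64)   ; arbitrary cutoff for "small" counts

(define (make-log-bernoulli-beta-table alpha beta)
  ;; Pre-compute log-bernoulli-beta for every pair of small counts.
  (let ((table (make-vector *small-count-limit*)))
    (do ((on 0 (+ on 1))) ((= on *small-count-limit*) table)
      (let ((row (make-vector *small-count-limit*)))
        (vector-set! table on row)
        (do ((off 0 (+ off 1))) ((= off *small-count-limit*))
          (vector-set! row off (log-bernoulli-beta alpha beta on off)))))))

(define (log-bernoulli-beta/cached table alpha beta on off)
  ;; Fetch from the table when both counts are small, otherwise compute as usual.
  (if (and (< on *small-count-limit*) (< off *small-count-limit*))
      (vector-ref (vector-ref table on) off)
      (log-bernoulli-beta alpha beta on off)))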

These improvements to the code add about another factor of ten, so we can now cluster about one or two thousand data points in a reasonable amount of time. This turned out to be not quite enough for my problem and I needed to squeeze out some more performance. That is in the next post.
