20090930


We know that the Subset Sum problem is NP-Complete, and we know that the practical implication of this is that nobody  has found (and most believe that nobody will ever find) a polynomial time algorithm to solve this problem.

But suppose we absolutely have to solve this problem.  Let's call the original set S, and the target value k.  We could use the naive Brute Force and Ignorance (BFI) algorithm: enumerate all the subsets of S, and check the sum of each one to see if any of them sum to k.  Since there are 2^n subsets, this algorithm is going to take O(2^n) time.

It turns out we can do better ...


Suppose we split the set S into two equal sized subsets A and B.  If S has a subset that sums to k, then the subset is either completely in A, completely in B, or partly in A and partly in B.

Since we aren't using any kind of complicated selection process to split S in half, the split just take O(n) time.

Now we compute the sum of every subset of A, and the sum of every subset of B.  We can simply use the BFI algorithm to list all these subset sums - it takes O(2^(n/2)) time.
We can't do this any faster than O(2^(n/2)) time, since that is the number of subsets of each of A and B.  
If we find any subset of A or B that sums to k, we can stop right away - we have found what we are looking for.  
If none of the subset of A or of B sums to k, we need to deal with the possibility that S has a subset that sums to k, but that subset is partly in A and partly in B.  If that is the case, then the set of subset sums from A will contain some value a, and the set of subset sums from B will contain some value b, such that a+b = k.  The challenge is to find such a pair ... quickly.

The trick to doing this is to sort the two sets of subset sums (note that we are sorting the set of sum values for A and for B, not the sets A and B themselves).  Each of these sets has 2^(n/2) elements, so sorting each of them takes O(n * 2^(n/2)) time.

Call the sorted set of A's sums A*, and the sorted set of B's sums B*.

Now we can search for a winning pair (an "a" in A* and a "b" in B* such that a+b = k) by starting with the smallest value in A* and the largest value in B*.  If the sum of these values = k, we win and we stop!  If the sum of these values is < k, then we can forget about this value in A* - it cannot be part of a winning pair.  Similarly, if the sum of these values is > k, we can forget about this value in B* - it cannot be part of a winning pair.  Thus after adding these numbers together, we either stop, or we eliminate a value from A* or a value from B*.  Repeating this operation can only be done 2*2^(n/2) times - after that many iterations we will have eliminated all values in A* and B*.   Since each iteration takes constant time, this part of the algorithm takes O(2^(n/2)) time.

We see that the dominant term in the time analysis is the time for the sort:  O(n * 2^(n/2)), and thus this is the complexity of the algorithm.  Comparing this to the BFI algorithm, we see that this algorithm is an enormous improvement.  It still requires exponential time, but the exponent has been cut in half.

This algorithm illustrates the Divide and Conquer paradigm:

To solve a problem of size n  (typically involving a set of n elements)
    - divide the set into 2 or more disjoint subsets
    - solve the problem on the subsets independently
    - combine the subproblems' solutions to get the solution to the original problem

As we saw with the Subset Sum algorithm, D&C can be an effective technique for designing efficient algorithms.