Break down the input, A, into N equal (or nearly equal) substrings, A1, A2... AN. Simple enough, right? Funny that I haven't seen this before - and that I hadn't thought of it sooner.
Next, create groups of processors, G1, G2... GN. G1 will be a single processor, which will start in the start state of the automaton. Each Gi (where i > 1) will be a group of S processors, where S is the number of states in the machine, and each processor in that group will start in a different state of the machine.
Each processor in Gi will process the input Ai.
Call G1 "finished" when it has either reached a "finish" state (e.g. SError, or, say, a "terminal" leaf in a tree) or processed all of its input. Call Gi (where i > 1) "finished" when one of its processors has been contacted by the prior group and that processor has finished processing all of its input.
When Gi finishes, it is either in a "finish" state (e.g. SError) or some other state. If it is in a "finish" state, return that state. If it is in another state, it contacts the processor in Gi+1 which started in the state where Gi finished. The state returned by GN is the final state of the automaton.
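To make this concrete, here's a rough sketch of the whole scheme in Python (my own illustration: the thread pool, the function names, and the decision to skip the early exit on a "finish" state are all choices I'm making, not part of the description above):

    from concurrent.futures import ThreadPoolExecutor

    def run_chunk(delta, start_state, chunk):
        """Run the DFA's transition table over one substring, starting
        from start_state, and return the state it ends in."""
        state = start_state
        for symbol in chunk:
            state = delta[state][symbol]
        return state

    def parallel_dfa(delta, start_state, states, input_str, n_chunks):
        """Speculative parallel simulation: the first chunk runs from the
        start state; every later chunk is run from *every* state in
        parallel, and the results are chained together afterwards."""
        size = -(-len(input_str) // n_chunks)  # ceiling division
        chunks = [input_str[i:i + size] for i in range(0, len(input_str), size)]

        with ThreadPoolExecutor() as pool:
            # G1: one processor, starting in the automaton's start state.
            first = pool.submit(run_chunk, delta, start_state, chunks[0])
            # Gi (i > 1): S processors per chunk, one per possible start state.
            speculative = [
                {s: pool.submit(run_chunk, delta, s, chunk) for s in states}
                for chunk in chunks[1:]
            ]
            # Chaining: the state where Gi finished picks which of Gi+1's
            # speculative results to keep; everything else is thrown away.
            state = first.result()
            for group in speculative:
                state = group[state].result()
        return state

    # Example: a 2-state DFA over {'0', '1'} that tracks the parity of '1's.
    delta = {'even': {'0': 'even', '1': 'odd'},
             'odd':  {'0': 'odd',  '1': 'even'}}
    print(parallel_dfa(delta, 'even', ['even', 'odd'], '1101101', 3))  # -> 'odd'

(Python threads won't actually buy you a speedup here because of the GIL; the point is just the shape of the computation.)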
Realistic Applications
Of course, the real drawback here is that for a speedup of N, we need (N-1)*S extra processors (where S is the number of states in the machine), plus the one running G1. If we have very few states and a huge input, this could potentially be useful. However, if we have a huge number of states, we will require a far larger number of processors to get any speedup. Furthermore, this is going to generate a lot of wasted cycles and drive up electricity bills, which is an increasing concern.
However, if we take this general method as a model for dealing with parallel DFA simulation, then it starts to get pretty useful. For example, in the Huffman decoding method I described in a prior post, we know that the DFA is actually a tree, so group Gi only needs to simulate the subtrees rooted at states that are reachable after the first len(A1) + len(A2) + ... + len(Ai-1) input symbols - likely a much smaller set than the full set of states in the DFA.
More generally, if we know beforehand what states are reachable after a given number of inputs (which can be computed easily), we can often reduce the number of processors which must be used.
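As a sketch of that (using the same transition-table representation as above, and again just an illustration), the set of states reachable after k symbols falls out of a simple forward pass:

    def reachable_after(delta, start_state, k):
        """Return the set of states the DFA could be in after exactly k
        input symbols, whatever those symbols turn out to be."""
        frontier = {start_state}
        for _ in range(k):
            frontier = {delta[s][a] for s in frontier for a in delta[s]}
        return frontier

Each group Gi then only needs one processor per state reachable after len(A1) + ... + len(Ai-1) symbols. In practice you'd compute the frontiers for all chunk boundaries in one incremental pass rather than restarting from scratch for each i.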
Long-term, Pie-in-the-Sky Approaches
The approach I described before is actually very similar to some forms of branch prediction. A branch-predicting processor might look at a conditional branch, then actually start computing two different trains of logic - one based on the assumption that the conditional will evaluate to "true," and one on the assumption that it will evaluate to "false" (okay, so most processors are a little more complex, but that's the idea). When the conditional is evaluated, one of these strands of computation is thrown out and the results of the other are retained.
Imagine if our computer knew that a certain computation-heavy function would return one of X different values, assigned a processor to each of those assumptions, and retained the results from whichever of the X processors guessed right. It's not all that different from the method I've described for simulating DFAs.
Granted, at this point, this approach is only useful if we know that the function will return one of a handful of known values. But what if we effectively had an infinite number of processors to throw at this problem? Say that we had a function which returns a 4-byte integer. That's quite a few possibilities, but what if we still just assigned a processor to each of them? If the number of cores in a processor doubles every 18 months, then we might actually be able to do something like this - and while it would be a dumb way to parallelize all our sequential code, it would still be possible to do entirely at the hardware level.
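A software caricature of this kind of value speculation (again my own sketch, not a description of any existing hardware) might look like:

    from concurrent.futures import ThreadPoolExecutor

    def speculate(slow_fn, possible_values, continuation):
        """Run slow_fn, and in parallel run the continuation under every
        value slow_fn might return; keep the result whose guess turns out
        to be right and throw the rest away."""
        with ThreadPoolExecutor() as pool:
            actual = pool.submit(slow_fn)
            guesses = {v: pool.submit(continuation, v) for v in possible_values}
            return guesses[actual.result()].result()

    # Hypothetical usage: expensive_lookup() returns one of a few known codes,
    # and render(code) is the downstream work that depends on that code.
    #   result = speculate(expensive_lookup, [200, 404, 500], render)

A real implementation would cancel the losing guesses rather than letting them run to completion, and would have to be careful about side effects, but the trade is the same one the DFA scheme makes: redundant work in exchange for not having to wait.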
In the meantime, more formal languages allow us to specify the exact range of values a function can return. Again, this could make the "shotgun" approach I've described much less wasteful.