arch detail

Friday, February 23, 2007

Musings on Parallelism in Finite-State Automata

In a recent post, I responded to Berkeley's statement that FSA modeling is "embarrassingly sequential," i.e., that it will not benefit from parallelization. (If you're unfamiliar with FSAs, check out my introduction, which also links to Wikipedia.)

In this article, I'm addressing possible grounds for parallelization in the two simpler families of FSA: stackless finite-state automata (deterministic and non-deterministic finite automata, DFAs and NDFAs) and finite-state automata with one stack (push-down automata, PDAs). The final family, two-stack PDAs, is equivalent to Turing machines and is just too large a problem for me to address, at least for now.

I'll try to address these families of machines with regard to the following questions:

  1. Where and when can we parallelize? Are there formal methods for establishing parallelism?
  2. How can we make this parallelism useful, and what applications could benefit?

The First Family: Stackless Finite Automata

"Normal" DFAs/NDFAs don't have stacks; they can recognize regular expressions.

I was initially shocked to hear that Berkeley would say that FSA modeling can't be parallelized. This is partly because any introductory textbook notes that DFAs and NDFAs are equivalent in power, and uses the word "parallel" to describe the way NDFAs work. With an NDFA, when we "non-deterministically" pursue two transitions on the same input, we can think of the machine splitting into two identical copies, one of which takes the first transition while the other takes the second. The two machines run in parallel. This would be easy to implement on a parallel-processor machine: you just spawn multiple threads.
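
To make that concrete, here's a minimal sketch of the "spawn a copy at every choice point" idea in Python. The machine itself (a toy NFA for strings ending in "110"), the dictionary-based transition-table encoding, and the state names are all invented for illustration; a real implementation would not want one OS thread per copy.

    # A minimal sketch of the "split into identical copies" idea. The NFA here
    # (a toy machine accepting strings that end in "110"), its dictionary
    # encoding, and the state names are all invented for illustration.
    import threading

    NFA = {
        ("q0", "0"): {"q0"},
        ("q0", "1"): {"q0", "q1"},   # the non-deterministic choice on '1'
        ("q1", "1"): {"q2"},
        ("q2", "0"): {"q3"},
    }
    ACCEPT = {"q3"}

    accepted = threading.Event()

    def run(state, remaining):
        """Simulate one copy of the machine, forking a thread per choice."""
        if not remaining:
            if state in ACCEPT:
                accepted.set()
            return
        copies = []
        for nxt in NFA.get((state, remaining[0]), set()):
            t = threading.Thread(target=run, args=(nxt, remaining[1:]))
            t.start()
            copies.append(t)
        for t in copies:
            t.join()

    run("q0", "01110")        # every copy runs to completion...
    print(accepted.is_set())  # ...and at least one ends in an accept state: True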

In practice, though, this is not how NDFAs get implemented. They usually just get converted to DFAs. There is a well-known method for this (the subset construction), which converts an NDFA to an equivalent DFA, and it can lead to an exponential increase in the number of states.
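
For reference, here is a compact sketch of that conversion, the standard subset construction, in which each DFA state is a set of NFA states. The NFA encoding is the same invented dictionary format as in the sketch above, with epsilon-transitions left out for simplicity.

    # A sketch of the standard subset construction: each DFA state is a frozenset
    # of NFA states. The NFA encoding (a dict of (state, symbol) -> set of next
    # states, no epsilon-transitions) is the same invented format as above.
    def nfa_to_dfa(nfa, start, alphabet):
        dfa = {}                        # frozenset of NFA states -> {symbol: frozenset}
        todo = [frozenset([start])]
        while todo:
            current = todo.pop()
            if current in dfa:
                continue
            dfa[current] = {}
            for sym in alphabet:
                nxt = frozenset(s for q in current
                                  for s in nfa.get((q, sym), set()))
                dfa[current][sym] = nxt
                todo.append(nxt)
        return dfa

    nfa = {                             # the "ends in 110" machine again
        ("q0", "0"): {"q0"},
        ("q0", "1"): {"q0", "q1"},
        ("q1", "1"): {"q2"},
        ("q2", "0"): {"q3"},
    }
    dfa = nfa_to_dfa(nfa, "q0", "01")
    print(len(dfa), "DFA states")       # small here, but up to 2^n in the worst case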

But this actually doesn't affect runtime. It seems intuitive that having exponentially fewer states would lead to exponentially faster runtimes, but it doesn't. In an NDFA, parsing 01110 takes five transitions (possibly more if you allow epsilon-transitions). In a DFA, parsing 01110 also takes five transitions. This holds for every string: the number of transitions is determined by the length of the input, not by the number of states. This is trivial to prove, and I'm amazed I didn't realize it sooner.

There do exist methods for reducing the number of states and transitions in an NDFA. (See also: this one.) The problem is NP-hard, but it can still be done.

I don't think the above maximizes non-determinism, but in most cases I would imagine that adding non-determinism is the best way to reduce the number of states. I'm also all but sure that for the same regular expression, the minimal NFA has as many or fewer states than the minimal DFA; it can't have more, since every DFA is already an NFA. The classic example is the language of strings whose k-th symbol from the end is a 1: an NFA needs only k+1 states, while the minimal DFA needs 2^k.

So Berkeley is essentially correct in this case: stackless DFAs and NDFAs are "embarrassingly sequential" as far as runtime goes. We could, however, reduce memory usage for some problems if we used NDFAs.

So in response to question 2, parallelizing FAs could be useful in the memory domain. Memory usage is going to be an increasingly large problem in the future, as on-chip memory is going to remain expensive even as the number of cores on a chip increases. And in the real world, memory access times are often the limiting factor in execution time for a computer program.

The problem with parallelizing an NDFA in the naive way I described earlier is the overhead of process creation and destruction. To get around this, imagine the following: we have an effectively infinite number of cores, each with a small per-core memory and a larger shared per-chip memory. To simulate an NDFA, execute the following steps:

  1. Start an NDFA-modeling process on each core, and choose one as the "leader" process.
  2. Let each process follow transitions simultaneously on the FA being modeled, reading the input and the machine's transition table from the per-chip memory. If the transition is non-deterministic, the leader process divides the affected processes among the different transitions; if the transition is deterministic, each affected process simply follows it.
  3. Repeat step 2 until the input is exhausted.
  4. Check whether any process is in an accept state. If so, accept.
This approach would let us use our many, many cores to reduce memory usage, which means fewer costly fetches from more "distant" memory (RAM, hard disk) and effectively reduces run time. A clever implementation could remove the concept of a "leader" process so that no core-to-core communication is required. The only problem, of course, is that the number of cores isn't really infinite, so we will need some ugly tricks to make sure we don't run out of cores.
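
Here's a rough sketch of the step loop described above, with a process pool standing in for the "one copy per core" picture. The NFA encoding is the same invented one as before; on a toy machine like this the per-step overhead would of course swamp any benefit, so treat it as a picture of the control flow rather than something you'd run for speed.

    # A rough sketch of the step loop above: a process pool stands in for
    # "one copy per core," and in each round every live state follows its
    # transitions in parallel. Same invented NFA encoding as before.
    from multiprocessing import Pool

    NFA = {
        ("q0", "0"): {"q0"},
        ("q0", "1"): {"q0", "q1"},
        ("q1", "1"): {"q2"},
        ("q2", "0"): {"q3"},
    }
    ACCEPT = {"q3"}

    def step(args):
        state, symbol = args                       # one worker advances one copy
        return NFA.get((state, symbol), set())

    def parallel_accepts(string, start="q0"):
        frontier = {start}                         # the set of live copies
        with Pool() as pool:
            for symbol in string:
                results = pool.map(step, [(s, symbol) for s in frontier])
                frontier = set().union(*results) if results else set()
                if not frontier:                   # every copy has died
                    return False
        return bool(frontier & ACCEPT)

    if __name__ == "__main__":
        print(parallel_accepts("01110"))           # True: "01110" ends in "110"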

Regular expressions get used all over the place, and there are known methods (such as Thompson's construction) for converting a regular expression to an NDFA. It would be mighty neat to have a more memory-lightweight Perl, though I'm not sure whether physics/gaming/business endeavors use a lot of NDFA models. However, if I'm right that there are methods for finding the non-determinism in FAs, then we can automatically seek out non-determinism and take advantage of it.
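
As a reminder of what such a conversion looks like, here's a compressed sketch of a Thompson-style construction. It assumes the regular expression has already been parsed into a tree; the tuple encoding of that tree ("sym", "cat", "alt", "star") is my own shorthand, not any standard library's format.

    # A compressed sketch of a Thompson-style regex -> NFA construction, working
    # on an already-parsed regex tree encoded with tuples I made up for this
    # example: ("sym", "a"), ("cat", e1, e2), ("alt", e1, e2), ("star", e).
    # Transitions come out as (state, symbol_or_None, state) triples, where
    # None marks an epsilon-transition.
    import itertools

    fresh = itertools.count()

    def build(node, trans):
        """Return (start, accept) states for `node`, appending transitions to `trans`."""
        kind = node[0]
        start, accept = next(fresh), next(fresh)
        if kind == "sym":
            trans.append((start, node[1], accept))
        elif kind == "cat":
            s1, f1 = build(node[1], trans)
            s2, f2 = build(node[2], trans)
            trans += [(start, None, s1), (f1, None, s2), (f2, None, accept)]
        elif kind == "alt":
            s1, f1 = build(node[1], trans)
            s2, f2 = build(node[2], trans)
            trans += [(start, None, s1), (start, None, s2),
                      (f1, None, accept), (f2, None, accept)]
        elif kind == "star":
            s1, f1 = build(node[1], trans)
            trans += [(start, None, s1), (f1, None, s1),
                      (start, None, accept), (f1, None, accept)]
        return start, accept

    # (0|1)*110 -- roughly the "ends in 110" language again
    regex = ("cat", ("star", ("alt", ("sym", "0"), ("sym", "1"))),
                    ("cat", ("sym", "1"), ("cat", ("sym", "1"), ("sym", "0"))))
    transitions = []
    start, accept = build(regex, transitions)
    print(len(transitions), "transitions, start", start, "accept", accept)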

The Second Family: One-Stack Push-down Automata

One-stack push-down automata (PDAs) and their non-deterministic brethren (NDPDAs) speak context-free languages, i.e., the languages generated by context-free grammars (CFGs). (Strictly speaking, it's the non-deterministic variety that covers all CFGs; deterministic PDAs handle only a subset.)

While I haven't seen any method for maximizing non-determinism in an NDPDA, I do know that in real life we get quite a bit of non-determinism simply by converting a CFG to an NDPDA, so I haven't dedicated a whole lot of thought to figuring out how to maximize non-determinism or minimize the number of states.

So in response to question 1: we can definitely parallelize using the non-determinism of NDPDAs in a way that reduces run time.

In response to question 2: it probably is worth the bother to parallelize various implementations of PDAs.

Here's the gist of it: PDAs are used for recognizing strings generated by a context-free grammar. They are also used in real life for generating "parse trees," which show how an expression in a CFG was derived. This is the first step in both compilation and interpretation, and is hugely useful in computer science.

In short, most textbooks (and numerous places on the internet) explain simple methods for converting a CFG to an NDPDA, and the resulting non-determinism is easy to take advantage of on parallel systems.
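
Here's a sketch of what that textbook conversion amounts to: every production A -> X1...Xk becomes an epsilon-move that pops A and pushes X1...Xk, and every terminal becomes a move that pops itself while consuming one input symbol. The dictionary encoding of the grammar is my own shorthand, and the grammar itself is the contrived one from the example below.

    # A sketch of the textbook CFG -> NDPDA conversion: each production becomes an
    # epsilon-move "pop the non-terminal, push its right-hand side," and each
    # terminal becomes a move "on input t, pop t." The grammar encoding (dict of
    # non-terminal -> list of right-hand-side tuples) is invented shorthand.
    def cfg_to_pda_moves(grammar):
        moves = []
        for nonterminal, alternatives in grammar.items():
            for rhs in alternatives:
                moves.append(("expand", nonterminal, rhs))    # epsilon-move
        terminals = {sym for alts in grammar.values()
                         for rhs in alts for sym in rhs if sym not in grammar}
        for t in sorted(terminals):
            moves.append(("match", t))                        # consume t, pop t
        return moves

    grammar = {"S": [("A",), ("B",)],
               "A": [("0", "1", "0", "A"), ("0", "1", "0")],
               "B": [("0", "0", "0")]}
    for move in cfg_to_pda_moves(grammar):
        print(move)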

As an example, say we have the following grammar (highly contrived):

[S] := [A] | [B]
[A] := 010 [A] | 010
[B] := 000
where S, A, and B are non-terminals and 0, 1 are terminals. (This example is contrived, but it should do for now.)

A simple NDPDA would start with the following:

  1. Put S on the stack
  2. Non-deterministically choose one of the following: pop S and push B, or pop S and push 0, 1, 0, A.

Once a couple of input symbols have been consumed, one of the two paths will hit a mismatch and die, and the other will, if the string exists in the language, go on to accept.

While this language is highly contrived indeed, this is very much the kind of thing that happens in real-life parsers.
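
To show how the non-determinism could actually be exploited, here's a sketch that keeps a set of live (remaining input, stack) configurations for the contrived grammar above and advances all of them in lock-step. The exploration is written sequentially here, but each configuration in the set is independent and could be handed to its own core; the grammar encoding is the same invented shorthand as above.

    # A sketch of exploring the non-determinism above: keep a set of live
    # (remaining input, stack) configurations and advance every one of them in
    # lock-step, one "copy" per branch.
    GRAMMAR = {"S": [("A",), ("B",)],
               "A": [("0", "1", "0", "A"), ("0", "1", "0")],
               "B": [("0", "0", "0")]}

    def accepts(string):
        configs = {(string, ("S",))}               # (remaining input, stack) pairs
        while configs:
            next_configs = set()
            for remaining, stack in configs:
                if not stack:
                    if not remaining:
                        return True                # some copy consumed everything
                    continue                       # leftover input: this copy dies
                top, rest = stack[0], stack[1:]
                if top in GRAMMAR:                 # expand: one branch per production
                    for rhs in GRAMMAR[top]:
                        next_configs.add((remaining, tuple(rhs) + rest))
                elif remaining and remaining[0] == top:
                    next_configs.add((remaining[1:], rest))   # match a terminal
            configs = next_configs
        return False

    print(accepts("010010"))   # True:  S -> A -> 010 A -> 010 010
    print(accepts("000"))      # True:  S -> B -> 000
    print(accepts("011"))      # False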

It is also important to note here that there are many families of CFGs; some are infinitely recursive in a way that deterministic models can't deal with, and parallel implementations won't be able to deal with them either. So unlike with stackless FAs, we now have a distinction between parallel and non-deterministic.

So in response to question 2: so long as our context-free languages are not infinitely recursive, we can definitely parallelize. We descend potential parse trees all the time, and often have to reject them and try new ones; descending multiple parse trees at once could cut parsing time substantially.

Of course, parsing is often considered a one-time job. Most software gets parsed, compiled, and saved in a compiled form. But I strongly hope that parallel interpretation will help improve the latency of interpreted languages, as interpreted languages are generally more flexible and portable.

It would be particularly nice to see a language designed for maximally parallel interpretation. Further, it would even be possible to non-deterministically make assumptions about code as it is being parsed and attempt various optimized branches, throwing away those which are based on incorrect assumptions. With a sufficient number of cores, this approach might actually see some beneficial returns (whereas currently, optimized just-in-time compilation adds more latency than it removes).


Conclusions

In summary, I think that non-determinism and parallelism have an important but surprisingly complex relationship. While many finite-state automata problems won't be improved by increasingly parallel technology, the "manycore" movement might speed up some problems (stackless FAs), will definitely speed up others (CFGs/PDAs), and will hopefully encourage some new developments (better interpreted languages).
