arch detail

Tuesday, February 27, 2007

The Usable Linux Softsynth Roundup

I've recently been blown away by the incredible expressiveness of the open-source synthesizer zynaddsubfx. I've always known it had some really great sound possibilities, but I recently started actually studying its inner workings, and I'm completely amazed. If anyone here is interested in learning about making beautiful noises, definitely check it out.

The biggest deal for me is that the creator has invented what appears to be (as far as I know) a new type of sound synthesis. He calls it PADSynth, and there's even a complete description of the algorithm, albeit in broken English, on the website. The brilliance of it is that it takes advantage of the fact that an inverse FFT always generates a perfectly periodic sample, which for audio purposes means completely seamless, click-free looping.
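
Roughly, this is my understanding of the core trick, sketched with numpy. It's a crude approximation of my own, not the actual PADSynth code, and all of the names and parameter values here are mine:

import numpy as np

N = 2 ** 18                  # length of the output sample (and of the spectrum)
sample_rate = 44100
f0 = 110.0                   # fundamental frequency in Hz

# Build a frequency-domain magnitude spectrum in which each harmonic is
# "smeared" over a band of bins instead of being a single spike.
spectrum = np.zeros(N // 2 + 1)
bins = np.arange(N // 2 + 1)
for harmonic in range(1, 32):
    center = harmonic * f0 * N / sample_rate      # bin index of this harmonic
    bandwidth = 20.0 * harmonic                    # smear higher harmonics more
    spectrum += np.exp(-((bins - center) ** 2) / (2 * bandwidth ** 2)) / harmonic

# Give every bin a random phase, then take one big inverse FFT.
phases = np.exp(2j * np.pi * np.random.rand(N // 2 + 1))
wave = np.fft.irfft(spectrum * phases)
wave /= np.abs(wave).max()

# Because the spectrum only contains energy at whole bins, `wave` is exactly
# periodic over its N samples: looping it end-to-start produces no click.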

The results are sounds that are simultaneously warm and rich, but with the possibility of bizarre, inorganic harmonics that transport the listener to an alien landscape. I seriously lost several hours just listening to chords built from sawtooth waves and missed my Aikido class.

The only problem is that because you have to design the sample and then generate a gargantuan array with an inverse FFT, you can't hear your edits in real time. I think a reasonable imitation could be achieved if a low-priority thread simply recomputed a new sample every time the user made an edit and applied it whenever it finished, though this might introduce a prohibitive CPU overhead on slower machines. It would make a nice optional feature for us SMP users, though.
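
Something like this is what I have in mind, sketched with Python's threading module. Everything here is hypothetical - render_sample is a stand-in for the expensive inverse-FFT render, not anything from zynaddsubfx itself:

import threading
import time

pending_params = None        # most recent edit, or None if there's nothing to do
current_sample = None        # the sample the audio engine is actually playing
lock = threading.Condition()

def render_sample(params):
    time.sleep(1.0)          # stand-in for the expensive inverse-FFT render
    return ("rendered", params)

def worker():
    """Low-priority background thread: always render the newest parameters."""
    global pending_params, current_sample
    while True:
        with lock:
            while pending_params is None:
                lock.wait()
            params, pending_params = pending_params, None
        sample = render_sample(params)     # slow part, done outside the lock
        current_sample = sample            # apply it whenever it's finished

def edit(params):
    """Called from the UI thread on every user edit."""
    global pending_params
    with lock:
        pending_params = params            # newer edits overwrite older ones
        lock.notify()

threading.Thread(target=worker, daemon=True).start()
edit({"detune": 0.3})
time.sleep(1.5)                            # give the worker time to finish
print(current_sample)                      # ('rendered', {'detune': 0.3})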

I've also been examining a few other open-source synths, but none have impressed me that much.

Bristol has a terrible installer (no autotools, just a script!) and the sounds are somewhat lackluster imitations of the analog synths of yore.

Ceres3 does spectral modeling, has a dinky install environment, was built for IRIX, seems to be slowly joining the Linux world by way of PPC(!), and has no intent of being more than an educational tool. Worst of all, they even note that the source was "written in Vim." Ick.

Mammut is a spectral modeling tool like Ceres3, but not a real-time synthesizer per se; it also lacks a maintainer, requires obscure libraries I've never heard of, and uses goofy install tools.

ATS is a better-looking spectral modeler, but no JACK support. D'oh.

RTSynth is nothing but a toy - for some reason it tries to replace JACK and LADSPA with its own incompatible and limited solutions. It also has a terrible installer.

Monday, February 26, 2007

A Parallel Huffman Coding Algorithm

In That Berkeley Article, I found the following:

Although 12 of the 13 Dwarfs possess some form of parallelism, finite state machines
(FSMs) look to be a challenge, which is why we made them the last dwarf. Perhaps FSMs
will prove to be embarrassingly sequential just as MapReduce is embarrassingly parallel.
If it is still important and does not yield to innovation in parallelism, that will be
disappointing, but perhaps the right long-term solution is to change the algorithmic
approach. In the era of multicore and manycore, popular algorithms from the sequential
computing era may fade in popularity. For example, if Huffman decoding proves to be
embarrassingly sequential, perhaps we should use a different compression algorithm that
is amenable to parallelism.
Well, I thought about this for a bit, and I realized there are simple methods for parallelizing Huffman decoding, provided there are sufficiently many processors.

A quick Google search shows that others have worked on this before. Oddly, though, I can't find any freely available description of the techniques they published.*

If you think about it, though, parallel Huffman decoding shouldn't be hard if we have a suitably large number of processors and we assume that communication time between processing elements is very short.

With Huffman coding, variable-length strings of 1s and 0s describe paths down a binary tree. The tree is uneven in depth, and some terminal leaves are closer to the root than others (the closer leaves are for the data items which occur most frequently).

I'll try to describe how this might work in fairly plain English and not use much formality, because it's pretty simple, really.

We require that our tree be implemented such that we can instantly calculate the location of a given subtree from its depth and its position at that level (the first, or second, or third...). This is possible if the tree is stored in an array with unused slots left empty. (I'm not sure of the standard term for this implementation, but I believe it's the same implicit array layout used for binary heaps.)
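
To make that concrete, here's a tiny sketch of the layout I mean (my own illustration, not taken from any particular decoder):

def child(i, bit):
    """Follow one code bit (0 = left, 1 = right) from the node in slot i."""
    return 2 * i + 1 + bit

def subtree_root(depth, j):
    """Slot of the j-th node at the given depth (0-based, left to right)."""
    return (1 << depth) - 1 + j

# Example: a tiny Huffman tree for the codes a=0, b=10, c=11.
# Internal nodes are marked "*", unused slots stay None.
tree = [None] * 7
tree[0] = "*"
tree[child(0, 0)] = "a"            # code 0
tree[child(0, 1)] = "*"            # codes starting with 1
tree[child(child(0, 1), 0)] = "b"  # code 10
tree[child(child(0, 1), 1)] = "c"  # code 11

# The second subtree at depth 1 can be located without descending from the root.
assert subtree_root(1, 1) == child(0, 1)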

All we need to do is choose a smallish integer, N, which will decide the amount of parallelism. Let's say our N is 4.

One processor will constitute the first set of workers, 2^4 = 16 processors will constitute the second, and 2^8 will constitute the third, and 2^12 will constitute the fourth, etc.

At time = 0, set one (the main processor) will decode the first N bits (4) of the coded input.

Also at time = 0, each processor in set two looks at a different subtree starting at depth 4. Since our tree is stored as described above, we do not need to descend the tree in order to find these subtrees, but can jump to them more or less instantly. There are at least as many processors as subtrees, and if we run out of subtrees, then some processors are inactive.

Also at time = 0, each processor in set three looks at a different subtree starting at depth 8. If there are more processors than subtrees, then some processors are inactive.

This continues for each set down the line, theoretically ad infinitum or until we run out of processors.

At time = 1, set one (the first processor) will have either found a terminal, or have found a subtree. That subtree will have been decoded by a processor in set two. That processor in set two will have found either a terminal or a subtree. That subtree will have been decoded by a processor in set three. That processor will have found a terminal or a subtree...

And so on.

So at time = 1, we just need to ask each processor whether it found a terminal or another subtree, then follow the chain of results until we reach a terminal.

So our runtime is roughly N * (time to descend one level of a binary tree) + floor(total depth of the terminal / N) * (time to communicate a result between two processors). Because of the way Huffman coding works, we can expect the total depth of the terminal to be small.
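
Here is a toy, purely sequential simulation of the bookkeeping, just to convince myself that the chaining works. Everything under "time = 0" is the work that would really be spread across processors, one (subtree, chunk-of-N-bits) pair each; the names and the example tree are mine:

N = 2                                     # chunk size (the text above uses 4)
INTERNAL = "*"                            # marker for internal nodes

def child(i, bit):
    return 2 * i + 1 + bit

def is_leaf(tree, slot):
    return tree[slot] is not None and tree[slot] != INTERNAL

def descend(tree, slot, bits):
    """Follow up to len(bits) code bits from `slot`, stopping early at a leaf."""
    used = 0
    for b in bits:
        if is_leaf(tree, slot):
            break
        slot = child(slot, b)
        used += 1
    return slot, used

# Tiny tree for the codes a=0, b=10, c=110, d=111.
tree = [None] * 15
for slot, val in [(0, "*"), (1, "a"), (2, "*"), (5, "b"), (6, "*"), (13, "c"), (14, "d")]:
    tree[slot] = val

code = [1, 1, 0]                          # should decode to "c"

# "time = 0": each set of processors descends its own chunk of bits from every
# occupied slot (a real implementation would only need the slots at that set's
# depth), without waiting for anyone else's result.
tables = []
for start in range(0, len(code), N):
    chunk = code[start:start + N]
    tables.append({slot: descend(tree, slot, chunk)
                   for slot in range(len(tree)) if tree[slot] is not None})

# "time = 1": chain the precomputed results together, starting from the root.
slot, total_bits = 0, 0
for table in tables:
    slot, used = table[slot]
    total_bits += used
    if is_leaf(tree, slot):
        break

print(tree[slot], "after", total_bits, "bits")   # -> c after 3 bits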

I'm pretty sure that this works. If anyone asks, I might write this up in a more formal way, but this seems so intuitive that I doubt it's worth the effort.

So in conclusion: Huffman coding is not embarrassingly sequential. In fact, it's neatly parallel - if you have effectively unlimited processors and don't mind spending them, anyway - and if this wasn't freely published before, it is now.



*Note: I haven't read any of these papers. The technique described here is entirely my own invention. It may or may not be similar to theirs.

Friday, February 23, 2007

Musings on Parallelism in Finite-State Automata

In a recent post, I responded to Berkeley's statement that FSA modeling is "embarrassingly sequential," i.e. will not benefit from parallelization. (If you're unfamiliar with FSA, check out my introduction, which also links to Wikipedia.)

In this article, I'm addressing possible grounds for parallelization in the two simpler families of FSA: stackless finite-state automata (deterministic and non-deterministic finite-state automata, DFAs and NDFAs), and finite-state automata with one stack (push-down automata, PDAs). The final family, two-stack PDAs, is equivalent to Turing machines and is just too large a problem for me to address, at least for now.

I'll try to address these families of machines with regards to the following questions:

  1. Where and when can we parallelize? Are there formal methods for establishing parallelism?
  2. How can we make this parallelism useful, and what applications could benefit?

The First Family: Stackless Finite Automata

"Normal" DFAs/NDFAs don't have stacks; they can recognize regular expressions.

I was initially shocked to hear that Berkeley would say that FSM modeling can't be parallelized. This is partly because any introductory textbook notes that DFAs and NDFAs are equivalent in power, and uses the word "parallel" to describe the way that NDFAs work. With an NDFA, when we "non-deterministically" pursue two transitions on the same input, we can essentially think of our machine splitting into two identical copies, one of which goes through the first transition, and one of which goes through the other. The two machines run in parallel. This would be easy to implement on a parallel-processor machine: you just spawn multiple threads.

In practice, though, this is not how NDFAs get implemented; they usually just get converted to DFAs. There is a well-known method (the subset construction) for converting an NDFA to a corresponding DFA, though it can lead to an exponential increase in the number of states.
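
For the curious, the subset construction itself is only a few lines. This is a generic sketch assuming the NDFA is given as a dict from (state, symbol) to a set of states, with no epsilon-transitions:

def ndfa_to_dfa(ndfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NDFA states."""
    start_set = frozenset([start])
    dfa = {}                               # (frozenset, symbol) -> frozenset
    todo, seen = [start_set], {start_set}
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            nxt = frozenset(s for state in current
                            for s in ndfa.get((state, symbol), ()))
            dfa[(current, symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return dfa, start_set

# The rule from later in this post: in S0, input A goes to both S1 and S2.
ndfa = {("S0", "A"): {"S1", "S2"}, ("S0", "B"): {"S3"}}
dfa, start = ndfa_to_dfa(ndfa, "S0", ["A", "B"])
print(dfa[(start, "A")])                   # frozenset({'S1', 'S2'}) -- one DFA state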

But this actually doesn't affect runtime. It seems intuitive that having exponentially fewer states would lead to exponentially faster runtimes, but it doesn't. In an NDFA, parsing 01110 takes five transitions (possibly more if you allow epsilon-transitions). In a DFA, parsing 01110 takes five transitions. This is true for all strings: the minimum number of transitions is determined by the number of input items. This is trivial to prove, and I'm amazed I didn't realize it sooner.

There do exist methods for reducing the number of states and transitions in an NDFA. (See also: this one.) These reductions are NP-hard, but they can be done.

I don't think the above maximizes non-determinism, but in most cases I would imagine that adding non-determinism is the best way to reduce the number of states. I'm also all but sure that for the same regular expression, the minimal NFA has as many or fewer states than the minimal DFA.

So Berkeley is essentially correct in this case - stackless DFAs and NDFAs are "embarrassingly sequential" regarding runtime. We could, however, reduce memory usage for some problems if we used NDFAs.

So in response to question 2, parallelizing FAs could be useful in the memory domain. Memory usage is going to be an increasingly large problem in the future, as on-chip memory is going to remain expensive even as the number of cores on a chip increases. And in the real world, memory access times are often the limiting factor in execution time for a computer program.

The problem with parallelizing an NDFA the naive way I described earlier is going to be the overhead of process creation and destruction. To solve this, imagine the following: we have an effectively infinite number of cores, each with a small per-core memory and a larger per-chip memory. To simulate an NDFA, execute the following steps:

  1. start an NDFA-modeling process on each core, and choose one as the "leader" process.
  2. let each process follow transitions simultaneously on the FA being modeled. They read instructions from the per-chip memory. If the transition is non-deterministic, the leader process divides the affected processes among the different transitions. If the transition is deterministic, each affected process simply follows it.
  3. Loop step 2 until input is exhausted.
  4. Check if any processes are in an accept state. If so, then accept.
This approach will allow us to use our many, many cores as a means to reduce memory usage, which means fewer costly fetches from more "distant" memory (RAM, hard disk) and effectively reduces run time. A clever implementation could remove the concept of a "leader" process so that no core-to-core communication is required. The only problem here, of course, is that the number of cores isn't really infinite, so we will need some ugly tricks to make sure we don't run out of cores.
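
Here's a toy, single-threaded simulation of steps 1 through 4. The per-core processes are just entries in a list, the "leader" work is the reassignment inside the loop, and the example machine is made up:

def run_ndfa(ndfa, start, accepting, word, num_cores=8):
    cores = [start] + [None] * (num_cores - 1)        # step 1: one active core
    for symbol in word:                                # steps 2-3: lockstep
        branches = []
        for state in cores:
            if state is None:
                continue
            branches.extend(ndfa.get((state, symbol), ()))
        branches = list(dict.fromkeys(branches))       # merge duplicate states
        if len(branches) > num_cores:
            raise RuntimeError("ran out of cores")     # where the ugly tricks go
        cores = branches + [None] * (num_cores - len(branches))
    return any(state in accepting for state in cores if state is not None)  # step 4

# A made-up machine with one non-deterministic choice on A in S0.
ndfa = {("S0", "A"): {"S1", "S2"}, ("S1", "B"): {"S0"}, ("S2", "B"): {"S2"}}
print(run_ndfa(ndfa, "S0", {"S2"}, "AB"))              # True: one branch ends in S2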

Regular expressions get used all over the place, and there are known methods for converting from regular expression to NDFA. It would be mighty neat to have a more memory-lightweight Perl, though I'm not sure whether physics/gaming/business endeavors use a lot of NDFA models. However, if I'm right that there are methods for finding the non-determinism in FAs, then we can automatically seek out non-determinism and take advantage of it.





The Second Family: One-Stack Push-down Automata

One-stack push-down automata (generally, PDAs) and their non-deterministic brethren (NDPDAs) recognize languages generated by context-free grammars (CFGs).

While I haven't seen any method for maximizing non-determinism in an NDPDA, I do know that in real life we get quite a bit of non-determinism when converting from CFGs to NDPDAs, so I haven't dedicated a whole lot of thought to figuring out how to maximize non-determinism or minimize the number of states.

So in response to question 1: we can definitely parallelize using the non-determinism of NDPDAs in a way that reduces run time.

In response to question 2: it is probably worth parallelizing various implementations of PDAs.

Here's the gist of it: PDAs are used for recognizing strings generated by a context-free grammar. They are also used in real life for generating "parse trees," which show how expressions in a CFG were derived. This is the first step in both compilation and interpretation, and is hugely useful in computer science.

In short, most textbooks (and numerous places on the internet) explain simple methods for converting a CFG to a NDPDA. The non-determinism is easy to take advantage of here in parallel systems.

As an example, say we have the following grammar (highly contrived):

[S] := [A] | [B]
[A] := 010 [A] | 010
[B] := 000
where S, A, B are non-terminals and 0 and 1 are terminals.

A simple NDPDA would start with the following:

  1. Put S on the stack
  2. Nondeterministically choose one of the following: popping S and pushing A, or popping S and pushing B.

As terminals reach the top of the stack and are matched against the input, one of the two paths will error out, and the other will, if the string exists in the language, eventually be accepted.

While this language is highly contrived indeed, this is much the kind of thing that happens with real-life parsers.
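
To sketch how the branching might look in code, here's a toy breadth-first simulation of a parse for this grammar. Each configuration in the frontier is one branch that could, in principle, live on its own core; the helper names are mine:

GRAMMAR = {
    "S": [["A"], ["B"]],
    "A": [["0", "1", "0", "A"], ["0", "1", "0"]],
    "B": [["0", "0", "0"]],
}

def accepts(word):
    frontier = [(0, ["S"])]                            # (chars consumed, stack)
    while frontier:
        new_frontier = []
        for pos, stack in frontier:
            if not stack:
                if pos == len(word):
                    return True                        # input and stack both exhausted
                continue
            top, rest = stack[0], stack[1:]
            if top in GRAMMAR:                         # nonterminal: one branch per production
                for rhs in GRAMMAR[top]:
                    new_frontier.append((pos, rhs + rest))
            elif pos < len(word) and word[pos] == top:
                new_frontier.append((pos + 1, rest))   # terminal matched, consume it
            # otherwise this branch has errored out and is simply dropped
        frontier = new_frontier
    return False

print(accepts("010010"))   # True  (S -> A -> 010 A -> 010 010)
print(accepts("000"))      # True  (S -> B -> 000)
print(accepts("0110"))     # False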

It is also important to note here that there are many families of CFGs; some are infinitely recursive in a way that deterministic models can't deal with, and that parallel implementations won't be able to deal with either. So unlike with stackless FAs, we now have a distinction between parallel and non-deterministic.

So in response to question 2: so long as our context-free languages are not infinitely recursive, we can definitely parallelize. Parsers descend potential parse trees all the time, and often have to reject them and try new ones; descending multiple parse trees at once could save a huge amount of parsing time.

Of course, parsing is often considered a one-time job. Most software gets parsed, compiled, and saved in a compiled form. But I strongly hope that parallel interpretation will help improve the latency of interpreted languages, as interpreted languages are generally more flexible and portable.

It would be particularly nice to see a language designed for maximally parallel interpretation. Further, it would even be possible to non-deterministically make assumptions about code as it is being parsed and attempt various optimized branches, throwing away those which are based on incorrect assumptions. With a sufficient number of cores, this approach might actually see some beneficial returns (whereas currently, optimized just-in-time compilation adds more latency than it removes).


Conclusions

In summary, I think that non-determinism and parallelism have an important but surprisingly complex relationship. While many finite-state automata problems won't be improved by increasingly parallel technology, the "manycore" movement might speed up some problems (stackless FAs), will definitely speed up others (CFGs/PDAs), and will hopefully encourage some new developments (better interpreted languages).

Short Intro to Finite-State Automata (FSA, DFA, and NDFA)

I'm adding this post so I can clarify my writing on the topic of parallelizing finite-state automata (see my previous entry) and so my non-CS friends can potentially try to figure out what I'm so interested in right now.

Wikipedia has a pretty good article, too.

Deterministic Finite Automata: DFAs

An FSA (sometimes "finite-state machine," FSM) is an imaginary sort of machine consisting of a set of states and a set of transitions between them. The article notes that collision detection is an FSA problem, but I've been raised on thinking of FSAs as language-parsing machines. A very simple FSA is a deterministic finite automaton (or DFA). A DFA has a finite number of states, and given an event - like reading a 0 or a 1, an "A" or a "B" - the machine moves to another state.

An example is a machine which operates on the alphabet {A, B} and runs as long as it sees alternating "A"s and "B"s, e.g. ABABAB, but not ABBA. I.e., the machine accepts the language of nonempty strings of strictly alternating As and Bs.

We'll always start in a state called S0. In S0, there is a transition on A to S1 and a transition on B to S2. In S1, there is a transition on B to S2; in S2, there is a transition on A back to S1. On all other events, we transition to Serror, indicating that the machine is in a "reject" state. So a graph of this might look like this:

S0:     A -> S1
        B -> S2

S1:     B -> S2
        A -> Serror

S2:     A -> S1
        B -> Serror

Serror: A or B -> Serror.

So now if we start in S0 and feed the machine the inputs A, B, A, it goes through states:
S0, S1, S2, S1.

If we feed the machine ABBA, it goes through states:
S0, S1, S2, Serror, Serror, and we know ABBA is not in our "language" of alternating As and Bs.

(I am simplifying a little here - technically all DFAs have an "accept" state as well as an error state, and there's an implicit "end of word" character. If you care, imagine that on "end of word", S1 and S2 transition to SAccept.)
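
For the non-CS readers, the entire machine above fits in a few lines of Python - just a table of transitions and a loop:

# The alternating-A/B machine above, as a transition table plus a loop.
TRANSITIONS = {
    ("S0", "A"): "S1", ("S0", "B"): "S2",
    ("S1", "B"): "S2", ("S1", "A"): "Serror",
    ("S2", "A"): "S1", ("S2", "B"): "Serror",
    ("Serror", "A"): "Serror", ("Serror", "B"): "Serror",
}
ACCEPTING = {"S1", "S2"}          # the states that go to SAccept on "end of word"

def run(word):
    state = "S0"
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
        print(state, end=" ")
    print()
    return state in ACCEPTING

print(run("ABA"))     # S1 S2 S1 -> True
print(run("ABBA"))    # S1 S2 Serror Serror -> False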

It is also worth noting here that we can extend DFAs - for example, we can have DFAs with stack behavior (when taking input, they look at the top of an imaginary stack, and then push or pop something on/from the stack). DFAs with stacks are generally called "push-down automata." PDAs can actually parse languages like C++, or Python, or Algol, or Lisp. They're very powerful indeed.

So in short, a DFA is a very powerful tool, and one that is generally well-researched, and if you believe that any math problem is actually a language problem (hint: it is), then some flavor of automaton can be built to attack pretty much any problem. And of course, because they are well-researched and easy to model on a computer, they get used quite a lot.


Non-Deterministic Finite-State Automata



There is also a whole family of FSA called non-deterministic finite-state automata, or NDFAs. These machines are generally somewhat difficult to conceptualize, because at any given time we need to imagine the machine as being in a set of one or more states. We might have a rule like this:

S0: A -> S1
    A -> S2
    B -> S3

i.e.: When in state S0, if given the input A, we now transition to states S1 and S2.

It is often said that when using an NDFA, these kinds of transitions invite us to think of the machine as splitting into several different machines, and if any of those machines reach an accept state, then the word is in our language.
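
In code, "being in a set of states" is literally just a set. One step of the rule above looks like this (my own illustration):

NDFA = {("S0", "A"): {"S1", "S2"}, ("S0", "B"): {"S3"}}

def step(states, symbol):
    """Follow one input symbol from every state we might currently be in."""
    return set().union(*(NDFA.get((s, symbol), set()) for s in states))

print(step({"S0"}, "A"))   # {'S1', 'S2'} -- the machine has "split in two"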

Tuesday, February 20, 2007

In Response to Berkeley's "Landscape of Parallel Computing Research"

I recently read The Landscape of Parallel Computing Research: A View from Berkeley. It's a really great article, and it's freely available, and if you're interested in the future of high-performance computing or operating systems (or generally any computing that will be affected by the "manycore" movement in processor design) I strongly suggest it.

In the article, they settle on the thirteen major computational kernels ("dwarfs") which they expect to underlie most important problems in the future, essentially setting up the SPEC of the future. Of the thirteen, twelve are heavily parallel.

The thirteenth problem is modeling finite-state automata (though they use the words "finite-state machine" or FSM). If you don't know what FSAs are, I just posted a brief summary in this blog which should be just enough to help this entry make sense.

So now, back to the article. The Berkeley paper mentions the following in passing:

Perhaps FSMs will prove to be embarrassingly sequential just as [another problem] is embarrassingly parallel. (page 14).
I found this statement immensely interesting, and also a major oversimplification. You can imagine that computing the state of a deterministic finite machine doesn't lend itself to much parallelism: at any given time we are in exactly one state, and after one input, we move to exactly one more state.

And if FSAs are hopelessly sequential, then we have a whole family of problems that can't be solved any faster by an 80-core processor in 2010 than by a 1-core processor in 2002.

However, this does not factor in the possibility that our FSA is not a DFA. There is also a whole family of FSA called non-deterministic finite-state automata, or NDFAs. These machines are generally somewhat difficult to conceptualize, because at any given time we need to imagine the machine as being in a set of one or more states. We might have a rule like this:

S0: A -> S1
    A -> S2
    B -> S3

i.e.: When in state S0, if given the input A, we now transition to states S1 and S2.

It is a known fact that any NDFA can be trivially converted back to a DFA, which is just wonderful for sequential programming. But a well-designed NDFA can be much simpler (far fewer states, fewer steps towards completion) than an equivalent DFA. And in fact, NDFAs are often more intuitive for a human to design, and they map more naturally onto problems that are inherently non-deterministic.

However, NDFAs aren't very popular because:
  • NDFAs can't do anything a DFA can't (because every NDFA has an equivalent DFA),
  • there are very simple, well-researched methods for turning an NDFA into a DFA,
  • there are simple, known rules and methods for manipulating DFAs automatically, and
  • DFAs get along very nicely with sequential programming.
So historically, since DFAs are simple and get along quite well with sequential programming, NDFAs have, in my past education, been treated as sort of a footnote. I'm not even sure there exist known methods for converting a complex DFA into a simpler NDFA (though I'd imagine there are).

In conclusion, I dispute Berkeley's suggestion that FSAs are embarrassingly sequential and thus problematic for the future. What we need is a lot more research into a tricky and complex topic, particularly methods for creating simpler NDFAs (this might be very tricky - to the point of being an AI problem) and tools that allow us to utilize multiple processors to model NDFAs (this should be pretty simple).

More on this in a later post.