Friday, November 09, 2007

Gentoo and Firefox-bin

On my home Gentoo machine, my Firefox experience has been painfully slow lately. I haven't been sure why. So I emerged firefox-bin, the pre-compiled version. It's probably twice as fast - though still sluggish compared to comparable Fedora machines I've been working with. What the heck is Gentoo doing so wrong? Ack.

Portage is the right approach to package management. I don't know if I could stand apt. It's a shame that Portage is so caught up in the Gentoo community, which is just painfully stupid these days.

Friday, October 26, 2007

Hardware tools for automated rootkit detection

eBay recently reported that, surprisingly, a large number of their Linux boxes have been falling victim to rootkits and joining spam botnets. So Linux is officially a target, which got me thinking...

As far as I know, the best way to beat a rootkit is to run another, trusted OS and have it examine the disk of the suspect machine, for example by checksumming the kernel image and other system files to make sure they haven't changed.
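As a rough sketch of what that scan could look like (the mount point and file list here are hypothetical, and the baseline checksums would have to live somewhere the suspect machine can't write):

import hashlib

def fingerprint(paths):
    # Run from a trusted OS with the suspect disk mounted read-only,
    # then diff the output against a known-good baseline.
    sums = {}
    for path in paths:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                digest.update(block)
        sums[path] = digest.hexdigest()
    return sums

# hypothetical mount point for the suspect machine's disk
print(fingerprint(["/mnt/suspect/boot/vmlinuz", "/mnt/suspect/bin/login"]))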

Unfortunately, this requires a human hand to shut down the machine and boot off of a CD.

It would be nice if, for example, you could configure a cron job to shut down the machine periodically and just leave a CD in the tray, with BIOS configured to boot the CD and do rootkit detection before rebooting from the disk.

Unfortunately, the rootkitted OS could simply refuse to reboot, or even give a reasonable impression of rebooting and scanning itself.

So I've been thinking that it would be more practical for companies like eBay, who have significant resources and wish to avoid becoming botnets, to have a single, very secure computer (thoroughly protected from the outside world) which would be capable of forcibly shutting down other machines, mounting their disks, and doing checksums, over a dedicated and very secure network. If the server load is already spread out and redundant for fault-tolerance, this wouldn't really be an inconvenience.

Unfortunately, I'm not sure this is possible - as far as I can tell, you can't buy, off the shelf, a machine with a NIC wired directly to the motherboard that can power off the whole machine without even contacting the OS.

The Trusted Platform Module, or TPM, is actually capable of doing something similar. It can DMA system memory and potentially cut power to the CPU if certain conditions are or are not present. The problem, then, is how to communicate with the TPM without going through the parent OS.

Anybody know a good way? And how hard could it really be for hardware manufacturers to build in the required tools to shut down a system from a remote machine? Do any such hardware manufacturers exist?

Saturday, August 04, 2007

Kororaa Linux

It's been a long few months!

I've been busy lately, hence the lack of updates. I'm back to work at LSST, integrating the Moving Object Pipeline System (which identifies objects in various images of the sky, then identifies common objects in images, then builds a database of their orbits and predicts their motion) into the overall LSST data framework.

Today, I took a break from it all to install Kororaa Linux. Kororaa is a Gentoo fork whose LiveCD ships full AIGLX/XGL environments, allowing all the nifty "wobbly windows" effects and OS X-style Exposé features. Since I now have an ATI Radeon card, I thought it might be a fun toy. Best of all, the installer actually gets a whole running Gentoo system installed in under an hour, pre-configured for super-slick video tricks.

Kororaa uses only open-source drivers, so it is morally correct, but unfortunately, the open-source Radeon drivers proved to be tragically slow. While I thought the lagginess of the system was related to running off a LiveCD, it didn't go away when I installed to a hard drive.

Furthermore, it didn't install cleanly. GRUB was set up incorrectly, and the xorg.conf was actually not quite right. I was willing to forgive all this - after all, GRUB hard disk nomenclature doesn't always coincide with Linux's, and the whole thing is based on Gentoo, so I could tweak the video drivers all I wanted. Or so I thought.

Unfortunately, 'emerge sync' managed to break Portage.

Three strikes, Kororaa. You were fun while you lasted. I might be back, someday, to manually install portage and try out some new video drivers, but don't count on it.

Sunday, May 13, 2007

Setting up my new Website with Joomla!

So I decided to build jonny-ash.com as a combination professional/personal resource. For employers, my site needs the following:

  • Contact information that is protected from spam-bots
  • Resume
  • Some "about me" information
  • References to this blog, which, though unprofessional, does demonstrate my dedication to computer science.
For friends and family, I wanted:

  • Links to all my identities on different web resources (YouTube, MySpace, Flickr)
  • Aggregation (as much as possible) of those identities
This really just boils down to a few use cases. I seriously considered doing it with PHP and calling it a day, but I decided to be adventurous, get out of the "coder" mindset, and try an existing content management system - after all, this "Web 2.0" noise is essentially about using existing software over the web rather than doing everything yourself. Since Joomla! has a canned installation system on DreamHost (my web host), it seemed like the clear choice.

Now that all is said and done, I'm actually fairly happy with Joomla! as a tool. It's a MySQL-backed system that does a good job of separating presentation (style), content, and navigation. You get a user-friendly backend which allows addition of content and drop-in template changes (totally configurable, if you really care), and all the actual pages are generated with PHP. It's virtually seamless.

Of course, CSS allows us to separate style information from content in HTML, and this is generally good practice that is observed all over. But keeping navigation distinct from content is something really handy, and it's great to achieve it without resorting to hack solutions (IMHO) like DreamWeaver.

It integrates neatly with Flickr and LiveJournal using the Feed Gator extension. Blogger actually comes out pretty ugly, because Blogger entries have no "title" heading, which makes everything pretty unreadable. There are tons of other freely available components and modules as well, and some pretty slick templates.

So though it was overkill, and setup probably took twice as long as the PHP coding would have, and although Joomla! is occasionally so complex that it gives me headaches, I'm pleased with my decision. Why? Well, mostly because it now takes half a second to install a new theme and completely revamp my website's look. Having once spent a whole summer accomplishing just that much for a college department, that's pretty danged satisfying.

Saturday, May 12, 2007

jonny-ash.com

Introducing jonny-ash.com.

I've finally invested in some real web hosting via DreamHost.

Wednesday, March 21, 2007

The ugly side of the "Web 2.0" approach

I recently did some work on my Google Summer of Code application. They want to know where my online resume is published. I put one up on Google Docs and made it public. Now anyone can spider the Web and read my home address, phone number, and e-mail address. The e-mail address was intentionally obfuscated from XYZ@gmail.com to "GMail: XYZ" in hopes that I won't be inundated with "VIxAy Gzzzz RA" and "hot nuked grils" in my inbox.

So now I find that I have the following:

So now we have tools like OpenID, but of the above, LiveJournal is the only one that supports it, so that helps, uh, none. Friend-of-a-Friend (FOAF) is a machine-friendly method for describing relationships between people (like how we have "friends" on Facebook, "friends" on LiveJournal, and "friends" on Flickr). But again, nobody uses it, so that doesn't help.

So I could list on LJ, Facebook, Blogger, YouTube... etc, every one of this increasingly large list of content hosts. This is annoying, and it also means that a potential employer who reads my blog has to go to my Blogger profile to find my resume, and they'll also find things like my personal music work (which, honestly, isn't something that makes me feel at ease.) And really, I want my friends to know about my YouTube videos and music and such. And I shouldn't have to obfuscate my actual contact information in my resume or expect employers to use Facebook (heh).

So I'm now contemplating going whole-hog before my inbox gets spammed to death and just buying some web space and registering, say, jonny-ash.com. Here's what I want for myself:

Some web service which will keep track of my online identity in all its forms. Something that says "I am X on OpenID, Y on Blogger, Z on Flickr." I also want it to allow me to present myself in a professional way while keeping added goodies for people who want to know about me personally, and if possible, I'd like to restrict access to the personal stuff to friends. And lastly, I'd like it to have actual contact information that is protected from spiders and crawlers that sell my information to spammers.

The thing is, a service which *does* all this is totally possible. Nobody's done it yet.

I could.

Here's how it would work: It should support open standards like OpenID and FOAF. When a user logs in, they can set up an account with links to their various web resources. They should be able to add contact information and make it public or private, but hide it behind a "verify that you are human" image so spiders can't read it. They should also be able to hide other information behind FOAF: my "friends" and their "friends" (ideally, whether they are "friends" on Facebook, or LJ, or...) should be able to see my Aikido videos and music recordings. Other people shouldn't.

All of this should be pretty trivial to do, and I think I could get the web hosting for, say, $6 a month. I could sell advertising.

Would anyone be interested? Does anyone care? Should I just set up a personal website for myself and let everyone else do the same...?

Thursday, March 08, 2007

Finally, how to Really Parallelize *Any* Stackless DFA

I posted a little while ago about my idea for parallelizing a Huffman Decode. What I didn't realize at the time is that this could be generalized to parallelize any stackless DFA. Here's a simple writeup of a method for calculating the end state of a stackless DFA with input A, on a computer with infinite processors:

Break the input, A, into N equal (or nearly equal) substrings, A1, A2... AN.

Next, create groups of processors, G1, G2... GN. G1 will be a single processor, which starts in the start state of the automaton. Each Gi (where i > 1) will be a group of S processors, where S is the number of states in the machine, and each processor in the group starts in a different state of the machine.

Each processor in Gi will process the input Ai.

Call G1 "finished" when it has either consumed all of A1 or reached a "finish" state (e.g. SError, or, say, a "terminal" leaf in a tree). Call Gi (where i > 1) "finished" when one of its processors has been contacted by Gi-1 and that processor has finished processing all of Ai.

When Gi finishes, it is either in a "finish" state (i.e. SError) or some other state. If it is in a "finish" state, return that state. Otherwise, Gi contacts the processor in Gi+1 which started in the state where Gi finished. The state returned by GN is the final state of the automaton.
Simple enough, right? Funny that I haven't seen this before - and that I hadn't thought of it sooner.
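Here's a minimal sequential simulation of the scheme (all names are mine; the dictionary comprehension over the states stands in for the S processors of a group all running at once):

def final_state(delta, states, start, chunks):
    def run(state, chunk):
        for symbol in chunk:
            state = delta[(state, symbol)]
        return state

    state = run(start, chunks[0])        # G1 really runs from the start state
    for chunk in chunks[1:]:
        # Gi (i > 1): one "processor" per state, all running speculatively...
        table = {s: run(s, chunk) for s in states}
        state = table[state]             # ...and the hand-off picks the winner
    return state

# toy example: parity-of-ones automaton, input split into three chunks
delta = {("even", "0"): "even", ("even", "1"): "odd",
         ("odd", "0"): "odd", ("odd", "1"): "even"}
print(final_state(delta, ["even", "odd"], "even", ["01", "11", "10"]))  # even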


Realistic Applications

Of course, the real drawback here is that for a speedup of N, we need (N-1)*S processors (where S is the number of states in the machine). If we have very few states and a huge input, this could potentially be useful. However, if we have a huge number of states, we will require an even larger number of processors to get a speedup. Furthermore, this is going to generate a lot of wasted cycles and drive up electricity bills, which is an increasing concern.

However, if we take this general method as a model for parallel DFA modeling, then it starts to get pretty useful. For example, in the Huffman decoding method I described in a prior post, we know that the DFA is actually a tree, so group Gi only needs to simulate the subtrees which start at a given depth (namely len(A0) + len(A1) + ... + len(Ai-1), the amount of input consumed by the earlier groups), which is likely to be much smaller than the total number of states in the DFA.

More generally, if we know beforehand what states are reachable after a given number of inputs (which can be computed easily), we can often reduce the number of processors which must be used.
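Computing that reachable set is a simple frontier expansion - a quick sketch, assuming the same transition-table shape as above; group Gi would then spawn one processor per state in this set rather than one per state in the whole machine:

def reachable_after(delta, alphabet, start, k):
    # states reachable from `start` after exactly k input symbols
    frontier = {start}
    for _ in range(k):
        frontier = {delta[(s, a)] for s in frontier for a in alphabet}
    return frontier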


Long-term, Pie-in-the-Sky Approaches

The approach I described before is actually very similar to some forms of branch prediction. A certain branch-predicting processor might look at a branch that happens on a conditional, then actually start computing two different trains of logic - one based on the assumption that the conditional will evaluate to "true," and one on the assumption that the conditional will evaluate to "false" (okay, so most processors are a little more complex, but that's the idea). When the conditional is evaluated, one of these strings of computation is thrown out and the results of the other are retained.

Imagine if our computer knew that a certain computation-heavy function would return one of X different values, and simply assigned a processor to each of those assumptions, retaining the results from the one processor whose assumption proved correct. It's not all that different from the method I've described for simulating DFAs.

Granted, at this point, this approach is only useful if we know that the function will return one of a handful of known values. But what if we effectively had an infinite number of processors to throw at this problem? Say that we had a function which returns a 4-byte integer. That's quite a few possibilities, but what if we still just assigned a processor to each of them? If the number of cores in a processor doubles every 18 months, then we might actually be able to do something like this - and while it would be a dumb way to parallelize all our sequential code, it would still be possible to do entirely at the hardware level.
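As a toy illustration of that "shotgun" idea in software - everything here is hypothetical; slow_predicate stands in for the computation-heavy function, which we happen to know returns only 0 or 1:

from concurrent.futures import ProcessPoolExecutor

def slow_predicate(n):
    total = sum(i * i for i in range(10_000_000))   # expensive busywork
    return (total + n) % 2

def continuation(assumed):
    # work that depends on the result, computed speculatively for each value
    return [assumed + i for i in range(5)]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # start a speculative continuation for every possible result...
        branches = {v: pool.submit(continuation, v) for v in (0, 1)}
        actual = pool.submit(slow_predicate, 42).result()
        # ...then keep the branch matching the real result, discard the rest
        print(branches[actual].result())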

In the meantime, more formal languages allow us to specify the exact range of values a function could return. Again, this could make such a "shotgun" approach as I have described much less wasteful.

D-Wave's Quantum Computing Not Totally Legit

I've been trying to learn about that damned quantum computing demonstration from D-Wave for a while now. Ars Technica's article is the first one that's satisfied me.

The most important detail: Quantum computing, at least as D-Wave has it, does not mean that NP-complete problems will be solvable in polynomial time. I had heretofore been under the impression that they might be.

Thus, computer scientists will not be rendered obsolete.

Thank you, Ars Technica, for clearing that up.

OpenBox is Great

I've grown disenchanted with FluxBox. I'd been loyal, but I had some very strange issues with X/FluxBox going into a nasty freeze, and the FB community was less-than-helpful.

I've tried Blackbox, and I have to say, it's too damned ugly. Admittedly, I never tried very hard at fixing that.

So today I tried OpenBox. It is a breath of fresh air. It is beautiful. It gets all my fonts right. And the more I learn about it, the better it seems. I think my first feeling of excitement came from reading the OpenBox "about" page:

Openbox works with your applications, and makes your desktop easier to manage. This is because the approach to its development was the opposite of what seems to be the general case for window managers. Openbox was written first to comply with standards and to work properly. Only when that was in place did the team turn to the visual interface.
Finally, a WM with its head on straight.

The behavior is way more configurable than any WM I've seen before, because anything you can imagine is specifiable through an XML configuration file. Writing even a simple menu in XML is tedious, but tools like denu greatly simplify the process.

One of the things that most impressed me was that they made a point of complying with freedesktop.org standards. If the post-fd.o world is really this well-done, the Free world will finally be a wonderful thing.

Tuesday, February 27, 2007

The Usable Linux Softsynth Roundup

I've recently been blown away by the incredible sound expressiveness of this open-source synthesizer, zynaddsubfx. I've always known it had some really great sound possibilities, but I recently started actually studying its inner workings, and I'm completely amazed. If anyone here is interested in learning about making beautiful noises, definitely check it out.

The biggest deal for me is that the creator has invented what appears to be (as far as I know) a new type of sound synthesis. He calls it PADsynth, and there's a complete description of the algorithm, albeit in broken English, on the website. The brilliance of it is that it takes advantage of the fact that an inverse FFT always generates a perfectly periodic sample, which for audio purposes means completely seamless, click-free looping.
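Here's my loose reading of the idea in Python - the parameter names and the choice of a Gaussian profile width are my guesses, not the author's exact recipe: build the magnitude spectrum directly, smear each harmonic across a "bandwidth" of bins, randomize every phase, and inverse-FFT.

import numpy as np

def padsynth(n=2**17, samplerate=44100, f0=220.0,
             amplitudes=(1.0, 0.5, 0.33, 0.25), bw_hz=20.0):
    mags = np.zeros(n // 2 + 1)
    bin_hz = samplerate / n
    bins = np.arange(len(mags))
    for i, amp in enumerate(amplitudes, start=1):
        centre = i * f0 / bin_hz              # this harmonic's centre bin
        width = max(i * bw_hz / bin_hz, 1.0)  # wider bands for upper harmonics
        mags += amp * np.exp(-((bins - centre) ** 2) / (2 * width ** 2))
    phases = np.random.uniform(0, 2 * np.pi, len(mags))
    sample = np.fft.irfft(mags * np.exp(1j * phases), n)
    return sample / np.abs(sample).max()      # normalized to +/- 1

The returned array loops with no click because the inverse FFT's output is periodic by construction.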

The results are sounds that are simultaneously warm and rich, but with the possibility of bizarre, inorganic harmonics that transport the listener to an alien landscape. I seriously lost several hours just listening to chords built from sawtooth waves and missed my Aikido class.

The only problem is that because you have to design the sample and then generate a gargantuan array with an inverse FFT, it's not possible to hear your edits in realtime. I think a reasonable imitation could be achieved if a low-priority thread simply recomputed a new sample every time the user made an edit and applied it whenever it finished, though this might introduce an impossibly large CPU overhead on slower machines. It would make a nice optional feature for us SMP users, though.

I've also been examining a few other open-source synths, but none have impressed me that much.

Bristol has a terrible installer (no autotools, just a script!) and the sounds are somewhat lackluster imitations of analog synths of yore.

Ceres3 does spectral modeling, has a dinky install environment, was built for IRIX, seems to be slowly joining the Linux world by way of PPC(!), and has no intent of being more than an educational tool. Worst of all, they even note that the source was "written in Vim." Ick.

Mammut is a spectral modeling tool like Ceres3, but not a real-time synthesizer, per se; it also lacks a maintainer, requires absurd libraries I've never heard of, and uses goofy install tools.

ATS is a better-looking spectral modeler, but has no JACK support. D'oh.

RTSynth is nothing but a toy - for some reason it tries to replace JACK and LADSPA with its own incompatible and limited solutions. It also has a terrible installer.

Monday, February 26, 2007

A Parallel Huffman Coding Algorithm

In That Berkeley Article, I found the following:

Although 12 of the 13 Dwarfs possess some form of parallelism, finite state machines (FSMs) look to be a challenge, which is why we made them the last dwarf. Perhaps FSMs will prove to be embarrassingly sequential just as MapReduce is embarrassingly parallel. If it is still important and does not yield to innovation in parallelism, that will be disappointing, but perhaps the right long-term solution is to change the algorithmic approach. In the era of multicore and manycore, popular algorithms from the sequential computing era may fade in popularity. For example, if Huffman decoding proves to be embarrassingly sequential, perhaps we should use a different compression algorithm that is amenable to parallelism.
Well, I thought about this for a bit, and I realized there are simple methods for parallelizing Huffman decoding, provided there are sufficiently many processing nodes.

A quick Google search trivially demonstrates this fact. However, oddly, I can't find any freely available description of the technique they published.*

If you think about it, though, parallel Huffman Decoding shouldn't be hard if we have a suitably large number of processors, and we assume that communication time between processor elements is very short.

With Huffman coding, variable-length strings of 1s and 0s describe paths down a binary tree. The tree is uneven in depth, and some terminal leaves are closer to the root than others (the closer leaves are for the data items which are most commonly accessed).

I'll try to describe how this might work in fairly plain English and not use much formality, because it's pretty simple, really.

We require that our tree be implemented such that we can instantly calculate the location of a given subtree from its depth and its position at that level (the first, or second, or third...). This is possible if the tree is implemented as an array in the implicit, heap-style layout, with unused slots filled with a NULL marker.
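In that layout, locating the k-th subtree at a given depth is a one-line computation - a quick sketch:

def subtree_index(depth, k):
    # heap-style layout: children of node i live at 2*i + 1 and 2*i + 2,
    # so the k-th node at `depth` sits at 2**depth - 1 + k
    return (1 << depth) - 1 + k

print(subtree_index(4, 3))   # the fourth subtree at depth 4 is at slot 18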

All we need to do is choose a smallish integer, N, which will decide the amount of parallelism. Let's say our N is 4.

One processor will constitute the first set of workers, 2^4 = 16 processors will constitute the second, and 2^8 will constitute the third, and 2^12 will constitute the fourth, etc.

At time = 0, set one (the main processor) will decode the first N bits (here, 4) of the coded input.

Also at time = 0, each processor in set two looks at a different subtree starting at depth 4. Since our coded tree is implemented correctly, we do not need to descend the tree in order to find the subtrees, but can access them more or less instantly. There are at least as many processors as subtrees, and if we run out of subtrees, then some processors are inactive.

Also at time = 0, each processor in set three looks at a different subtree starting at depth 8. If there are more processors than subtrees, then some processors are inactive.

This continues for each set down the line, theoretically ad infinitum or until we run out of processors.

At time = 1, set one (the first processor) will have either found a terminal, or have found a subtree. That subtree will have been decoded by a processor in set two. That processor in set two will have found either a terminal or a subtree. That subtree will have been decoded by a processor in set three. That processor will have found a terminal or a subtree...

And so on.

So at time = 1, we just need to query processors as to whether they got a terminal or another subtree, then repeat until we get a terminal.

So our runtime is N * (time to descend a level of the binary tree) + floor(total depth of the terminal / N) * (time to communicate a result between two processors). Because of the way Huffman coding works, we can expect the total depth of the terminal to be small.
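Here is a sequential sketch of the whole thing (my names; the heap-array tree holds symbols at leaves and None at internal nodes, and in hardware each walk over a chunk would already have been precomputed by the corresponding processor set):

def speculative_decode(tree, bits, n=4):
    def walk(idx, chunk):
        for b in chunk:
            if tree[idx] is not None:   # already at a leaf
                break
            idx = 2 * idx + 1 + b       # 0 = left child, 1 = right child
        return idx

    idx = 0
    for start in range(0, len(bits), n):
        idx = walk(idx, bits[start:start + n])   # hand-off to the next set
        if tree[idx] is not None:
            return tree[idx]                     # decoded a symbol
    raise ValueError("ran out of bits mid-codeword")

# codes: a = 0, b = 10, c = 11
tree = [None, "a", None, None, None, "b", "c"]
print(speculative_decode(tree, [1, 1]))   # c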

I'm pretty sure that this works. If anyone asks, I might write this up more formally, but it seems so intuitive that I doubt it's worth the effort.

So in conclusion: Huffman coding is not embarrassingly sequential. In fact, it's neatly parallel - if you have infinite processors and don't care about using them, anyway - and if this wasn't freely published before, it is now.



*Note: I haven't read any of these papers. The technique described here is entirely my own invention. It may or may not be similar to theirs.

Friday, February 23, 2007

Musings on Parallelism in Finite-State Automata

In a recent post, I responded to Berkeley's statement that FSA modeling is "embarrassingly sequential," i.e. will not benefit from parallelization. (If you're unfamiliar with FSA, check out my introduction, which also links to Wikipedia.)

In this article, I'm addressing possible grounds for parallelization in the simpler two families of FSA: stackless finite-state automata (deterministic and non-deterministic finite-state automata, DFAs and NDFAs), and finite-state automata with one stack (push-down automata, PDAs). The final family, two-stack PDAs, is equivalent to Turing machines and is just too large a problem for me to address, at least for now.

I'll try to address these families of machines with regards to the following questions:

  1. Where and when can we parallelize? Are there formal methods for establishing parallelism?
  2. How can we make this parallelism useful, and what applications could benefit?

The First Family: Stackless Finite Automata

"Normal" DFAs/NDFAs don't have stacks; they can recognize regular expressions.

I was initially shocked to hear Berkeley say that FSM modeling can't be parallelized. This is partly because any introductory textbook notes that DFAs and NDFAs are equivalent in power, and uses the word "parallel" to describe the way that NDFAs work. With an NDFA, when we "non-deterministically" pursue two transitions on the same input, we can essentially think of our machine splitting into two identical copies, one of which takes the first transition while the other takes the second. The two machines run in parallel. This would be easy to implement on a parallel-processor machine: you just spawn multiple threads.

In practice, though, this is not how NDFAs get implemented. They usually just get converted to DFAs. There is a known method for converting an NDFA to a corresponding DFA. This can lead to an exponential increase in number of states.
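That conversion is the textbook subset construction - a compact sketch, ignoring epsilon-transitions for brevity (the transition-table shape is my own convention):

def ndfa_to_dfa(delta, start, accepts, alphabet):
    # each DFA state is the *set* of NDFA states we could be in
    start_set = frozenset([start])
    dfa_delta, seen, todo = {}, {start_set}, [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            nxt = frozenset(t for s in current
                            for t in delta.get((s, symbol), set()))
            dfa_delta[(current, symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    dfa_accepts = {s for s in seen if s & accepts}
    return dfa_delta, start_set, dfa_accepts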

But this actually doesn't affect runtime. It seems intuitive that having exponentially fewer states would lead to exponentially faster runtimes, but it doesn't. In an NDFA, parsing 01110 takes five transitions (possibly more if you allow epsilon-transitions). In a DFA, parsing 01110 takes five transitions. This is true for all strings: the number of transitions is determined by the number of input symbols, not the number of states. This is trivial to prove, and I'm amazed I didn't realize it sooner.

There do exist methods for reducing the number of states and transitions in NDFAs. (See also: this one). They are NP-hard, but they could be accomplished.

I don't think the above maximizes non-determinism, but in most cases I would imagine that adding non-determinism is the best way to reduce the number of states. I'm also all but sure that for the same regular expression, the minimal NFA has as many or fewer states than the minimal DFA.

So Berkeley is essentially correct in this case - stackless DFAs and NDFAs are "embarrassingly sequential" regarding runtime. We could, however, reduce memory usage for some problems if we used NDFAs.

So in response to question 2, parallelizing FAs could be useful in the memory domain. Memory usage is going to be an increasingly large problem in the future, as on-chip memory is going to remain expensive even as the number of cores on a chip increases. And in the real world, memory access times are often the limiting factor in execution time for a computer program.

The problem with parallelizing an NDFA the naive way I described earlier is going to be the overhead of process creation and destruction. To solve this, imagine the following: we have an effectively infinite number of cores, each with a small per-core memory and a larger per-chip memory. To simulate an NDFA, execute the following steps:

  1. Start an NDFA-modeling process on each core, and choose one as the "leader" process.
  2. Let each process follow transitions simultaneously on the FA being modeled, reading instructions from the per-chip memory. If a transition is non-deterministic, the leader process divides the affected processes among the different transitions. If a transition is deterministic, every affected process simply follows it.
  3. Loop step 2 until input is exhausted.
  4. Check whether any processes are in an accept state. If so, accept.
This approach will allow us to use our many, many cores to reduce memory usage, which means fewer costly fetches from more "distant" memory (RAM, hard disk) and effectively reduces run time. A clever implementation could remove the concept of a "leader" process so that no core-to-core communication is required. The only problem here, of course, is that the number of cores isn't really infinite, so we will need some ugly tricks to make sure we don't run out of cores.
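In software, the same idea collapses to tracking the set of active states - a minimal sketch (on real manycore hardware, each member of the active set would live on its own core):

def run_ndfa(delta, start, accept, word):
    active = {start}
    for symbol in word:
        # every active state follows all of its transitions at once
        active = {t for s in active for t in delta.get((s, symbol), set())}
        if not active:
            return False                  # every branch died
    return bool(active & accept)

delta = {("S0", "a"): {"S1", "S2"},       # a non-deterministic branch
         ("S1", "b"): {"S1"},
         ("S2", "c"): {"S2"}}
print(run_ndfa(delta, "S0", {"S1"}, "abb"))   # True
print(run_ndfa(delta, "S0", {"S1"}, "acc"))   # False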

Regular expressions get used all over the place, and there are known methods for converting from regular expression to NDFA. It would be mighty neat to have a more memory-lightweight Perl, though I'm not sure whether physics/gaming/business endeavors use a lot of NDFA models. However, if I'm right that there are methods for finding the non-determinism in FAs, then we can automatically seek out non-determinism and take advantage of it.





The Second Family: One-Stack Push-down Automata

One-stack push-down automata (PDAs) and their non-deterministic brethren (NDPDAs) speak context-free grammars (CFGs).

While I haven't seen any method for maximizing non-determinism in an NDPDA, I do know that in real life we see quite a bit of non-determinism when converting from CFGs to NDPDAs, so I haven't dedicated a whole lot of thought to figuring out how to maximize non-determinism or minimize the number of states.

So in response to question 1: We can definitely parallelize using the non-determinism of NDPDAs in a way that reduces run-time.

In response to question 2: It is probably worth parallelizing various implementations of PDAs.

Here's the gist of it: PDAs are used for recognizing items generated by a context-free grammar. They are also used in real life for generating "parse trees" which demonstrate how expressions in a CFG were derived. This is the first step in both compilation and interpretation, and is hugely useful in computer science.

In short, most textbooks (and numerous places on the internet) explain simple methods for converting a CFG to a NDPDA. The non-determinism is easy to take advantage of here in parallel systems.

As an example, say we have the following grammar (highly contrived):

[S] := [A] | [B]
[A] := 010 [A]
[B] := 000
where S, A, and B are non-terminals and 0 and 1 are terminals.

A simple NDPDA would start with the following:

  1. Put S on the stack
  2. Nondeterministically choose between one of the following: popping S and pushing B, or popping S and pushing 0, 1, 0, A.

As input is consumed, one of the two paths will error out, and the other will, if the string is in the language, eventually accept.

While this language is highly contrived indeed, this is much the kind of thing that happens with real-life parsers.
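As a sketch of the parallelism, here is the grammar above explored breadth-first over (stack, input position) configurations; each configuration in the queue is exactly the kind of thing you'd hand to a separate processor:

from collections import deque

grammar = {"S": [["A"], ["B"]],
           "A": [["0", "1", "0", "A"]],
           "B": [["0", "0", "0"]]}

def accepts(word):
    queue = deque([(("S",), 0)])          # (stack, position in input)
    while queue:
        stack, pos = queue.popleft()
        if not stack:
            if pos == len(word):
                return True               # input consumed, stack empty
            continue
        top, rest = stack[0], stack[1:]
        if top in grammar:                # non-terminal: branch on every rule
            for rule in grammar[top]:
                queue.append((tuple(rule) + rest, pos))
        elif pos < len(word) and word[pos] == top:
            queue.append((rest, pos + 1)) # terminal matched: consume it
    return False

print(accepts("000"))   # True
print(accepts("010"))   # False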

It is also important to note here that there are many families of CFGs; some are infinitely recursive in a way that deterministic models can't deal with, and parallel implementations won't be able to deal with them either. So unlike with stackless FSMs, we now have a distinction between parallel and non-deterministic.

So in response to question 2: So long as our context-free languages are not infinitely recursive, we can definitely parallelize. We descend potential parse trees all the time, and often have to reject them and try new ones. Descending multiple parse trees at once can potentially reduce a huge amount of parsing time.

Of course, parsing is often considered a one-time job. Most software gets parsed, compiled, and saved in a compiled form. But I strongly hope that parallel interpretation will help improve the latency of interpreted languages, as interpreted languages are generally more flexible and portable.

It would be particularly nice to see a language designed for maximally parallel interpretation. Further, it would even be possible to non-deterministically make assumptions about code as it is being parsed and attempt various optimized branches, throwing away those which are based on incorrect assumptions. With a sufficient number of cores, this approach might actually see some beneficial returns (whereas currently, optimized just-in-time compilation adds more latency than it removes).


Conclusions

In summary, I think that non-determinism and parallelism have an important but surprisingly complex relationship, and while many finite-state automata problems won't be improved by increasingly parallel technology, the "manycore" movement might speed up some problems (stackless FAs), will definitely speed up others (CFGs/PDAs) and hopefully encourage some new developments (better interpreted languages).

Short Intro to Finite-State Automata (FSA, DFA, and NDFA)

I'm adding this post so I can clarify my writing on the topic of parallelizing finite-state automata (see my previous entry) and so my non-CS friends can potentially try to figure out what I'm so interested in right now.

Wikipedia has a pretty good article, too.

Deterministic Finite Automata: DFAs



An FSA (sometimes "finite-state machine," FSM) is an imaginary sort of machine consisting of a set of states and a set of transitions between them. The Wikipedia article notes that collision detection is an FSA problem, but I've been raised on thinking of FSAs as language-parsing machines. A very simple FSA is a deterministic finite automaton (or DFA). A DFA has a finite number of states, and given an event - like finding a 0 or a 1, an "A" or a "B" - the machine moves to another state.

An example is a machine which operates on the alphabet {A, B} and runs as long as it sees alternating "A"s and "B"s, e.g. ABABAB, but not ABBA - i.e., the machine accepts strings of alternating As and Bs.

We'll always start in a state called S0. In S0, there is a transition on A to S1 and a transition on B to S2. In S1, there is a transition on B to S2; in S2, there is a transition on A to S1. On any other event, we transition to Serror, indicating that the machine is in a "reject" state. So a graph of this might look like this:

S0: A -> S1
B -> S2

S1: B -> S2
A -> Serror

S2: A -> S1
B -> Serror

Serror: A or B -> Serror.

So now if we start in S0 and feed the machine the inputs A, B, A it goes through states:
S0, S1, S2, S1.

If we feed the machine ABBA, it goes through states:
S0, S1, S2, Serror, Serror, and we know ABBA is not in our "language" of alternating As and Bs.

(I am simplifying a little here - technically all DFAs have an "accept" state as well as an error state, and there's an implicit "end of word" character. If you care, imagine that on "end of word", S1 and S2 transition to SAccept.)
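The whole machine fits in a small transition table - a sketch:

delta = {("S0", "A"): "S1", ("S0", "B"): "S2",
         ("S1", "B"): "S2", ("S2", "A"): "S1"}

def run(word):
    state, trace = "S0", ["S0"]
    for ch in word:
        state = delta.get((state, ch), "Serror")  # anything else rejects
        trace.append(state)
    return state in {"S1", "S2"}, trace           # the implicit SAccept check

print(run("ABA"))    # (True, ['S0', 'S1', 'S2', 'S1'])
print(run("ABBA"))   # (False, ['S0', 'S1', 'S2', 'Serror', 'Serror'])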

It is also worth noting here that we can modify DFAs - for example, we can have DFAs with stack behavior (when taking input, they look at the top of an imaginary stack, and then push or pop something on/from the stack). DFAs with stacks are generally called "Push-down automata." PDAs can actually parse languages like C++, or Python, or Algol, or Lisp. They're very powerful indeed.

So in short, a DFA is a very powerful tool, and one that is generally well-researched, and if you believe that any math problem is actually a language problem (hint: it is), then a DFA can actually be created that solves pretty much any problem. And of course, because they are well-researched and easy to model on a computer, they get used quite a lot.


Non-Deterministic Finite-State Automata



There is also a whole family of FSA called non-deterministic finite-state automata, or NDFAs. These machines are somewhat difficult to conceptualize, because at any given time we need to imagine the machine as being in a set of one or more states. We might have a rule like this:

S0: A -> S1
A -> S2
B -> S3

i.e.: When in state S0, if given the input A, we now transition to states S1 and S2.

It is often said that when using an NDFA, these kinds of transitions invite us to think of the machine as splitting into several different machines, and if any of those machines reach an accept state, then the word is in our language.

Tuesday, February 20, 2007

In Response to Berkeley's "Landscape of Parallel Computing Research"

I recently read The Landscape of Parallel Computing Research: A View from Berkeley. It's a really great article, and it's freely available, and if you're interested in the future of high-performance computing or operating systems (or generally any computing that will be affected by the "manycore" movement in processor design) I strongly suggest it.

In the article, they decide on thirteen major computational kernels which will be the basis of most problems in the future, essentially setting up the SPEC of the future. Of the thirteen, twelve are heavily parallel.

The thirteenth problem is modeling finite-state automata (though they use the words "finite-state machine" or FSM). If you don't know what FSAs are, I just posted a brief summary in this blog which should be just enough to help this entry make sense.

So now, back to the article. The Berkeley paper mentions the following in passing:

Perhaps FSMs will prove to be embarrassingly sequential just as [another problem] is embarrassingly parallel. (page 14).
I found this statement to be immensely interesting, and actually a major oversimplification. You can imagine that computing the state of a deterministic finite machine doesn't lend itself to much parallelism. At any given time we are in exactly one state, and after one input, we move to exactly one more state.

And if FSAs are hopelessly sequential, then we have a whole family of problems that can't be solved any faster by an 80-core processor in 2010 than a 1-core processor in 2002.

However, this does not factor in the possibility that our FSA is not a DFA. There is also a whole family of FSA called non-deterministic finite-state automata, or NDFAs. These machines are somewhat difficult to conceptualize, because at any given time we need to imagine the machine as being in a set of one or more states. We might have a rule like this:

S0: A -> S1
A -> S2
B -> S3

i.e.: When in state S0, if given the input A, we now transition to states S1 and S2.

It is a known fact that any NDFA can be trivially converted to a DFA, which is just wonderful for sequential programming. But a well-designed NDFA can be much simpler (far fewer states, fewer steps to completion) than its equivalent DFA. And in fact, NDFAs are often more intuitive for a human to design - and can easily model things like lambda calculus, which requires non-deterministic parsing.

However, NDFAs aren't very popular because:
  • NDFAs can't do anything a DFA can't (because every NDFA has an equivalent DFA),
  • there are very simple, well-researched methods for turning an NDFA into a DFA,
  • there are simple, known rules and methods for manipulating DFAs automatically, and
  • DFAs get along very nicely with sequential programming.
So historically, since DFAs are simple and get along quite well with sequential programming, NDFAs have, in my past education, been treated as sort of a footnote. I'm not even sure there exist known methods for converting a complex DFA into a simpler NDFA (though I'd imagine there are).

In conclusion, I refute Berkeley's suggestion that FSAs are embarrassingly sequential and thus problematic for the future. What we need is a lot more research into a tricky and complex topic, particularly methods for creating simpler NDFAs (this might be very tricky - perhaps to the point of being an AI problem) and tools that allow us to utilize multiple processors to model NDFAs (this should be pretty simple).

More on this in a later post.