arch detail

arch detail

Saturday, August 04, 2007

Kororaa Linux

It's been a long few months!

I've been busy lately, hence the lack of updates. I'm back to work at LSST, integrating the Moving Object Pipeline System (which identifies objects in various images of the sky, then identifies common objects in images, then builds a database of their orbits and predicts their motion) into the overall LSST data framework.

Today, I took a break from it all to install Kororaa Linux. Kororaa is a Gentoo fork which has a LiveCD with full AIGLX/XGL environments, allowing all the nifty "wobbly windows" effects and OS X-like expose features. Since I now have an ATI Radeon card, I thought it might be a fun toy. Best of all, the install feature actually gets a whole running Gentoo system installed in under an hour, and pre-configured for super-slick video tricks.

Kororaa uses only open-source drivers, so it is morally correct, but unfortunately, the open-source Radeon drivers proved to be tragically slow. While I thought the lagginess of the system was related to running off a LiveCD, it didn't go away when I installed to a hard drive.

Furthermore, it didn't install cleanly. GRUB was set up incorrectly, and the xorg.conf was actually not quite right. I was willing to forgive all this - after all, GRUB hard disk nomenclature doesn't always coincide with Linux's, and the whole thing is based on Gentoo, so I could tweak the video drivers all I wanted. Or so I thought.

Unfortunately, 'emerge sync' managed to break Portage.

Three strikes, Kororaa. You were fun while you lasted. I might be back, someday, to manually install portage and try out some new video drivers, but don't count on it.

Sunday, May 13, 2007

Setting up my new Website with Joomla!

So I decided to build jonny-ash.com as a combination professional/personal resource. For employers, my site needs the following:

  • Contact information that is protected from spam-bots
  • Resume
  • Some "about me" information
  • References to this blog, which though unprofessional, does demonstrate my dedication to computer science.
For friends and family, I wanted:

  • Links to all my identities on different web resources (YouTube, MySpace, Flickr)
  • Aggregation (as much as possible) of those identities
This essentially really just boils down to a few use cases. I seriously considered doing it with PHP and calling it a day, but I decided to be adventurous and try getting out of the "coder" mindset and instead using an existing content management system - after all, this "Web 2.0" noise is essentially about using existing software over the web, rather than doing everything yourself. Since Joomla! has a canned installation system on DreamHost (my web host), it seemed like the clear choice.

Now that all is said and done, I'm actually fairly happy with Joomla! as a tool. It's a MySQL-backed system that does a good job of differentiating between style presentation, content, and navigation. You have a user-friendly backend which allows addition of content and drop-in style template changes (totally configurable, if you really care) and all the actual pages are generated with PHP. It's virtually seamless.

Of course, CSS allows us to separate style information from content in HTML, and this is generally good practice that is observed all over. But allowing navigation to remain distinct from content is something really handy, and it's great to be achieve it without using hack solutions (IMHO) like DreamWeaver.

It integrates neatly with Flickr and LiveJournal using Feed Gator as an extension. Blogger actually comes out pretty ugly because there's no "title" heading for Blogger entries, which makes everything pretty unreadable. There are tons of other components and modules that are freely available, as well, and some pretty slick templates.

So though it was overkill, and setup probably took twice what the PHP coding would have taken me, and although Joomla! is occasionally so complex that it gives me headaches, I'm pleased with my decision. Why? Well, mostly because now it takes .5 seconds to install a new theme and completely revamp my website's look. Having spent a summer to accomplish just that much for a college department in the old days, that's pretty danged satisfying.

Saturday, May 12, 2007

jonny-ash.com

Introducing jonny-ash.com.

I've finally invested in some real web hosting via DreamHost.

Wednesday, March 21, 2007

The ugly side of the "Web 2.0" approach

I recently did some work on my Google Summer of Code application. They want to know where my online resume is published. I put one up on Google docs and made it public. Now anyone can spider the Web and read my home address, phone number, and e-mail address. The e-mail addres was intentionally obfuscated from XYZ@gmail.com to "GMail: XYZ" in hopes that I won't be inundated with "VIxAy Gzzzz RA" and "hot nuked grils" in my inbox.

So now I find that I have the following:

So now we have tools like OpenID, but of the above, LiveJournal is the only one that supports it, so that helps, uh, none. Friend-of-a-friend is a machine-friendly method for describing relationships between people (like how we have "friends" on Facebook, and "friends" on LiveJournal, and "friends" flickr.) But again, nobody uses it, so that doesn't help.

So I could list on LJ, Facebook, Blogger, YouTube... etc, every one of this increasingly large list of content hosts. This is annoying, and it also means that a potential employer who reads my blog has to go to my Blogger profile to find my resume, and they'll also find things like my personal music work (which, honestly, isn't something that makes me feel at ease.) And really, I want my friends to know about my YouTube videos and music and such. And I shouldn't have to obfuscate my actual contact information in my resume or expect employers to use Facebook (heh).

So I'm now contemplating going whole-hog before my inbox gets spammed to death and just buying some web space and registering, say, jonny-ash.com. Here's what I want for myself:

Some web service which will keep track of my online identity in all its forms. Something that says "I am X on OpenID, Y on Blogger, Z on Flickr." I also want it to allow me to present myself in a professional way while keeping added goodies for people who want to know about me personally, and if possible, I'd like to restrict access to the personal stuff to friends. And lastly, I'd like it to have actual contact information that is protected from spiders and crawlers that sell my information to spammers.

The thing is, a service which *does* all this is totally possible. Nobody's done it yet.

I could.

Here's how it would work: It should support open standards like OpenID and FOAF. When a user logs in, they can set up an account with links to their various web resources. They should be able to add contact information and make it public or private but hide it behind a "verify that you are human" image so spiders can't read it. They should also be able to hide other information behind FoaF, as in, my "friends" and their "friends" (ideally, whether they are "friends" on Facebook, or LJ, or...) should be able to see my Aikido videos and music recordings. Other people shouldn't.

All of this should be pretty trivial to do, and I think I could get the web hosting for, say, $6 a month. I could sell advertising.

Would anyone be interested? Does anyone care? Should I just set up a personal website for myself and let everyone else do the same...?

Thursday, March 08, 2007

Finally, how to Really Parallelize *Any* Stackless DFA

I posted a little while ago about my idea for parallelizing a Huffman Decode. What I didn't realize at the time is that this could be generalized to parallelize any stackless DFA. Here's a simple writeup of a method for calculating the end state of a stackless DFA with input A, on a computer with infinite processors:

Break down the input, A, into N equal (or nearly equal) substrings, A1, A2... AN.

Next, create groups of processors, G0, G1, G2... GN. GN will be one processor, which will start in the start state of the automaton. Each Gi (where i > 0) will be run on groups of S processors, G1, G2... GN, where S is the number of states in the machine, and each processor in that group will start in a different state of the machine.

Each processor in Gi will process the input Ai.

Call G0 "finished" when it has either reached a "finish" state (e.g. SError, or, say, a "terminal" leaf in a tree). Call Gi where i > 0 "finished" when a processor in Gi-1 has been contacted by a prior group and that processor has finished processing all of its input.

When Gi finishes, it is either in a "finish" state (i.e. SError) or another state. If it is in a "finish" state, return this state. If it is in another state, it contacts the processor in Gi+1 which started in the state where Gi finished. The state returned by GN is the final state of the automaton.
Simple enough, right? Funny that I haven't seen this before - and that I hadn't thought of it sooner.


Realistic Applications

Of course, the real drawback here is that for speedup N, we need (N-1)*(S) processors (where S is the number of states in the machine). If we have very few states and a huge input, this could potentially be useful. However, if we have a huge number of states, we will require an even more huge number of processors to get a speedup. Furthermore, this is going to generate a lot of wasted cycles and drive up electricity bills, which is an increasing concern.

However, if we take this general method as a model for dealing with parallel DFA modeling, then it starts to get pretty useful. For example, in the Huffman Decoding method I described in a prior post, we know that the DFA is actually a tree, so group Gi only need to simulate the subtrees which start at a given depth (i.e. i * [len(A0) + len(A1) + ... + len(Ai-1)]), which is likely to be much smaller than the number of states in the DFA.

More generally, if we know beforehand what states are reachable after a given number of inputs (which can be computed easily), we can often reduce the number of processors which must be used.


Long-term, Pie-in-the-Sky Approaches

The approach I described before is actually very similar to some forms of branch prediction. A certain branch-predicting processor might look at a branch that happens on a conditional, then actually start computing two different trains of logic - one based on the assumption that the conditional will evaluate to "true," and one on the assumption that the conditional will evaluate to "false" (okay, so most processors are a little more complex, but that's the idea). When the conditional is evalutated, one of these strings of computation is thrown out and the results of the other are retained.

Imagine if our comnputer knew that a certain computation-heavy function would return one of X different values, and simply assigned a processor to make all of these assumptions, and the results from one of the X processors was retained. It's not all that different from the method I've described for simulating DFAs.

Granted, at this point, this approach is only useful if we know that the function will return one of a handful of known values. But what if we effectively had an infinite number of processors to throw at this problem? Say that we had a function which returns a 4-byte integer. That's quite a few possibilities, but what if we still just assigned a processor to each of them? If the number of cores in a processor doubles every 18 months, then we might actually be able to do something like this - and while it would be a dumb way to parallelize all our sequential code, it would still be possible to do entirely at the hardware level.

In the meantime, more formal languages allow us to specify the exact range of values a function could return. Again, this could make such a "shotgun" approach as I have described much less wasteful.

D-Wave's Quantum Computing Not Totally Legit

I've been trying to learn about that damned quantum computing demonstration from D-Wave for a while now. Ars Technica's article is the first one that's satisfied me.

The most important detail: Quantum computing, at least as D-WAVE has it, does not mean that NP-complete problems will be solvable in polynomial time. I had heretofore been under the impression that they might be.

Thus, computer scientists will not be rendered obsolete.

Thank you, Ars Technica, for clearing that up.

OpenBox is Great

I've grown disenfranchised with FluxBox. I'd been loyal, but I had some very strange issues with X/FluxBox going into a nasty freeze and the FB community was less-than-helpful.

I've tried Blackbox, and I have to say, it's too damned ugly. Admittedly, I never tried very hard at fixing that.

So today I tried OpenBox. It is a breath of fresh air. It is beautiful. It gets all my fonts right. And the more I learn about it, the better it seems. I think my first feeling of excitement came from reading the OpenBox "about" page:

Openbox works with your applications, and makes your desktop easier to manage. This is because the approach to its development was the opposite of what seems to be the general case for window managers. Openbox was written first to comply with standards and to work properly. Only when that was in place did the team turn to the visual interface.
Finally, a WM with it's head on straight.

The behavior is way more configurable than any WM I've seen before, because anything you can imagine is specifiable through an XML configuration file. While the difficulty of setting up a simple menu with XML is not easily overlooked, tools like denu greatly simplify the process.

One of the things that most impressed me was that they made a point of complying with freedesktop.org standards. If the post-fd.o world is really this well-done, the Free world will finally be a wonderful thing.

Tuesday, February 27, 2007

The Usable Linux Softsynth Roundup

I've recently been blown away by the incredible sound expressiveness of this open-source synthesizer, zynaddsubfx. I've always known it had some really great sound possibilities, but I recently started actually studying its inner workings, and I'm completely amazed. If anyone here is interested in learning about making beautiful noises, definitely check it out.

The biggest deal for me is that the creator has invented something that appears to be a (as far as I know) new type of sound synthesis. He calls it PADSynth, and there's even a complete description of the algorithm, albeit in broken English, on the website. The brilliance of it is that it takes advantage of the fact that an inverse FFT always generates an effectively "windowed" sample, which for audio purposes means completely seamless, click-free looping.

The results are sounds that are simultaneously warm and rich, but with the possibility for bizarre, inorganic harmonics that when transport the listener to a bizarre alien landscape. I seriously lost several hours just listening to chords built from sawtooth waves and missed my Aikido class.

The only problem is that because you have to design the sample, then generate a gargantuan array with an inverse FFT, it's not entirely possible to hear your edits in realtime. I think that a reasonable imitation could be made if a low-priority thread simply recomputed a new sample every time the user made an edit and applied it whenever it was finished, though this might introduce an impossibly large CPU overhead on slower machines. It would make a nice optional feature for us SMP users, though.

I've also been examining a few other open-source synths, but none have impressed me that much.

Bristol has a terrible installer (no autotools, just a script!) and the sounds are somewhat lacking imitations of analog synths of yore.

Ceres3 does spectral modeling, has a dinky install environment, was built for IRIX, seems to be slowly joining the Linux world by way of PPC(!), and has no intent of being more than an educational tool. Worst of all, they even note that the source was "written in Vim." Ick.

Mammut is a similar spectral modeling tool just like Ceres3, but not a real-time synthesizer, per se, and also lacks a maintainer, requires absurd libraries I've never heard of, and uses goofy install tools.

ATS is a better-looking spectral modeler, but no JACK support. D'oh.

RTSynth is nothing but a toy - it for some reason tries to replace JACK and LADSPA with its own incompatible and limited solutions. Also a terrible installer.

Monday, February 26, 2007

A Parallel Huffman Coding Algorithm

In That Berkeley Article, I found the following:

Although 12 of the 13 Dwarfs possess some form of parallelism, finite state machines
(FSMs) look to be a challenge, which is why we made them the last dwarf. Perhaps FSMs
will prove to be embarrassingly sequential just as MapReduce is embarrassingly parallel.
If it is still important and does not yield to innovation in parallelism, that will be
disappointing, but perhaps the right long-term solution is to change the algorithmic
approach. In the era of multicore and manycore. Popular algorithms from the sequential
computing era may fade in popularity. For example, if Huffman decoding proves to be
embarrassingly sequential, perhaps we should use a different compression algorithm that
is amenable to parallelism.
Well, I thought about this for a bit, and I realized there are simple methods for using parallelization for Huffman decoding, provided there are sufficiently many nodes.

A quick Google search trivially demonstrates this fact. However, oddly, I can't find any freely available description of the technique they published.*

If you think about it, though, parallel Huffman Decoding shouldn't be hard if we have a suitably large number of processors, and we assume that communication time between processor elements is very short.

With Huffman coding, variable-length strings of 1s and 0s describe a path down a binary tree. The binary tree is uneven in length and some terminal leaves are closer than others (the closer leaves are for data items which are most commonly accessed).

I'll try to describe how this might work in fairly plain English and not use much formality, because it's pretty simple, really.

We require that our tree be implemented such that we can instantly calculate the location of a given subtree given its depth and which subtree at that level it is (the first, or second, or third...). This is possible if the tree is implemented as an array and unused slots are filled with a NULL character. (I'm not sure about the term for this implementation, but I've used it before).

All we need to do is choose a smallish integer, N, which will decide the amount of parallelism. Let's say our N is 4.

One processor will constitute the first set of workers, 2^4 = 16 processors will constitute the second, and 2^8 will constitute the third, and 2^12 will constitute the fourth, etc.

At time = 0, set one (the main processor) will decode the first N bits (4) of the coded tree.

Also at time = 0, each processor in set two looks at a different subtree starting at depth 4. Since our coded tree is implemented correctly, we do not need to descend the tree in order to find the subtrees, but can access them more or less instantly. There are at least as many processors as subtrees, and if we run out of subtrees, then some processors are inactive.

Also at time = 0, each processor in set three looks at a different subtree starting at depth 8. If there are more processors than subtrees, then some processors are inactive.

This continues for each set down the line, theoretically ad infinitum or until we run out of processors.

At time = 1, set one (the first processor) will have either found a terminal, or have found a subtree. That subtree will have been decoded by a processor in set two. That processor in set two will have found either a terminal or a subtree. That subtree will have been decoded by a processor in set three. That processor will have found a terminal or a subtree...

And so on.

So at time = 1, we just need to query processors as to whether they got a terminal or another subtree, then repeat until we get a terminal.

So our runtime is at N * (time to descend a level of a binary tree) + floor(total depth of the terminal / N) * (time to communicate result between two processors). Because of the way Huffman coding works, we can expect that the total depth of the terminal is small.

I'm pretty sure that this sorks. If anyone asks, I might write this up in a more formal way, but this seems so intuitive that I doubt it's worth the effort.

So in conclusion: Huffman coding is not embarrassingly sequential. In fact, its neatly parallel - if you have infinite processors and don't care about using them, anyway - and if this wasn't freely published before, it is now.



*Note: I haven't read any of these papers. The technique described here is entirely my own invention. It may or may not be similar to theirs.

Friday, February 23, 2007

Musings on Parallelism in Finite-State Automata

In a recent post, I responded to Berkeley's statement that FSA modeling is "embarassingly sequential," i.e. will not benefit from parallelization. (If you're unfamiliar with FSA, check out my introduction, which also links Wikipedia.

In this article, I'm addressing possible grounds of parallelization in the simpler two families of FSA: stackless finite-state automata (deterministic and non-deterministic finite-state automata, DFAs and NDFAs), and finite-state automata with one stack (push-down automata, PDAs). The final family, two-stack PDAs, which are equivalent to Turing machines, and are just too large a problem for me to address, at least for now.

I'll try to address these families of machines with regards to the following questions:

  1. Where and when can we parallelize? Are there formal methods for establishing parallelism?
  2. How can we make this parallelism useful, and what applications could benefit?

The First Family: Stackless Finite Automata

"Normal" DFAs/NDFAs don't have stacks; they can recognize regular expressions.

I was initially shocked to hear that Berkeley would say that FSM modeling can't be parallelized. This is partly because any introductory textbook notes that DFAs and NDFAs are equivalent in power, and uses the word "parallel" to describe the way that NDFAs work. With an NDFA, when we "non-deterministically" pursue two transitions on the same input, we can essentially think of our machine splitting into two idential copies, one of which goes through the first transition, one of which goes through the other. The two machines run in parallel. This would be easy to implement on a parallel-processor machine: you just spawn multiple threads.

In practice, though, this is not how NDFAs get implemented. They usually just get converted to DFAs. There is a known method for converting an NDFA to a corresponding DFA. This can lead to an exponential increase in number of states.

But this actually doesn't affect runtime. It seems intuitive that having exponentially fewer states would lead to exponentially faster runtimes, but it doesn't. In an NDFA , parsing 01110 takes five transitions (possibly more if you allow epsilon-transitions). In a DFA, parsing 01110 takes five transitions. This will be true for all strings. The minimum number of transitions is based on the number of input items. This is trivial to prove and I'm amazed I didn't realize it sooner.

There do exist methods for reducing NDFAs regarding the number of states and transitions. (See also: this one). They are NP-hard, but they could be accomplished.

I don't think the above maximizes non-determinism, but in most cases I would imagine that adding non-determinism is the best way to reduce the number of states. I'm also all but sure that for the same regular expression, the minimal NFA has as many or fewer states than the minimal DFA.

So Berkeley is essentially correct in this case - stackless DFAs and NDFAs are "embarrassingly sequential" regarding runtime. We could, however, reduce memory usage for some problems if we used NDFAs.

So in response to question 2, parallelizing FAs could be useful in the memory domain. Memory usage is going to be an increasingly large problem in the future, as on-chip memory is going to remain expensive even as the number of cores on a chip increases. And in the real world, memory access times are often the limiting factor in execution time for a computer program.

The problem with parallelizing an NDFA the naive way I described earlier is going to be the overhead of process creation and destruction. To solve this, imagine the following: we have an effectively infinite number of cores, each with a small per-core memory and a larger per-chip memory. To simulate an NDFA, execute the following steps:

  1. start an NDFA-modeling process on each core, and choose one as the "leader" process.
  2. let each process follow transitions simultaneously on the FA being modeled. They read instructions from the per-chip memory. If the transition is non-deterministic, the leader process divides the affected processes among the different transitions. If the transition is deterministic,
  3. Loop step 2 until input is exhausted.
  4. Check if any processes are in an accept state. If so, then accept.
This approach will allow use to utilize our many, many cores as a means to reduce memory usage, which means less costly fetches from more "distant" memory (RAM, hard disk) and effectively reduces run time. A clever implementation could remove the concept of "leader" process so no core-to-core communication is required. The only problem here, of course, is that the number of cores isn't really infinite, so we will need some ugly tricks to make sure we don't run out of cores.

Regular expressions get used all over the place, and there are known methods for converting from regular expression to NDFA. It would be mighty neat to have a more memory-lightweight Perl, though I'm not sure whether physics/gaming/business endeavors use a lot of NDFA models. However, if I'm right that there are methods for finding the non-determinism in FAs, then we can automatically seek out non-determinism and take advantage of it.





The Second Family: One-Stack Push-down Automata

One-stack push-down automata (generally, PDAs) and their non-deterministic brethren (NDPDAs) speak context-free grammar (CFGs).

While I haven't seen any method for maximizing non-determinism in a NDPDA, I do know that in real life, we see quite a bit of non-determinism when converting from CFGs to NDPDAs, so I haven't dedicated a whole lot of thought to figuring out how to maximize non-determinism or minimize the number of states.

So in response to question 1: We can definitely parallelize using the non-determinism of NDPAs in a way that reduces run-time.

In response to question 2: It's probably useful to bother parallelizing various implementations of PDAs.

Here's the gist of it: PDAs are used for recognizing items generated by a context-free grammar. They are also used in real life for generating "parse trees" which demonstrate how expressions in a CFG were derived. This is the first step in both compilation and interpretation, and is hugely useful in computer science.

In short, most textbooks (and numerous places on the internet) explain simple methods for converting a CFG to a NDPDA. The non-determinism is easy to take advantage of here in parallel systems.

As an example, say we have the following grammar (highly contrived):

[S] := [A] | [B]
[A] := 010 [A]
[B] := 000
where S, A, B are non-terminals and 0,1 are terminal. (This example is contrived, but it should be good for now).

A simple NDPDA would start with the following:

  1. Put S on the stack
  2. Nondeterministically choose between one of the following: popping S and pushing B, or popping S and pushing 0, 1, 0, A.

When the first input is consumed, one of the two paths will error and the other will, if the string exists in the language, be accepted.

While this language is highly contrived indeed, this is much the kind of thing that happens with real-life parsers.

It is also important to note here that there are many families of CFGs; some are infinitely recursive in a way that deterministic models can't deal with, and parallel implementations won't be able to deal with either. So unlike when we discuss stackless FMs, we now have a distinction between parallel and non-deterministic.

So in response to question 2: So long as our context-free languages are not infinitely recursive, we can definitely parallelize. We descend potential parse trees all the time, and often have to reject them and try new ones. Descending multiple parse trees at once can potentially reduce a huge amount of parsing time.

Of course, parsing is often considered a one-time job. Most software gets parsed, compiled, and saved in a compiled form. But I strongly hope that parallel interpretation will help improve the latency of interpreted languages, as interpreted languages are generally more flexible and portable.

It would be particularly nice to see a language designed for maximally parallel interpretation. Further, it would even be possible to non-deterministically make assumptions about code as it is being parsed and attempt various optimized branches, throwing away those which are based on incorrect assumptions. With a sufficient number of cores, this approach might actually see some beneficial returns (whereas currently, optimized just-in-time compilation adds more latency than it removes).


Conclusions

In summary, I think that non-determinism and parallelism have an important but surprisingly complex relationship, and while many finite-state automata problems won't be improved by increasingly parallel technology, the "manycore" movement might speed up some problems (stackless FAs), will definitely speed up others (CFGs/PDAs) and hopefully encourage some new developments (better interpreted languages).

Short Intro to Finite-State Automata (FSA, DFA, and NDFA)

I'm adding this post so I can clarify my writing on the topic of parallelizing finite-state automata (see my previous entry) and so my non-CS friends can potentially try to figure out what I'm so interested in right now.

Wikipedia has a pretty good article, too.

Discrete Finite Automata: DFAs



An FSA (sometimes "finite-state machine," FSM) is an imaginary sort of machine which is a set of states and a set of transitions to other states. The article notes that collision detection is an FSA problem, but I've been raised on thinking of FSAs as being language parsing machines. A very simple FSA is a discrete finite automata (or DFA). A DFA has a finite number of states, and given an event - like finding a 0 or a 1 an "A" or a "B", the machine moves to another state.

An example is a machine which operates on the alphabet {A, B} and runs as long as it sees alternating "A"s and "B"s, e.g. ABABAB, but not ABBA. I.E., the machine runs on the alphabet {A, B} and accepts the language of zero or more instances of the string "AB."

We'll always start in a state called S0. In S0, there is a possible transition on A to S1 and on B a transition to S2. In S1, there is a transition on B to S2, in S2, we transition to S1. On all other events, transition to Serror, indicating that the machine is in a "reject" state. So a graph of this might look like this:

S0: A -> S1
B -> S2

S1: B -> S2
A -> Serror

S2: A -> S1
B -> Serror

Serror: A or B -> Serror.

So now if we start in S0 and feed the machine the inputs A, B, A it goes through states:
S0, S1, S2, S1.

If we feed the machine ABBA, it goes through states:
S0, S1, S2, Serror, Serror, and we know ABBA is not in our "language" of alternating As and Bs.

(I am simplifying a little here - technically all DFAs have an "accept" state as well as an error state, and there's an implicit "end of word" character. If you care, imagine that on "end of word", S1 and S2 transition to SAccept.)

It is also worth noting here that we can modify DFAs - for example, we can have DFAs with stack behavior (when taking input, they look at the top of an imaginary stack, and then push or pop something on/from the stack). DFAs with stacks are generally called "Push-down automata." PDAs can actually parse languages like C++, or Python, or Algol, or Lisp. They're very powerful indeed.

So in short, a DFA is a very powerful tool, and one that is generally well-researched, and if you believe that any math problem is actually a language problem (hint: it is), then a DFA can actually be created that solves pretty much any problem. And of course, because they are well-researched and easy to model on a computer, they get used quite a lot.


Non-Deterministic Finite-State Automata



There are also a whole family of FSA called non-deterministic finite state automata, or NDFA. These machines are generally somewhat difficult to conceptualize, because at any given time we need to imagine the machine as being in a set of one or more states. We might have a rule like this:

S0: A -> S1
A -> S2
B -> S3

i.e.: When in state S0, if given the input A, we now transition to states S1 and S2.

It is often said that when using an NDFA, these kinds of transitions invite us to think of the machine as splitting into several different machines, and if any of those machines reach an accept state, then the word is in our language.

Tuesday, February 20, 2007

In Response to Berkeley's "Landscape of Parallel Computing Research"

I recently read The Landscape of Parallel Computing Research: A View from Berkeley. It's a really great article, and it's freely available, and if you're interested in the future of high-performance computing or operating systems (or generally any computing that will be affected by the "manycore" movement in processor design) I strongly suggest it.

In the article, they decide on the thirteen major computation kernels which will be the basis of most problems in the future, essentially setting up the SPEC of the future. Of the thirteen, twelve are heavily parallel.

The thirteenth problem is modeling finite-state automata (though they use the words "finite-state machine" or FSM). If you don't know what FSAs are, I just posted a brief summary in this blog which should be just enough to help this entry make sense.

So now, back to the article. The Berkeley paper mentions the following in passing:

Perhaps FSMs will prove to be embarrassingly sequential just as [another problem] is embarrassingly parallel. (page 14).
I found this statement to be immensely interesting, and actually a major oversimplification. You can imagine that computing the state of an discrete finite machine doesn't lend itself to much parallelism. At any given time we are in exactly one state, and after one input, we move to exactly one more state.

And if FSAs are hopelessly sequential, then we have a whole family of problems that can't be solved any faster by an 80-core processor in 2010 than a 1-core processor in 2002.

However, this does not factor in the possibility that our FSA is not a DFA. There are also a whole family of FSA called non-deterministic finite state automata, or NDFA. These machines are generally somewhat difficult to conceptualize, because at any given time we need to imagine the machine as being in a set of one or more states. We might have a rule like this:

S0: A -> S1
A -> S2
B -> S3

i.e.: When in state S0, if given the input A, we now transition to states S1 and S2.

It is a known fact that any NDFA can be trivially converted back to a DFA, which is just wonderful for sequential programming. But a well-designed NDFA can be much simpler (far fewer states, fewer steps towards completion) than an equivalent DFA. And in fact, NDFAs are often more intuitive for a human to design - and can easily model things like lambda calculus, which requires non-deterministic parsing.

However, NDFAs aren't very popular because:
  • NDFAs can't do anything a DFA can't (because every NDFA has an equivalent DFA), and
  • There are very simple, well-researched methods for turning an NDFA into a DFA
  • There are simple known rules and methods for manipulating DFAs automatically
  • DFAs get along very nicely with sequential programming.
So historically, since DFAs are simple and get along quite well with sequential programming, NDFAs have, in my past education, been treated as sort of a footnote. I'm not even sure there exist known methods for converting a complex DFA into a simpler NDFA (though I'd imagine there are).

In conclusion, I refute the statement by Berkeley that FSAs are embarassingly parallel and thus problematic for the future. What we need is a lot more research into a tricky and complex topic, particularly methods for creating simpler NDFA (this might be very tricky - the point of being an AI problem) and tools that allow us to utilize multiple processors to model NDFAs (this should be pretty simple).

More on this in a later post.

Thursday, November 30, 2006

Why I'm Not Learning .NET, yet


visstudio
Originally uploaded by pazuzuzu.
So, I'm looking for employment in Ohio, or via the internet somewhere else. It seems that everybody is looking for a senior C# programmer. Must have history using Visual Studio/ .NET.

So, I think, time to look into learning C#. And Visual Studio. You see the results at right: Microsoft's top hit is a .NET-powered "internal server error."

Also of note: One of the top ten Google results was "Does Visual Studio Rot The Mind?"

Wednesday, October 18, 2006

Output from IBM's xlc for Cell BE

"flopsie_spu.c", line 56.5: 1506-068 (S) Operation between types "vector float" and
"<><><> Illegal tVectorType!!!" is not allowed.


Oh, IBM xlc, you always make me smile.

Saturday, October 14, 2006

The Cell BE: What works, what doesn't

There's no shortage of talk out there on whether or not the Cell will ever take off. I've been fortunate enough to get some time actually coding it, and I think it's important that I get my personal views out there. The Cell really does present some unique challenges and some unique opportunities for the computing world at large, and its fate will likely decide the future of IBM, and possibly Sony and Toshiba as well.

This is likely to be long, so I'm breaking it down into three areas:

An Introduction briefly defines the Cell architecture and what critics have said about it (i.e., hype and response).

Cell in Practice is the section on the nitty-gritties of how to programming for the Cell within the paradigm provided by IBM. It's intended for programmers and based on my personal experiences, and if you're not a programmer, feel free to skip it.

Conclusions is written for anyone who wants to know what the Cell means to them and the future of the computer industry. Programmers might appreciate it, but you shouldn't need any serious technical knowledge to appreciate it.




An Introduction: What has already been said

The Cell architecture has been hyped and misunderstood on a level few bits of technology ever have. I blame, at least partially, the video game industry and mainstream media for making a lot of noise without much understanding.

But here's the overview of what Cell is:

The Cell architecture is IBM/Sony/Toshiba's new joint investement. It's a radical departure from conventional CPU architectures as we know them today. Rather than one individual CPU on the chip, it actually contains a PPE (Power Processor Element, basically a simplified G5) and eight SPEs (Synergistic Processing Elements, vectorized floating-point optimized processors) which are all connected on a high-bandwidth bus (Element Interface Bus, EIB).

Sony/IBM/Toshiba claim that the Cell will make truly next-gen performance feasible, *and* do it with less power intake than conventional CPUs, which have stagnated in recent years. The low power consumption means that the Cell might be well-suited to making next-generation multimedia and encryption technology possible within handheld devices like cell phones, PDAs and consumer electronics devices like HDTVs. And, of course, the huge perfomance increases for video and multimedia made it a logical choice for the Playstation 3. And eventually, it might even take the desktop computer world.

But this radical departure means that to get better performance out of any application, it has to be hand-recoded, paralellized and retuned for the Cell architecture. You can recompile your app for the new architecture, but it will only use the PPE, but it'll actually run slower than on an older G5.

And naturally, this is the main argument made against any hope of the Cell taking off: every architecture in history has taken off because of the amount of legacy software it can run; how else can you explain the way the monstrous Intel x86/Windows paradigm has conquered the world? Any processor that actually lowers performance for most applications - and won't run, say, Windows or MS Office, erm, anytime soon - doesn't seem to be a likely candidate for the future of computing.




Cell in Practice: A practical example of programming the Cell BE

I've had the good fortune of exploring the performance of the Cell for a week now, and it's been a sometimes-frustrating but ultimately rewarding experience. In fact, I used IBMs vectorized 1D FFT to create my own 2D FFT library (not quite finished, hopefully soon to be on SourceForge.) Here's how the real programming paradigm works out, at least for me, and the associated pitfalls:

IBM has released a SDK and system simulator under the Common Public License. The SDK contains a few helpful examples, though the code can be a little perplexing at times. They also will provide you with modified versions of GCC and associated GNU tools, as well as the IBM XLC compilers for Cell, free of charge.

Perhaps most importantly, there are example Makefiles in the SDK. You'll want to look long and hard at those. Since the PPE and SPEs are effectively different architectures, you'll need to write and compile separate code for both of them, then *link* the SPE program into the PPE program so that the PPE can spawn SPE threads on the 8 SPEs.

This is where we start to run into some of the hassles of coding for the SPEs. The SPEs keep everything in "Local Store", which is basically an L2 Cache. The size of the LS on the Cell BE is 256KB/SPE, but that can change on future Cell models. Furthermore, you'll have to explicitly manage movement between main memory and Local Stores - and you'll have to factor in the size of your running code when you calculate your memory requirements. There is nothing like virtual memory to help you. I understand that if you've done programming for a GPU, this isn't anything new, but it was very new to me.

You'll have to figure out what you want those SPEs doing, of course, and you need to put them to good use if you want to get any speedup out of this architecture. Fortunately for me, a 2D FFT is well-suited; just break up the image into 8 column-wise segements, then distribute them from main memory to SPE Local Stores with IBM's Direct Memory Access (DMA) library and process them one column at a time on the SPEs, then move them back to the main store using DMA.

A quick pitfall of using IBM's DMA: SPEs cannot access any data that is not allocated in a global scope in C. This took quite a while for me to figure out, and it means that you'll have to get past the "good programming practices" verboten on global allocation and statically allocate a huge buffer for communicating with the SPUs.

To get peak performance for DMA transfers, IBM claims that carefully using their other C extensions like __attribute__ ((alignment (128))) will speed up DMAs and vector ops, but I didn't personally find this to be the case - the compilers seemingly took care of this for me, as my benchmarking experiments didn't seem to show any difference when I attempted this level of optimization.

The next step to optimizing this one was to use a vectorized 1D FFT. IBM provides you with C/C++ extensions for declaring vector data types and a bunch of vector math functions that do a straight 1-to-1 mapping from C functions like vec_add or spu_add straight to the corresponding assembly functions. Getting them set up is definitely a tad confusing, though; you'll need to include header files and make sure to use the IBM-distributed compilers with the correct option flags, and any erroneous usage of vector calls will result in completely perplexing compiler messages.

Fortunately for me, IBM provide a vectorized 1D FFT, unhelpfully provided under the misnomer fft_2d, and they give an example in the documentation.

Shockingly, their example of fft_2d usage in the users guide would actually cause an "unexpected internal compiler error" in spuxlc, the XLC compiler. Fortunately, when I informed IBM on the developer's forums, they were quick to respond that the example was bad (it required more memory than one could possibly allocate on an SPE) and promise that the compiler would be fixed by the next drop, but I never got any word on an improved example for the users guide. It was the first hint I've seen that IBM must be getting desperate - the Cell Blade server is on the market, and it's a little late to play the "it's just pre-release!" game with users now.

After some initial testing, it became obvious that DMA was a huge portion of my execution time, because of the fact that I was waiting on data every time I requested it, my DMA transfers were becoming a huge burden. Requesting 4-6 accessed at a time and then waiting on the data only when I absolutely needed it ended up giving me something along the lines of a 6x speedup.

Now comes the next phase of development: Starting over. Why? Well, in my case, it turns out that the vectorized FFT approach means I don't have memory for double-buffering, and it sure looks like vectorization in this case is a lot less helpful than lowering my DMA wait times. So call my library incomplete for now.

For all this work, was it worth the effort?

Well, consider my surprise when I compared my dinky, incomplete FFT lib's performance to the performance of FFTW (the Fastest Fourier Transform in the West) running on Itanium (at NCSA's TeraGrid site) and on my local 2.8 GHz Pentium 4.

The results?

I beat them both. Yep. My first attempt at any kind of FFT library, and also my first attempt at programming the Cell (and this from a guy who's never done graphics *or* embedded systems, the two things that would make you well-qualified for Cell programming) was enough to beat out the highest stage of FFT evolution on Intel architectures. And my library only stands to get better.

Bearing in mind that FFTs are one of the most heavily-used tools in science and multimedia (both sound and video), this is no small news. Not everyone gets to knock out the champ on their first shot.

On the other hand, this is purely an in-core FFT - FFTW will scale to huge resolutions independent of L2 cache size. My version will have to discard vectorization for anything above 2048 x 2048 (which I might do anyway) and then will need to completely rewritten for anything that exceeds 16K x 16K. This is entirely because of the fact that the SPEs are completely restricted to local store, and dealing with memory usage beyond those bounds is completely the responsibility of the programmer.

So if you want to use this library for larger-sized 2D FFTs, you'll need to either wait for IBM to release a Cell with a larger Local Store - size is not specified in the design specs; so Local Store size could increase or decrease at any time - or use another library.

There are some other funny caveats for programmers, and with some funny implications for the consumer world.

First: Statically compiled libraries are the only ones you can use in the SPEs. This is because the libraries *and* your code *and* your data all have to live on the Local Store. The biggest restriction on performance in the SPEs, at least in my experience, is the size of the Local Store. Thus, choice of libraries and choice of compiler are both going to have significant effects on how much space you have left to allocate for actually doing work. This is certainly going to require some interesting programming approaches, and I highly doubt it will make for more readable code!

Further, it also means that if all your library, bar, contains a statically compiled library foo, you'll have to recompile bar every time foo is upgraded. And you can imagine that this will quickly get out of control.

Second: Be careful about your assumptions regarding how much data space you can use in the SPEs. The size of the Local Store could change at any time with new models of Cells. This might make your life easier if Local Stores get bigger. It might also mean that you run into a lot of trouble if your library needs to run on a scaled-down Cell for portable devices.

Third: Don't kid yourself about using a high-level language for optimized coding on the Cell. It just won't ever happen. However, *do* seriously consider using a high-level language like Python that is highly flexible and can use C libraries. Someone will soon write extensions to Python that can take care of things like heavy-duty floating point math on the Cell, and Python should be able to use them quite nicely. Though admittedly, I'm not positive what this means for all those globally-declared buffers you'll need for communication between PPE and SPEs, but realistically, someone smarter than me *has* to find a fix.

This means that while programming for the Cell is a trip into the world of assembly, and many complain about it being a big step backwards, it actually enforces the separation between low-level functionality and high-level functionality. Expect to see existing Python applications be the first ones ported to Cell and utilizing its real horsepower - not long after C programmers bring heavy-hitting, SPE-aware libraries to the Cell.




Conclusions: What you need to know if you're contemplating a purchase

Sony, IBM and Toshiba are certainly hoping the Cell will become the dominant architecture of the coming decades. I don't see this happening. I'm not sure we'll be seeing Cell processors in PDAs and Cell phones and HDTVs like they predict. I don't think the Cell will ever reach the critical mass of popularity where market forces make it dramatically cheaper, as we saw with Intel processors.

However, for serious scientific programming, expect this thing to take some serious market share, and for good reason. Critics point out that there is a major barrier for adoption - much of the code needed to harness the power of Cell needs to be rewritten, so if you're buying a Cell, factor in the hidden cost of hiring one or more programmers to make use of it.

Make sure to consult with programmers first. Knowing your application and knowing what existing Cell-optimized libraries it might use could mean the difference between weeks or months of development time and a few hours. And programming for the Cell is easy and rewarding enough that you can realistically expect some Cell-optimized libraries within the next year or so.

Since IBM released their SDK under the Common Public License, any code that uses their SDK can be made public under the same license, but can also be compiled into proprietary applications. Hopefully we'll soon see a GPL alternative for the Free-as-in-speech world, but for now, if you need to get a job done (for cash or for academia) you have the freedom to do either with the tools IBM has given you.

The Cell for home use is on the horizon, but if you're an overclocking geek looking to get Playstation 3 graphics out of your home desktop: well, don't hold your breath.

Friday, October 06, 2006

Cell simulator screenshot


cell-sim-running
Originally uploaded by pazuzuzu.
Not-high-enough-res screenshot of a running Cell BE virtual machine. A bit of a drudge to use, but damned if I'm not giddy. It's basically like being ssh-ed in and X-forwarding from a very slow machine - pictured here actually *booting* Fedora Core 5.

Very neat, though it doesn't model penalties for memory access (which is, as far as I can tell, the biggest threat to Cell performance) and/or the actual PPU (Power-PC unit, the core of the machine), just the behavior of the SPUs (the eight floating-point, graphics-card-esque units attached to the PPU).

The Five Stages of Beginning Cell BE Programming

I'm only at the LSST for another month, and Tim, my boss, has suggested that I use the remainder of my time not on our prototype code but on playing with the Cell architecture.

My response to this has been broken into several stages:

1) Joyful excitement. Goodbye, UML! Goodbye, boring software engineering assignments! Hello to the exploring the hot-topic design of the new millenium!

2) Confusion! The media was not meant for computer science. Google "Cell architecture" and you'll hear nonsense that will blow your mind - "its the desktop on a chip!" "It's a global grid infrastructure!" "It's SkyNet!" and of course, the obvious backlash - "It's just a G4!" "It's a bunch of hype!" "It's Sony's downfall and IBM's Itanic!" Suddenly, finding realistic information not provided by IBM is impossible, and it's nearly as hard not to love or hate this thing before even understanding it.

3) Puzzled looks at the IBM website. Corporate websites are bad enough, but you'd think IBM would get the picture. There's a simulator, once you create an IBM ID; and, oh, wait, the simulator is part of the SDK? And half the necessary files are on the Barcelona Supercomputing Center website? And installing Fedora (Core 5 only!) is *required* to get it to run?

4) Frustration. The install script never works correctly. The Barcelona page is completely spazzy. Leave the download script running overnight and hope that it will finish before coming back to work. Arrrrghhhh.

5) Renewed excitement. You have to write separate programs for the separate components of the CPU? You have to be running the *simulated* version just to get any kind of terminal I/O (i.e. printf) from the 8 SPUs on the chip that do most of the computation? Is this a trip to the future or into the assembly programming days? Can this really ever fly? Will the install script ever work?

Friday, September 29, 2006

Musings on the Flaws of *NIX Shells

I spend most of my time during the day running shell commands. They're the fastest way to get complex (and even very simple) tasks done. I do plenty of writing code in other, more serious languages, but I'm somewhat hesitant to admit that I've never been good with any shell. I'm getting there, but boy, is it frustrating. And it's taken me years to deal with that frustration instead of running off to stronger tools like, say, C.

But you know what? That shouldn't have to be the case.

Let me start by bringing up an issue that's always irked me. Say you're using BASH (though tcsh, csh, zsh are all going to do pretty much the same thing here). You want to, say, cat a file with spaces in its name. (For the uninitiated, 'cat' takes a variable number of arguments, which are names of files, and then prints their contents.) So call our file
a b c.txt
As you probably know, shells differentiate (tokenize) arguments based on spaces (using spaces as delimiters). So
cat a b c.txt
will give errors - it tries to print the file a, then the file b, then the file c.txt. Whoops.

Fortunately, the original UNIX shell designers realized that you might want to use spaces in filenames. Good call. After all, we're not DOS users, here.
cat "a b c.txt"
does what we wanted. Great, you say. Problem solved.

So how about we do this?
MYVAR="a b c.txt"
cat $MYVAR
We again get contents of a, b, and c.txt, not of our target file. Whoops again!

Okay, so that makes sense. Because those quotes escaped the spaces for the MYVAR= command, and then were discarded, setting the value of MYVAR to
a b c.txt
. Makes sense.

So let's fix that code! We'll put quotes inside the quotes and get a quoted string for MYVAR! Now the shell will extract those quotes and send cat the parameter (singular) that we wanted.
MYVAR="\"a b c.txt\""
cat $MYVAR
Uh-oh! Instead, cat got three parameters again - this time they were
"a
b
c.txt"
Now, this makes sense, frustrating though it is (at first). Because after all, the real meaning of double-quotes is "spaces inside of here (and several other special characters we won't mention) are to be taken literally, not interpreted by the shell."

Now, say that we're shell designers and we're thinking this stuff up. The above is a pretty good decision. After all, if it weren't the case, how would you set MYVAR to the values
"a
b
c.txt"
anyway? It might be important to do so.

However, we've created a syntactic hole for ourselves now. How can we have our shell be smart enough to understand when the quotes in a variable are supposed to be interpreted by the shell? Tough question.

So there's another route - manually escape the spaces using the escape character \ (backslash). So we do this:
MYVAR=a\ b\ c.txt
cat $MYVAR
Whoohoo! We did it! Go, you mighty gods of UNIX, you Stephen Bournes out there, you really did use good sense.

But now say that the program foobar gives us the output

"a b c.txt"
"h i j.txt"
and we need, say, cat both those files. Okay. That's fine. We could try
cat `foobar`
but no! Unwise, you say. It's the same problem we had before. The shell doesn't interpret the output of foobar, silly, it just passes each space-delimited token on to cat, so we try to cat the following files:

"a
b
c.txt"
"h
i
j.txt"
and at best get complaints about those files not existing, at worst (and this is a serious problem) get the wrong files cat-ed. Yike!

Okay, so you say, there must be a way around this. And there is. You just have to learn the wonder of sed!
cat `foobar | sed -e s/ /\\\\\ /g -e s/\"//g`
(at least, I think that's it.) Right? What could be easier and more logical? And of course, sed isn't part of the shell, so boy, you're in a lot of trouble right now if you want this to work in a more limited environment. And make sure it's GNU sed; otherwise it might not work quite as expected. Oh, and make sure it's in your path.

The problem I'm beating over everyone's head here is this: To really accomplish everything you need in a shell, you need to learn more than a shell, you need to learn UNIX tradition. That's great and all, I guess, but wouldn't it be better if the shell could really do everything you need? Like maybe process all those strings in a more intelligent way without external tools like perl, or sed, or Python?

In short, this is how I think things were (ideally) meant to work:
  • The shell is your operating environment. It should be relatively self-contained.

  • (Non-shell) programs are things that we should run because they do something to files, or to the system, or to each other. In this sense, we can think of them as state-changers; and as such, we can think of their jobs as side effects, as computer science folk say. (They might also do calculation; they're better than the shell for this because they're generally faster, but we'll deal with that later.)

  • Conversely, the shell should take care of expressive issues - the issues of how to direct programs to do what we want, and to coordinate them to do our bidding. (Computation, the lambda calculus teaches us, is an expressive issue, but as noted before, programs might want to do that, too).


The above leads me to conclude that what we really need is a new way of thinking about shells, and a new way to make shells fulfill the above criteria. Right now they don't, and please, don't tell me that making sed into a BASH-builtin will change this issue.

And as you might have figured out from the above distinction between side effects and expression/computation, my mind is looking in the direction of functional programming - and its ill-loved, underappreciated bastard-child, LISP.

Tuesday, September 12, 2006

strace: A Great Tool I Never Noticed

I ran across this article completely accidentally, and now I'm gawking at myself and wondering why I didn't really know about this stuff. Make your life easier, Linux folksen:
All about Linux: strace - A very powerful troubleshooting tool for all Linux users

Wednesday, September 06, 2006

iSight Experiences

I recently picked up an iSight for my girlfriend and nabbed a spare one from work (which, of course, will be returned... eventually) and started exploring the wonderful world of iChat A/V.

All in all, I'm quite impressed - it's one of the few handy-dandy out-of-the-box Mac OS X things that makes me really very glad I have a G4 laying around. It's easy enough for everybody, and it actually makes this temporary long-distance situation between C. Rose and I quite a bit easier. I enjoy it for audio over a cell phone, and the video quality is shockingly good.

I have heard that it works much better between to OS X boxes than between OS X and Windows (or as the guy at the store said, "Mac to PC" - seriously, people, PC is not the inverse of Mac - but that's a rant for another time), so I assume that there must be some major scheduling hacks going on in the kernel to make it so silky-smooth.

The one significant problem, though - and it's a doozy - is that iChat now sporadically causes my Linksys Wireless-G router to take a hard dive. The lights on the front panel keep blinking, but connection between LAN and internet goes away. At first I blamed Cox, the famously questionable service provider, but I found that power cycling my router results in the problems going away.

This is a pretty perplexing situation, and thus far I haven't Googled up any explanation. Will post on further information.