arch detail

Wednesday, October 18, 2006

Output from IBM's xlc for Cell BE

"flopsie_spu.c", line 56.5: 1506-068 (S) Operation between types "vector float" and
"<><><> Illegal tVectorType!!!" is not allowed.


Oh, IBM xlc, you always make me smile.

Saturday, October 14, 2006

The Cell BE: What works, what doesn't

There's no shortage of talk out there on whether or not the Cell will ever take off. I've been fortunate enough to get some time actually coding it, and I think it's important that I get my personal views out there. The Cell really does present some unique challenges and some unique opportunities for the computing world at large, and its fate will likely decide the future of IBM, and possibly Sony and Toshiba as well.

This is likely to be long, so I'm breaking it down into three areas:

An Introduction briefly defines the Cell architecture and what critics have said about it (i.e., hype and response).

Cell in Practice gets into the nitty-gritty of how to program for the Cell within the paradigm provided by IBM. It's intended for programmers and based on my personal experiences; if you're not a programmer, feel free to skip it.

Conclusions is written for anyone who wants to know what the Cell means to them and the future of the computer industry. Programmers might appreciate it, but you shouldn't need any serious technical knowledge to appreciate it.




An Introduction: What has already been said

The Cell architecture has been hyped and misunderstood on a level few bits of technology ever have. I blame, at least partially, the video game industry and mainstream media for making a lot of noise without much understanding.

But here's the overview of what Cell is:

The Cell architecture is IBM/Sony/Toshiba's new joint investment. It's a radical departure from conventional CPU architectures as we know them today. Rather than one individual CPU on the chip, it actually contains a PPE (Power Processor Element, basically a simplified G5) and eight SPEs (Synergistic Processing Elements, vectorized floating-point-optimized processors), all connected by a high-bandwidth bus (the Element Interconnect Bus, EIB).

Sony/IBM/Toshiba claim that the Cell will make truly next-gen performance feasible, *and* do it with lower power consumption than conventional CPUs, which have stagnated in recent years. The low power consumption means the Cell might be well-suited to bringing next-generation multimedia and encryption technology to handheld devices like cell phones and PDAs, and to consumer electronics like HDTVs. And, of course, the huge performance increases for video and multimedia made it a logical choice for the Playstation 3. Eventually, it might even take the desktop computer world.

But this radical departure means that to get better performance out of any application, it has to be hand-recoded, parallelized and retuned for the Cell architecture. You can recompile your app for the new architecture, but it will only use the PPE - and it'll actually run slower than on an older G5.

And naturally, this is the main argument made against any hope of the Cell taking off: every architecture in history has taken off because of the amount of legacy software it can run; how else can you explain the way the monstrous Intel x86/Windows paradigm has conquered the world? Any processor that actually lowers performance for most applications - and won't run, say, Windows or MS Office, erm, anytime soon - doesn't seem to be a likely candidate for the future of computing.




Cell in Practice: A practical example of programming the Cell BE

I've had the good fortune of exploring the performance of the Cell for a week now, and it's been a sometimes-frustrating but ultimately rewarding experience. In fact, I used IBM's vectorized 1D FFT to create my own 2D FFT library (not quite finished; hopefully soon to be on SourceForge). Here's how the real programming paradigm works out, at least for me, and the associated pitfalls:

IBM has released an SDK and system simulator under the Common Public License. The SDK contains a few helpful examples, though the code can be a little perplexing at times. They will also provide you with modified versions of GCC and the associated GNU tools, as well as the IBM XLC compilers for Cell, free of charge.

Perhaps most importantly, there are example Makefiles in the SDK. You'll want to look long and hard at those. Since the PPE and SPEs are effectively different architectures, you'll need to write and compile separate code for both of them, then *link* the SPE program into the PPE program so that the PPE can spawn SPE threads on the 8 SPEs.
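To make that concrete, here's roughly what a minimal hand-rolled Makefile looks like. This is a sketch from memory, with hypothetical file names - the SDK's samples actually do this through a shared make.footer, the embedding tool's name has varied between drops (embedspu vs. ppu-embedspu), and remember that make wants hard tabs before the commands:

    # Sketch of a standalone Cell Makefile; file names are hypothetical.
    # 1. Build the SPE program with the SPU cross-compiler.
    fft_spu: fft_spu.c
            spu-gcc -O3 -o fft_spu fft_spu.c

    # 2. Wrap the SPU executable in a PPU-linkable object that exposes
    #    a spe_program_handle_t symbol (here named fft_spu).
    fft_spu_embed.o: fft_spu
            embedspu fft_spu fft_spu fft_spu_embed.o

    # 3. Link the embedded SPU image into the PPE program; -lspe pulls
    #    in the library with spe_create_thread() and friends.
    fft: fft_ppu.c fft_spu_embed.o
            ppu-gcc -O3 -o fft fft_ppu.c fft_spu_embed.o -lspe

The handle that the embedding step generates is what your PPE code passes to spe_create_thread() to actually fire off the SPE threads.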

This is where we start to run into some of the hassles of coding for the SPEs. The SPEs keep everything in "Local Store," which is basically an L2-cache-sized memory that you manage by hand. The LS on the current Cell BE is 256KB per SPE, but that can change in future Cell models. You have to explicitly manage all movement between main memory and the Local Stores - and you have to factor in the size of your running code when you calculate your memory requirements. There is nothing like virtual memory to help you. I understand that if you've done programming for a GPU, this isn't anything new, but it was very new to me.

You'll have to figure out what you want those SPEs doing, of course, and you need to put them to good use if you want to get any speedup out of this architecture. Fortunately for me, a 2D FFT is well-suited: just break the image up into 8 column-wise segments, distribute them from main memory to the SPE Local Stores with IBM's Direct Memory Access (DMA) library, process them one column at a time on the SPEs, then move them back to main memory using DMA.
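In SPE code, that DMA dance looks something like this. It's a stripped-down sketch: process_column() is a placeholder for the FFT pass and the names are mine, but the mfc_* calls are the real interface from the SDK's spu_mfcio.h:

    #include <spu_mfcio.h>

    #define TAG        1
    #define COL_FLOATS 2048
    #define COL_BYTES  (COL_FLOATS * sizeof(float))

    /* Local Store buffer for one column; both ends of a DMA want
       at least 16-byte (ideally 128-byte) alignment. */
    static float column[COL_FLOATS] __attribute__((aligned(128)));

    extern void process_column(float *col);  /* placeholder */

    void do_column(unsigned long long ea)    /* main-memory address */
    {
        /* Pull the column from main memory into Local Store. */
        mfc_get(column, ea, COL_BYTES, TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();   /* block until the data lands */

        process_column(column);

        /* Push the results back out to main memory. */
        mfc_put(column, ea, COL_BYTES, TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();   /* block until the store is done */
    }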

A quick pitfall of using IBM's DMA: the SPEs cannot access any data that isn't allocated at global scope in C. This took me quite a while to figure out, and it means you'll have to set aside the good-programming-practices taboo on global allocation and statically allocate a huge buffer for communicating with the SPEs.

To get peak performance from DMA transfers, IBM claims that careful use of their other C extensions, like __attribute__((aligned(128))), will speed up DMAs and vector ops. I didn't personally find this to be the case - the compilers seemingly took care of it for me, as my benchmarking experiments didn't show any difference when I attempted this level of optimization.
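For the record, the main-memory end of all that DMA traffic ended up looking something like this in my code (the names here are made up, but the two essentials are real: the global scope and the alignment attribute):

    /* PPE-side C: global (static-duration) storage gives the buffer a
       stable effective address the SPEs can DMA against, and the
       attribute puts it on a 128-byte boundary as IBM recommends. */
    #define IMG_DIM 2048
    static float image[IMG_DIM * IMG_DIM] __attribute__((aligned(128)));

The address of that buffer is what each SPE thread gets handed (for instance through spe_create_thread()'s argp parameter) so it knows where to DMA from.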

The next step in optimizing this one was to use a vectorized 1D FFT. IBM provides C/C++ extensions for declaring vector data types, plus a bunch of vector math functions like vec_add or spu_add that map 1-to-1 onto the corresponding assembly instructions. Getting them set up is definitely a tad confusing, though; you'll need to include the right header files and make sure you're using the IBM-distributed compilers with the correct option flags, and any erroneous usage of vector calls will result in completely perplexing compiler messages.
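Just for flavor, here's about the simplest thing you can do with the SPU intrinsics (spu_add and the vector float type come from spu_intrinsics.h; the rest is my own toy code):

    #include <spu_intrinsics.h>

    /* Add two float arrays four elements at a time. Each spu_add call
       compiles down to a single SPU floating-add instruction over a
       16-byte vector. n must be a multiple of 4 here. */
    void vadd(vector float *a, vector float *b, vector float *out, int n)
    {
        int i;
        for (i = 0; i < n / 4; i++)
            out[i] = spu_add(a[i], b[i]);
    }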

Fortunately for me, IBM provides a vectorized 1D FFT, unhelpfully shipped under the misnomer fft_2d, and gives an example of its use in the documentation.

Shockingly, their example of fft_2d usage in the user's guide would actually cause an "unexpected internal compiler error" in spuxlc, the XLC compiler for the SPU. Fortunately, when I reported this on the developer forums, IBM was quick to respond that the example was bad (it required more memory than one could possibly allocate on an SPE) and promised that the compiler would be fixed by the next drop - but I never got any word on an improved example for the user's guide. It was the first hint I've seen that IBM must be getting desperate: the Cell Blade server is on the market, and it's a little late to play the "it's just pre-release!" game with users now.

After some initial testing, it became obvious that DMA made up a huge portion of my execution time: because I was waiting on data every time I requested it, my DMA transfers were becoming a huge burden. Issuing 4-6 requests at a time and then waiting on the data only when I absolutely needed it ended up giving me something along the lines of a 6x speedup.
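The pattern is just to give each in-flight transfer its own tag group and only block on the tag whose data you need right now. A rough sketch of the idea (again with process_column() as a stand-in, and pretending each column is a contiguous chunk in main memory for simplicity):

    #include <spu_mfcio.h>

    #define NBUF       4
    #define COL_FLOATS 2048
    #define COL_BYTES  (COL_FLOATS * sizeof(float))

    static float buf[NBUF][COL_FLOATS] __attribute__((aligned(128)));

    extern void process_column(float *col);   /* placeholder */

    void do_columns(unsigned long long ea, int ncols)
    {
        int i;
        /* Kick off the first NBUF gets, one tag group per buffer. */
        for (i = 0; i < NBUF && i < ncols; i++)
            mfc_get(buf[i], ea + i * COL_BYTES, COL_BYTES, i, 0, 0);

        for (i = 0; i < ncols; i++) {
            int b = i % NBUF;
            /* Wait only for the one transfer we need right now. */
            mfc_write_tag_mask(1 << b);
            mfc_read_tag_status_all();

            process_column(buf[b]);

            /* Write the result back, then queue the get for the column
               NBUF ahead. The fenced mfc_getf orders it after the put
               in the same tag group, so the buffer can't be clobbered
               before the put finishes. */
            mfc_put(buf[b], ea + i * COL_BYTES, COL_BYTES, b, 0, 0);
            if (i + NBUF < ncols)
                mfc_getf(buf[b], ea + (i + NBUF) * COL_BYTES,
                         COL_BYTES, b, 0, 0);
        }

        /* Drain any outstanding puts before returning. */
        mfc_write_tag_mask((1 << NBUF) - 1);
        mfc_read_tag_status_all();
    }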

Now comes the next phase of development: Starting over. Why? Well, in my case, it turns out that the vectorized FFT approach means I don't have memory for double-buffering, and it sure looks like vectorization in this case is a lot less helpful than lowering my DMA wait times. So call my library incomplete for now.

For all this work, was it worth the effort?

Well, imagine my surprise when I compared my dinky, incomplete FFT lib's performance to that of FFTW (the Fastest Fourier Transform in the West) running on Itanium (at NCSA's TeraGrid site) and on my local 2.8 GHz Pentium 4.

The results?

I beat them both. Yep. My first attempt at any kind of FFT library, and also my first attempt at programming the Cell (and this from a guy who's never done graphics *or* embedded systems, the two things that would make you well-qualified for Cell programming) was enough to beat out the highest stage of FFT evolution on Intel architectures. And my library only stands to get better.

Bearing in mind that FFTs are one of the most heavily-used tools in science and multimedia (both sound and video), this is no small news. Not everyone gets to knock out the champ on their first shot.

On the other hand, this is purely an in-core FFT - FFTW will scale to huge resolutions independent of L2 cache size. My version will have to discard vectorization for anything above 2048 x 2048 (which I might do anyway) and will need to be completely rewritten for anything that exceeds 16K x 16K. This is entirely because the SPEs are restricted to their Local Stores, and dealing with memory usage beyond those bounds is completely the programmer's responsibility.

So if you want to use this library for larger 2D FFTs, you'll need to either wait for IBM to release a Cell with a larger Local Store (the size is not fixed in the design specs, so it could increase or decrease at any time) or use another library.

There are some other funny caveats for programmers, with some interesting implications for the consumer world.

First: Statically compiled libraries are the only ones you can use in the SPEs. This is because the libraries *and* your code *and* your data all have to live on the Local Store. The biggest restriction on performance in the SPEs, at least in my experience, is the size of the Local Store. Thus, choice of libraries and choice of compiler are both going to have significant effects on how much space you have left to allocate for actually doing work. This is certainly going to require some interesting programming approaches, and I highly doubt it will make for more readable code!

Further, it also means that if your library bar contains a statically compiled library foo, you'll have to recompile bar every time foo is upgraded. You can imagine how quickly this will get out of control.

Second: Be careful about your assumptions regarding how much data space you can use in the SPEs. The size of the Local Store could change at any time with new models of Cells. This might make your life easier if Local Stores get bigger. It might also mean that you run into a lot of trouble if your library needs to run on a scaled-down Cell for portable devices.

Third: Don't kid yourself about using a high-level language for optimized coding on the Cell. It just won't ever happen. However, *do* seriously consider using a high-level language like Python that is highly flexible and can use C libraries. Someone will soon write extensions to Python that can take care of things like heavy-duty floating point math on the Cell, and Python should be able to use them quite nicely. Admittedly, I'm not positive what this means for all those globally-declared buffers you'll need for communication between the PPE and SPEs, but realistically, someone smarter than me *has* to find a fix.

This means that while programming the Cell is a trip into the world of assembly, and many complain that it's a big step backwards, it actually enforces the separation between low-level and high-level functionality. Expect existing Python applications to be among the first ported to the Cell to use its real horsepower - not long after C programmers bring heavy-hitting, SPE-aware libraries to the platform.




Conclusions: What you need to know if you're contemplating a purchase

Sony, IBM and Toshiba are certainly hoping the Cell will become the dominant architecture of the coming decades. I don't see this happening. I'm not sure we'll be seeing Cell processors in PDAs and cell phones and HDTVs like they predict. I don't think the Cell will ever reach the critical mass of popularity where market forces make it dramatically cheaper, as we saw with Intel processors.

However, for serious scientific programming, expect this thing to take some serious market share, and for good reason. Critics point out that there is a major barrier for adoption - much of the code needed to harness the power of Cell needs to be rewritten, so if you're buying a Cell, factor in the hidden cost of hiring one or more programmers to make use of it.

Make sure to consult with programmers first. Knowing your application and knowing what existing Cell-optimized libraries it might use could mean the difference between weeks or months of development time and a few hours. And programming for the Cell is easy and rewarding enough that you can realistically expect some Cell-optimized libraries within the next year or so.

Since IBM released their SDK under the Common Public License, any code that uses their SDK can be made public under the same license, but can also be compiled into proprietary applications. Hopefully we'll soon see a GPL alternative for the Free-as-in-speech world, but for now, if you need to get a job done (for cash or for academia) you have the freedom to do either with the tools IBM has given you.

The Cell for home use is on the horizon, but if you're an overclocking geek looking to get Playstation 3 graphics out of your home desktop: well, don't hold your breath.

Friday, October 06, 2006

Cell simulator screenshot


[Screenshot: cell-sim-running, originally uploaded by pazuzuzu]
Not-high-enough-res screenshot of a running Cell BE virtual machine. A bit of a slog to use, but damned if I'm not giddy. It's basically like being ssh-ed into and X-forwarding from a very slow machine - pictured here actually *booting* Fedora Core 5.

Very neat, though it doesn't model penalties for memory access (which is, as far as I can tell, the biggest threat to Cell performance), nor the actual PPU (the PowerPC unit at the core of the machine) - just the behavior of the SPUs (the eight floating-point, graphics-card-esque units attached to the PPU).

The Five Stages of Beginning Cell BE Programming

I'm only at the LSST for another month, and Tim, my boss, has suggested that I use the remainder of my time not on our prototype code but on playing with the Cell architecture.

My response to this has been broken into several stages:

1) Joyful excitement. Goodbye, UML! Goodbye, boring software engineering assignments! Hello to exploring the hot-topic design of the new millennium!

2) Confusion! The media was not meant for computer science. Google "Cell architecture" and you'll hear nonsense that will blow your mind - "It's the desktop on a chip!" "It's a global grid infrastructure!" "It's SkyNet!" - and of course, the obvious backlash - "It's just a G4!" "It's a bunch of hype!" "It's Sony's downfall and IBM's Itanic!" Suddenly, finding realistic information not provided by IBM is impossible, and it's nearly as hard not to love or hate this thing before even understanding it.

3) Puzzled looks at the IBM website. Corporate websites are bad enough, but you'd think IBM would get the picture. There's a simulator, once you create an IBM ID; and, oh, wait, the simulator is part of the SDK? And half the necessary files are on the Barcelona Supercomputing Center website? And installing Fedora (Core 5 only!) is *required* to get it to run?

4) Frustration. The install script never works correctly. The Barcelona page is completely spazzy. Leave the download script running overnight and hope that it will finish before coming back to work. Arrrrghhhh.

5) Renewed excitement. You have to write separate programs for the separate components of the CPU? You have to be running the *simulated* version just to get any kind of terminal I/O (i.e. printf) from the 8 SPUs on the chip that do most of the computation? Is this a trip to the future or into the assembly programming days? Can this really ever fly? Will the install script ever work?