Although 12 of the 13 Dwarfs possess some form of parallelism, finite state machines
(FSMs) look to be a challenge, which is why we made them the last dwarf. Perhaps FSMs
will prove to be embarrassingly sequential just as MapReduce is embarrassingly parallel.
If it is still important and does not yield to innovation in parallelism, that will be
disappointing, but perhaps the right long-term solution is to change the algorithmic
approach. In the era of multicore and manycore, popular algorithms from the sequential
computing era may fade in popularity. For example, if Huffman decoding proves to be
embarrassingly sequential, perhaps we should use a different compression algorithm that
is amenable to parallelism.
Well, I thought about this for a bit, and I realized there are simple methods for parallelizing Huffman decoding, provided there are sufficiently many processors. A quick Google search shows that others have already published on parallel Huffman decoding. Oddly, though, I can't find any freely available description of the techniques they published.*
If you think about it, though, parallel Huffman decoding shouldn't be hard, provided we have a suitably large number of processors and communication between processing elements is very fast.
With Huffman coding, variable-length strings of 1s and 0s describe paths down a binary tree. The tree is uneven in depth, so some terminal leaves are closer to the root than others (the closer leaves correspond to the most frequently used symbols).
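For reference, here is a minimal sketch of the ordinary sequential decode that the rest of this post tries to parallelize; the tree, symbols, and bitstring are made up for illustration.

```python
# Sequential Huffman decoding for reference: each bit chooses a child,
# and decoding one symbol means walking from the root down to a leaf.
# The tree and bitstring are made-up examples.

# A node is either ("leaf", symbol) or ("node", left_child, right_child).
tree = ("node",
        ("leaf", "e"),              # code "0": common symbol, shallow leaf
        ("node",
         ("leaf", "t"),             # code "10"
         ("node",
          ("leaf", "a"),            # code "110"
          ("leaf", "z"))))          # code "111": rare symbol, deep leaf

def decode_one(bits, pos, tree):
    """Walk the tree one bit at a time until a leaf is reached."""
    node = tree
    while node[0] == "node":
        node = node[2] if bits[pos] == "1" else node[1]
        pos += 1
    return node[1], pos  # decoded symbol, position of the next codeword

bits = "0101110"
pos, out = 0, []
while pos < len(bits):
    sym, pos = decode_one(bits, pos, tree)
    out.append(sym)
print("".join(out))  # -> "etze"
```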
I'll try to describe how this might work in fairly plain English and not use much formality, because it's pretty simple, really.
We require that the tree be implemented such that we can instantly calculate the location of a given subtree from its depth and its position at that level (the first, or second, or third...). This is possible if the tree is stored as an array with unused slots filled with a NULL marker. (I believe the usual term for this is an implicit, or heap-style, array representation.)
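Here is a minimal sketch of that indexing arithmetic, reusing the toy tree from above; the "*" marker for internal nodes and the None slots are my own conventions for illustration.

```python
# Implicit (heap-style) array layout: the k-th node from the left at
# depth d lives at index 2**d - 1 + k, and the children of index i live
# at 2*i + 1 and 2*i + 2. "*" marks internal nodes, None marks unused slots.

def index_of(depth, k):
    """Array index of the k-th subtree root (0-based) at a given depth."""
    return 2**depth - 1 + k

# The toy tree from above, flattened into an array.
tree = ["*",                                    # depth 0: root
        "e", "*",                               # depth 1
        None, None, "t", "*",                   # depth 2
        None, None, None, None, None, None, "a", "z"]  # depth 3

assert tree[index_of(2, 2)] == "t"  # third node at depth 2, found instantly
assert tree[index_of(3, 7)] == "z"  # eighth node at depth 3
```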
All we need to do is choose a smallish integer, N, which determines the degree of parallelism. Let's say our N is 4.
One processor will constitute the first set of workers; 2^4 = 16 processors will constitute the second; 2^8 = 256 the third; 2^12 = 4096 the fourth; and so on. In general, set s needs one processor for each subtree root at depth (s - 1) * N, i.e. 2^((s - 1) * N) of them.
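For a sense of scale: with N = 4, covering terminals down to depth 16 takes 1 + 16 + 256 + 4096 = 4,369 processors in total, so the processor count grows exponentially with the depth covered.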
At time = 0, set one (the main processor) will decode the first N bits (here, 4) of the encoded stream by descending from the root of the tree.
Also at time = 0, each processor in set two speculatively decodes the next N bits starting from a different subtree root at depth 4. Since the tree is stored in the array layout described above, we do not need to descend the tree to find these subtrees; we can jump to them more or less instantly. There are at least as many processors as subtrees, and if we run out of subtrees, the extra processors are simply inactive.
Also at time = 0, each processor in set three does the same from a different subtree root at depth 8. Again, if there are more processors than subtrees, some processors are inactive.
This continues for each set down the line, theoretically ad infinitum or until we run out of processors.
At time = 1, set one (the first processor) will have found either a terminal or a subtree. That subtree will already have been decoded by a processor in set two. That processor in set two will have found either a terminal or a subtree. That subtree will already have been decoded by a processor in set three. That processor will have found a terminal or a subtree...
And so on.
So at time = 1, we just need to query processors as to whether they got a terminal or another subtree, and follow the chain until we reach a terminal.
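To make the whole thing concrete, here is a one-CPU simulation of the scheme as I've described it (my own sketch, not code from any of those papers); the results dictionary stands in for messages between processor sets, and I use N = 2 only so the toy tree is deep enough to involve two sets.

```python
# A one-CPU simulation of the scheme for decoding a single codeword.
# Each "processor" speculatively decodes N bits from its own subtree
# root; the results dict stands in for messages between processor sets.

N = 2  # bits decoded per set; small so the toy tree needs two sets

# The toy tree again, in the implicit array layout from above.
tree = ["*", "e", "*", None, None, "t", "*",
        None, None, None, None, None, None, "a", "z"]

def descend(tree, root, bits):
    """Decode up to len(bits) bits starting at array index `root`.
    Returns ("leaf", symbol) or ("node", index of the subtree reached)."""
    i = root
    for b in bits:
        i = 2 * i + 2 if b == "1" else 2 * i + 1
        if tree[i] != "*":               # reached a terminal leaf
            return ("leaf", tree[i])
    return ("node", i)

bits = "111"  # one codeword; should decode to "z"

# Time = 0: set k speculatively decodes bits[k*N:(k+1)*N] from EVERY
# live subtree root at depth k*N (one processor per root in hardware).
results = {}  # (set number, root index) -> outcome
for k in range((len(bits) + N - 1) // N):
    depth = k * N
    for j in range(2**depth):            # j-th subtree root at this depth
        root = 2**depth - 1 + j
        if root < len(tree) and tree[root] == "*":  # skip leaves/unused slots
            results[(k, root)] = descend(tree, root, bits[depth:depth + N])

# Time = 1: chase the chain of results until a terminal appears.
outcome, k = results[(0, 0)], 0
while outcome[0] == "node":
    k += 1
    outcome = results[(k, outcome[1])]
print(outcome[1])  # -> "z"
```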
So our runtime is about N * (time to descend one level of the tree) + floor(D / N) * (time to communicate a result between two processors), where D is the total depth of the terminal. Because of the way Huffman coding works, we can expect D to be small for the most common symbols.
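To put rough numbers on that (an illustration with made-up costs): for a terminal at depth D = 10 and N = 4, the scheme costs 4 level-descents plus floor(10/4) = 2 inter-set communications, versus 10 sequential level-descents, so it wins whenever communication is cheap relative to tree traversal.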
I'm pretty sure that this works. If anyone asks, I might write this up in a more formal way, but this seems so intuitive that I doubt it's worth the effort.
So in conclusion: Huffman decoding is not embarrassingly sequential. In fact, it's neatly parallel (if you have near-infinite processors and don't care how efficiently you use them, anyway), and if this wasn't freely published before, it is now.
*Note: I haven't read any of these papers. The technique described here is entirely my own invention. It may or may not be similar to theirs.