In analyzing the performance of parallel algorithms, there are many considerations related to the machine limitations, like for instance, communication network latency and throughput [11]. Latency can be measured as the time it takes a message to leave the source processor and be received at the destination end. Throughput is the amount of data that can be sent on the connection per unit time. These numbers are particularly important for algorithms in distributed memory architectures. They can change the behavior of a given algorithm enough to make it completely impractical.
Throughput is not a big issue for methods based on volume ray casting that perform static data distribution with ray dataflow as most of the communication is amortized over time [10,6,3]. On the other hand, methods that perform compositing at the end of rendering or that have communication scheduled as an implicit synchronization phase have a higher chance of experiencing throughput problems. The reason for this is that communication is scheduled all at the same time, usually exceeding the machines architectural limits. One should try to avoid synchronized phases as much as possible.
Latency is always a major concern, any algorithm that requires
communication pays a price for using the network. The start up
time for message communication is usually long compared to CPU speeds.
For instance, in the iPSC/860 it takes at least 200
s to complete
a round trip message between two processors. Latency hiding is an important
issue in most algorithms, if an algorithm often blocks waiting for
data on other processors to continue its execution, it is very likely this
algorithm will perform badly. The classic ways to hide latency is to
use pipelining or pre-fetching [5].
Even though latency and throughput are very important issues in the design and implementation of a parallel algorithm, the most important issue by far is load balancing. No parallel algorithm can perform well without a good load balancing scheme.
Again, it is extremely important that the algorithm has as few inherently sequential parts as possible if at all. Amadahl's law [5] shows how speed up depends on the parallelism available in your particular algorithm and that any, however small, sequential part will eventually limit the speed up of your algorithm.
Given all the constraints above, it is clear that to obtain good load balancing one wants an algorithm that:
A subtle point in our requirements is in the last phrase, how do we
classify useful work ? We define useful work as the number
of instructions
executed by the best sequential algorithm
available to volume render a dataset. Thus, when a given parallel
implementation uses a suboptimal algorithm, it ends up using a much
larger number of instructions than theoretically necessary as each
processor executes more instructions than
(P
denotes the number of processors). Clearly, one needs to compare with
the best sequential algorithm as this is the actual speed up the
user gets by using the parallel algorithm instead of the sequential
one.
The last point on useful work is usually neglected in papers on parallel volume rendering and we believe this is a serious flaw in some previous approaches to the problem. In particular, it is widely known that given a transfer function and some segmentation bounds, the amount of useful information in a volume is only a fraction of its total size. Based on this fact, we can claim that algorithms that use static data distribution based only on spatial considerations are presenting ``efficiency'' numbers that can be inaccurate, maybe by a large margin.
To avoid the pitfalls of normal static data distribution, we present in the next section a new way to achieve realistic load balancing. Our load balancing scheme, does not scale linearly as others claimed before, but achieves very fast rendering times while minimizing the ``work'' done by the processors.