This section explains our new approach to load balancing, which is based on the PARC (polygon assisted ray casting) algorithm [2]. The section presents a short description of PARC and describes different approaches to using it as a load balancing technique. PARC can be characterized as a presence acceleration technique [4], like the octree decompositions of Levoy [8]. Instead of stepping through the whole volume for rendering, only the parts that contain relevant data are used, this can save an enormous amount of rendering time, not only in volume stepping, but also because it greatly decreases the number of compositing and shading calculations one needs to perform.
The rational behind PARC is simple. As one needs to calculate the
integral
during rendering, PARC finds tighter bounds for
and
,
thus, substantially lowering the rendering time. PARC does this by
enclosing the volume with a rough polygonal approximation, which is
transformed and scan converted into front and back Z buffers. For each
ray the front one gives us a conservative estimate for
, and the
back one gives the
estimate.
In order to skip over empty space inside volumes, our implementation of PARC uses pre-calculated cubes aligned with the primary axes to bound cubes inside the volume. For each particular view, we scan convert the cubes into a Z buffer (implemented in software) to obtain closer bounds on the intervals where the ray integrals need to be calculated. This method achieves speeds comparable with the fastest high quality volume renderers.
One can specify the number of cubes in the subdivision of the original
dataset. This determines the accuracy of the
and
estimates; the higher the number of cubes, the closer to the exact
intersection points they are. If the estimates are accurate, we
perform less work on the ray, but on the other hand the scan
conversion time is higher as the number of cubes grows very fast. For
instance, one can ask for a level 4 PARC approximation, this
means the dataset is partitioned to
intervals in each of
the coordinate directions, for a total of 4096 small cubes.
Depending on the low and high threshold specified, one usually gets a
much lower number of such cubes. For instance, with a level 4 PARC
approximation of a CT 3D reconstructed head at a 20-200 threshold,
only 38% of the cubes are non-empty.
The cubes generated by PARC are the basic units for our load
balancing. As the cubes are very close approximation of the amount of
work one has to perform during ray tracing, we use the number of cubes
a processor has as the measure of how much work is performed by that
particular processor. Let P denote the number of processors, and
the number of cubes processor i has. To achieve a good load
balancing we need a scheme that minimizes the following
heuristic function for a partition
:
The main problem in implementing this approach is that for ray casting
to be efficient the dataset part of a particular processor need to be
contiguous. Not only this makes compositing easier but it also reduces
the number of intersection calculations required. Once one decides
what shape to assign to each processor, one just needs to use either
Equation 1 or a variation of it. For the rest of the paper
we describe an implementation of our load balancing scheme that uses
slabs, which are consecutive slices of the dataset aligned on two
major axes, as the basic partition blocks of the dataset for load
balancing. Slabs are very easy to implement and we show that they
provide a good sense of load balance. In the case of slabs, the
PARC algorithm produces an ordered list of number:
, which are the number of cubes in each slab. We need to
find indices pairs
, that minimizes the following
expression:
The problem of computing the optimal (as defined by our heuristic choice) load balance partition indices can be solved naively as follows. We can compute all the possible partitions of the integer n, where n is the number of slabs, into P numbers, where P is the number of processors. For example, if n = 5, and P = 3, then 1 + 1 + 3 represents the solution that gives the first slab to the first processor, the second slab to the second processor and the remaining three slabs to the third processor. Enumerating all possible partitioning to get the optimal one is a feasible solution but can be very computationally expensive for large n and P. At this time we have a Prolog implementation of a slightly revised algorithm. Instead of calculating the minmax Equation 2, we choose the permutation with the smallest square difference from the average.
In order to show how well our approach works in practice, let us work
out the example of using our load balancing example to divide the
neghip dataset (the negative potential of a high-potential iron
protein of
resolution) for four processors, using a level 4
PARC decomposition with a 10 to 200 value threshold. After running
PARC we get the following 16 numbers, one for each slab, out of the
1570 total cubes:{12, 28, 61, 138, 149, 154, 139, 104, 106, 139, 156,
151, 129, 62, 29, 13}. The naive approach of other volume renderers
has been to assign an equal part of the volume to each processor,
resulting in the following partition: {12+28+61+138=239,
149+154+139+104=546, 106+139+156+151=552, 129+62+29+13=233}, where
processors 2 and 3 have twice as much work than processor 1 and 4. Our
approach based on Equation 2 gives us {388, 397,
401, 384}, clearly a much more balanced solution.
One can see that some configurations will yield better load balancing than others but this is a limitation of the particular space subdivision one chooses to implement, the more complex the subdivision one allows, the better load balancing but the harder it is to implement a suitable load balancing scheme and the associated ray caster. Figure 1 plots the examples just described for the naive approach. Figure 2 shows how well our load balancing scheme works for a broader set of processor arrangements. By comparing both plots, one can see that our algorithm generates much smoother curves, thus leading to better load balancing.
Figure 1: The graph shows the number of cubes per
processor under naive load balancing.
Figure 2: Load balancing measures for our algorithm.
The graph shows the number of cubes the processor receives in our
algorithm.
Figure 3: Naive load balancing on the Paragon. The
graph shows the actual rendering times for 4 processors using the
naive load balancing.
Figure 4: Our load balancing on the Paragon.
The graph shows the actual rendering times for 4 processors using our
load balancing.
Figures 3 and 4 show the rendering times on the Intel Paragon, showing the correlation between the number of cubes a processor has and the amount of work it has to perform. By comparing these graphs and those in Figures 1 and 2, one can observe that our load balancing is effective and accurate, compared to the naive approach of equally subdividing the dataset. If one was calculating a single image, the total rendering time of the image subparts would be the maximum of every processor plus the compositing time. As will be seen in the next section, we use a pipeline approach to optimize image generation performance, by amortizing compositing over time.