next up previous
Next: Parallel Ray Casting Up: Parallel Performance Measures for Previous: Performance Considerations

Load Balancing

 

This section explains our new approach to load balancing, which is based on the PARC (polygon assisted ray casting) algorithm [2]. The section presents a short description of PARC and describes different approaches to using it as a load balancing technique. PARC can be characterized as a presence acceleration technique [4], like the octree decompositions of Levoy [8]. Instead of stepping through the whole volume for rendering, only the parts that contain relevant data are used, this can save an enormous amount of rendering time, not only in volume stepping, but also because it greatly decreases the number of compositing and shading calculations one needs to perform.

The rational behind PARC is simple. As one needs to calculate the integral during rendering, PARC finds tighter bounds for and , thus, substantially lowering the rendering time. PARC does this by enclosing the volume with a rough polygonal approximation, which is transformed and scan converted into front and back Z buffers. For each ray the front one gives us a conservative estimate for , and the back one gives the estimate.

In order to skip over empty space inside volumes, our implementation of PARC uses pre-calculated cubes aligned with the primary axes to bound cubes inside the volume. For each particular view, we scan convert the cubes into a Z buffer (implemented in software) to obtain closer bounds on the intervals where the ray integrals need to be calculated. This method achieves speeds comparable with the fastest high quality volume renderers.

One can specify the number of cubes in the subdivision of the original dataset. This determines the accuracy of the and estimates; the higher the number of cubes, the closer to the exact intersection points they are. If the estimates are accurate, we perform less work on the ray, but on the other hand the scan conversion time is higher as the number of cubes grows very fast. For instance, one can ask for a level 4 PARC approximation, this means the dataset is partitioned to intervals in each of the coordinate directions, for a total of 4096 small cubes. Depending on the low and high threshold specified, one usually gets a much lower number of such cubes. For instance, with a level 4 PARC approximation of a CT 3D reconstructed head at a 20-200 threshold, only 38% of the cubes are non-empty.

The cubes generated by PARC are the basic units for our load balancing. As the cubes are very close approximation of the amount of work one has to perform during ray tracing, we use the number of cubes a processor has as the measure of how much work is performed by that particular processor. Let P denote the number of processors, and the number of cubes processor i has. To achieve a good load balancing we need a scheme that minimizes the following heuristic function for a partition :

 

The main problem in implementing this approach is that for ray casting to be efficient the dataset part of a particular processor need to be contiguous. Not only this makes compositing easier but it also reduces the number of intersection calculations required. Once one decides what shape to assign to each processor, one just needs to use either Equation 1 or a variation of it. For the rest of the paper we describe an implementation of our load balancing scheme that uses slabs, which are consecutive slices of the dataset aligned on two major axes, as the basic partition blocks of the dataset for load balancing. Slabs are very easy to implement and we show that they provide a good sense of load balance. In the case of slabs, the PARC algorithm produces an ordered list of number: , which are the number of cubes in each slab. We need to find indices pairs , that minimizes the following expression:

 

The problem of computing the optimal (as defined by our heuristic choice) load balance partition indices can be solved naively as follows. We can compute all the possible partitions of the integer n, where n is the number of slabs, into P numbers, where P is the number of processors. For example, if n = 5, and P = 3, then 1 + 1 + 3 represents the solution that gives the first slab to the first processor, the second slab to the second processor and the remaining three slabs to the third processor. Enumerating all possible partitioning to get the optimal one is a feasible solution but can be very computationally expensive for large n and P. At this time we have a Prolog implementation of a slightly revised algorithm. Instead of calculating the minmax Equation 2, we choose the permutation with the smallest square difference from the average.

In order to show how well our approach works in practice, let us work out the example of using our load balancing example to divide the neghip dataset (the negative potential of a high-potential iron protein of resolution) for four processors, using a level 4 PARC decomposition with a 10 to 200 value threshold. After running PARC we get the following 16 numbers, one for each slab, out of the 1570 total cubes:{12, 28, 61, 138, 149, 154, 139, 104, 106, 139, 156, 151, 129, 62, 29, 13}. The naive approach of other volume renderers has been to assign an equal part of the volume to each processor, resulting in the following partition: {12+28+61+138=239, 149+154+139+104=546, 106+139+156+151=552, 129+62+29+13=233}, where processors 2 and 3 have twice as much work than processor 1 and 4. Our approach based on Equation 2 gives us {388, 397, 401, 384}, clearly a much more balanced solution.

One can see that some configurations will yield better load balancing than others but this is a limitation of the particular space subdivision one chooses to implement, the more complex the subdivision one allows, the better load balancing but the harder it is to implement a suitable load balancing scheme and the associated ray caster. Figure 1 plots the examples just described for the naive approach. Figure 2 shows how well our load balancing scheme works for a broader set of processor arrangements. By comparing both plots, one can see that our algorithm generates much smoother curves, thus leading to better load balancing.

  
Figure 1: The graph shows the number of cubes per processor under naive load balancing.

  
Figure 2: Load balancing measures for our algorithm. The graph shows the number of cubes the processor receives in our algorithm.

  
Figure 3: Naive load balancing on the Paragon. The graph shows the actual rendering times for 4 processors using the naive load balancing.

  
Figure 4: Our load balancing on the Paragon. The graph shows the actual rendering times for 4 processors using our load balancing.

Figures 3 and 4 show the rendering times on the Intel Paragon, showing the correlation between the number of cubes a processor has and the amount of work it has to perform. By comparing these graphs and those in Figures 1 and 2, one can observe that our load balancing is effective and accurate, compared to the naive approach of equally subdividing the dataset. If one was calculating a single image, the total rendering time of the image subparts would be the maximum of every processor plus the compositing time. As will be seen in the next section, we use a pipeline approach to optimize image generation performance, by amortizing compositing over time.



next up previous
Next: Parallel Ray Casting Up: Parallel Performance Measures for Previous: Performance Considerations



Claudio Silva
Thu Apr 20 15:18:55 EDT 1995