next up previous
Next: Conclusions and Future Up: Parallel Performance Measures for Previous: Parallel Ray Casting

Performance Analysis

In this section we present a few performance figures of our algorithm and demonstrate that our approach is sound and fast. The main points that we are discussing are: the effectiveness of PARC load balancing, the communication overhead of the compositing scheme, algorithm behavior under different shading models, and overhead as compared to a sequential implementation. The effectiveness of our PARC load balancing was studied extensively in the last section, but to complete our choice of using PARC as our ray casting algorithm, it is interesting to compare its advantages to a more naive ray casting approach where no presence accelerations are adopted.

A conventional ray caster where the rays are cast from start to end by calculating intersections with the bounding box of the object is only slightly different from a PARC ray caster. A PARC ray caster actually does more work than a naive one, as it needs to scan convert and to find and from the Z buffer. The place that a PARC ray caster really gains performance is in the fact that it better approximates the volume bounds. It should be clear that the higher the cost of the shading function per step, the more advantageous it is to calculate these bounds well. In Figure 6, we can see how a PARC based ray caster performs against a naive ray caster under different shading functions. For our purposes we consider ``light'' shading a method that uses 5-10 instructions per sample, ``medium'' a method that uses 50 instructions per sample, and ``heavy'' shading functions require about 300 instructions per sample. Nieh and Levoy [12] have reported that trilinear interpolating a ray sample takes 320 instructions. One can see from Figure 6 that not only times but also the rate of increase of cost decreases as one computes more samples.

  
Figure 6: PARC versus naive ray casting. Times were calculated on a Sparc1000.

The work performed during rendering each ray can be broken into , the initialization work, and , the work performed to calculate and shade the samples along the ray. If perfect load balancing is achieved for every ray, each processor will perform work per ray, that is, the initialization time is replicated for every ray. If , then we can achieve very high scalability with the algorithm, otherwise, as the number of processors increases the amount of work done on the initialization by all the processors gets larger than , thus limiting the performance. This makes optimization of the initialization time critical to the performance of the algorithm.

Initialization time is composed of several components, the most time consuming being the PARC projection time and the transformation time. Right now it takes anywhere from 350 msec to 1600 msec to scan convert a level 4 PARC approximation on the machines we used. We believe scan conversion itself can be done on the order of 20 times faster when the code is re-written in a more efficient way, possibly in i860 specific code. By broadcasting at the beginning only the necessary PARC polygons we can also avoid increasing the number of polygons that need to be scan converted in each processor, and at the same time decrease the memory requirement. Until these changes get incorporated in our code, our timings are going to be around the 1 second mark, even if the rest of the algorithm takes no time at all. However, overall rendering times decreased substantially after we optimized our transformation time by factoring out all common matrix multiplications and inlining the ones inside tight loops.

All of the performance numbers presented in the rest of the section are for the Intel Paragon. The Intel Paragon uses an Intel i860XP, a 50MHz superscalar microprocessor, and a 2D mesh interconnection network. Every processor of the machine actually contains two i860XPs, but only one is used for computation.

Figures 7 and 8 show the average time to composite different image sizes in three different machine configurations and for five different screen sizes, ignoring completely the rendering time. We use a slow start technique for these measures (only when pipelining). It is interesting to compare the figures as one can see that our pipelining method can very well hide the effects of the network and the work done to composite the image. For instance, every processor has to spend around 52 msec to composite a image (only compute time), if we consider 6 processors, it will take over 300 msec of CPU time to generate this image, still with our pipelining approach the user only sees 65 msec as opposed to 330 msec a sequential composite requires.

  
Figure 7: Timing as seen by a user of the arriving of images using our pipelining approach.

  
Figure 8: Timing as seen by a user of the arriving of images using sequential composite.

In Figure 9, we present some of our rendering times. These are rendering times for our first implementation and should not be regarded as what we are expecting for the production level code. One can see from the graph that our algorithm scales well as the number of processors increases. Also our prediction that the higher the shading cost, the better the parallel scalability can be seen from the graphs. We have filtered out the PARC rendering time from the numbers. We expect to speed the PARC projection times up by at least 20 times with the new scan conversion routines, and with a new set of fast PARC projection techniques being designed we anticipate getting scan conversion to under 25 msec. At this time, the best rates attainable by our algorithm are about 1.5 frames/sec on a 32 processor configuration of our Intel Paragon for a image size. This is very competitive and even better than other rendering times published for machines with this number of processors.

  
Figure 9: Rendering times on an Intel Paragon.



next up previous
Next: Conclusions and Future Up: Parallel Performance Measures for Previous: Parallel Ray Casting



Claudio Silva
Thu Apr 20 15:18:55 EDT 1995