In this section we present a few performance figures of our algorithm and demonstrate that our approach is sound and fast. The main points that we are discussing are: the effectiveness of PARC load balancing, the communication overhead of the compositing scheme, algorithm behavior under different shading models, and overhead as compared to a sequential implementation. The effectiveness of our PARC load balancing was studied extensively in the last section, but to complete our choice of using PARC as our ray casting algorithm, it is interesting to compare its advantages to a more naive ray casting approach where no presence accelerations are adopted.
A conventional ray caster where the rays are cast from start to end by
calculating intersections with the bounding box of the object is only
slightly different from a PARC ray caster. A PARC ray caster actually
does more work than a naive one, as it needs to scan convert and to find
and
from the Z buffer. The place that a PARC ray caster
really gains performance is in the fact that it better approximates
the volume bounds. It should be clear that the higher the cost of the
shading function per step, the more advantageous it is to calculate
these bounds well. In Figure 6, we can see how a
PARC based ray caster performs against a naive ray caster under
different shading functions. For our purposes we consider ``light''
shading a method that uses 5-10 instructions per sample, ``medium'' a
method that uses 50 instructions per sample, and ``heavy'' shading
functions require about 300 instructions per sample. Nieh and Levoy
[12] have reported that trilinear interpolating a ray sample
takes 320 instructions. One can see from Figure
6 that not only times but also the rate of
increase of cost decreases as one computes more samples.
Figure 6: PARC versus naive ray casting. Times
were calculated on a Sparc1000.
The work performed during rendering each ray can be broken into
,
the initialization work, and
, the work performed to calculate
and shade the samples along the ray. If perfect load balancing is
achieved for every ray, each processor will perform
work per ray, that is, the initialization time is replicated for
every ray. If
, then we can achieve very high scalability
with the algorithm, otherwise, as the number of processors increases
the amount of work done on the initialization by all the processors
gets larger than
, thus limiting the performance. This
makes optimization of the initialization time critical to the
performance of the algorithm.
Initialization time is composed of several components, the most time consuming being the PARC projection time and the transformation time. Right now it takes anywhere from 350 msec to 1600 msec to scan convert a level 4 PARC approximation on the machines we used. We believe scan conversion itself can be done on the order of 20 times faster when the code is re-written in a more efficient way, possibly in i860 specific code. By broadcasting at the beginning only the necessary PARC polygons we can also avoid increasing the number of polygons that need to be scan converted in each processor, and at the same time decrease the memory requirement. Until these changes get incorporated in our code, our timings are going to be around the 1 second mark, even if the rest of the algorithm takes no time at all. However, overall rendering times decreased substantially after we optimized our transformation time by factoring out all common matrix multiplications and inlining the ones inside tight loops.
All of the performance numbers presented in the rest of the section are for the Intel Paragon. The Intel Paragon uses an Intel i860XP, a 50MHz superscalar microprocessor, and a 2D mesh interconnection network. Every processor of the machine actually contains two i860XPs, but only one is used for computation.
Figures 7 and 8 show the
average time to composite different image sizes in three different
machine configurations and for five different screen sizes, ignoring
completely the rendering time. We use a slow start technique for
these measures (only when pipelining). It is interesting to compare
the figures as one can see that our pipelining method can very well
hide the effects of the network and the work done to composite the
image. For instance, every processor has to spend around 52 msec to
composite a
image (only compute time), if we consider 6 processors,
it will take over 300 msec of CPU time to generate this image,
still with our pipelining approach the user only sees 65 msec
as opposed to 330 msec a sequential composite requires.
Figure 7: Timing as seen by a user of the
arriving of images using our pipelining approach.
Figure 8: Timing as seen by a user of the
arriving of images using sequential composite.
In Figure 9, we present some of our rendering times. These
are rendering times for our first implementation and should not be
regarded as what we are expecting for the production level code. One
can see from the graph that our algorithm scales well as the number of
processors increases. Also our prediction that the higher the shading
cost, the better the parallel scalability can be seen from the graphs.
We have filtered out the PARC rendering time from the numbers. We
expect to speed the PARC projection times up by at least 20 times with
the new scan conversion routines, and with a new set of fast PARC
projection techniques being designed we anticipate getting scan conversion
to under 25 msec. At this time, the best rates attainable by our
algorithm are about 1.5 frames/sec on a 32 processor configuration of
our Intel Paragon for a
image size. This is very competitive
and even better than other rendering times published for machines with
this number of processors.
Figure 9: Rendering times on an Intel Paragon.