next up previous
Next: Performance Analysis Up: Parallel Performance Measures for Previous: Load Balancing

Parallel Ray Casting

 

The version of our parallel PARC-based volume ray caster described here uses the NX/2 library on the iPSC/860 and the Paragon, although previous versions also ran under TCP/IP on workstations. We plan to release a production level version of this code on the iPSC/860, Paragon, PVM, network workstations (TCP/IP) together with a distribution version of VolVis [1].

In order to avoid the processors having direct access to the dataset description files, we chose to broadcast once the necessary information for the rendering, like the dataset, processor assignments, transfer functions, and so on, and have all the processors synchronize during this phase. Clearly, this may make our implementation unsuitable for someone that needs to generate only a single image, specially because some machines (like the iPSC/860) have slow processor access to NFS mounted files. The best scenario is one where the datasets are generated on the parallel machine, and not moved in and out at all.

After initialization, user commands representing different viewing angles are sent to all the processors by broadcast messages. Only information like transformation matrices and image sizes are sent at this time to minimize the communication cost per image. In order to avoid flooding the parallel machine with requests, a feedback synchronization technique is used. It basically balances the requests rate with the machine power available. The flow of messages in the algorithm is shown in Figure 5.

  
Figure 5: Overview of communication flow in the algorithm. Arrow width represents the expected bandwidth necessary in each communication link.

The feedback synchronization techniques we use are based on work by Van Jacobson [7], who designed a set of techniques to avoid congestion in TCP/IP networks. We use a variation of his slow-start and round-trip-time estimation technique, where the host slowly sends requests and adaptively changes the rate of requests with the feedback it receives from the network. This is implemented by having the host keep the number of outstanding image render requests, and setting a maximum on this number based on the number of processors and the amount of memory each has. At the start of the computation the host begins sending image requests to the processors, and for every image received it sends two requests to the processors until the maximum is achieved. Also the host keeps a running average of the time taken to compute an image, computed as , where is the new estimate, was the initial estimate, M is the time measured in the last image computation, and is an amortization constant. By changing we can make the host more or less responsive to changes in rendering times. By using this procedure, when this time increases the host can adaptively decrease the rate of requests or increase the rate if the processors begin computing images faster.

In the computing processors, a set of working requests are queued and serviced on demand. Basically, there are two different kinds of requests, rendering requests and compositing requests. The first type of request is received directly from the host (where supposedly the user is waiting for images to show up), while the second comes from other slab neighboring processors. The computing processor keeps servicing both types of requests, by picking a message from each queue.

While servicing a rendering request the processor allocates enough memory for it, renders it, and keeps the rendered image around until a compositing request for that particular image comes from its respective neighboring processor, then composites its part of the image and sends it over to the other neighboring processor, and continues working on a new rendering request. The last processor on a chain, sends the whole image back to the user's workstation. If a compositing request is received for an image that is not rendered yet we take the approach of computing it right away rather then delaying it as this could double our memory requirements for images. Once requests are serviced the memory is immediately freed.

This approach is simple and effective. One of the clear advantages is that if we disregard the message and synchronization overhead for a moment, we see that we are maximizing the computation overlap among processors and getting a much better utilization of the communication network as messages are being sent during the whole course of the image computation time instead of just at a certain point in time.

One may claim that other, tree based, compositing schemes [9] may yield better results, however, the drawbacks of these schemes (low processor utilization during compositing and high network utilization during the peak of compositing) are major. Even though the tree approach would give a final image in time steps, it still needs asymptotically the same number of messages. Therefore, it does not save any computation time, but it actually wastes it when some of the processors become idle.

The use of a pipelined compositing approach, where images are asynchronously generated and saved in buffers, requires the use of the feedback synchronization technique to avoid increasing the memory overhead without bounds. An interesting side effect of this technique is that our algorithm automatically adjusts itself to the rendering times of the particular machine and/or configuration being used, like the number of processors and network performance.



next up previous
Next: Performance Analysis Up: Parallel Performance Measures for Previous: Load Balancing



Claudio Silva
Thu Apr 20 15:18:55 EDT 1995