As researchers and engineers use volume rendering to study complex physical and abstract structures they need a coherent, powerful, easy to use visualization tool, that lets them interactively change all the necessary parameters. Unfortunately, even with the latest volume rendering acceleration techniques running on top-of-the-line workstations, it still takes a few seconds to a few minutes to volume render images. This is clearly far from interactive. With the advent of parallel machines, scanners and instrumentation, larger and larger datasets (typically from 32MB to 512MB) are being generated that would not even fit in memory of a workstation class machine. Even if rendering time is not a major concern, big datasets may be expensive to hold in storage, and extremely slow to transfer to a typical workstations over network links.
These problems lead to the question of whether the visualization should be performed directly on the parallel machines that generate the simulation data or sent over to a high performance graphics workstation for post-processing. First, if the visualization software was integrated in the simulation software, there would be no need for extra storage and visualization could be an active part of the simulation. Second, large parallel machines can render these datasets faster than workstations can, possibly in real-time or at least giving the possibility of achieving interactive rates. Finally, if real integration between the simulation and the visualization tool is possible, one could interactively ``steer'' the simulation, and possibly terminate simulations that are wrong or uninteresting at an earlier stage instead of performing long and expensive archiving operations for the generated datasets. In this paper we focus on the architecture and performance measures of visualization algorithms that are running directly on the parallel machines.
Clearly, an algorithm that runs on a parallel machine has to be efficient and should be able to make good use of the computing power. A conservative tradeoff between scalability and actual processing speed is very important. Also, the algorithm has to be space efficient, and for the case of a distributed memory MIMD machine, memory duplication should be avoided. In this paper we propose a space efficient, fast parallel algorithm that addresses these issues. This algorithm will be the basis of a visualization library in the molds just described using the VolVis system [1] as its front end.
A large number of parallel algorithms for volume rendering have been
recently proposed. Schroeder and Salem [13] have
proposed a shear based technique for the CM-2 that could render
volumes at multiple frames a second, using a low quality
filter. The main drawback of their technique is low image quality.
Their algorithm had to redistribute and resample the dataset for each
view change. Montani et al. [10] developed a distributed
memory ray tracer for the nCUBE, that used a hybrid image-based load
balancing and context sensitive volume distribution. An interesting
point of their algorithm is the use of clusters to generate higher
drawing rates at the expense of data replication. However, their
rendering times are well over interactive times. Using a different
volume distribution strategy but still a static data distribution, Ma
et al. [9] have achieved better frame rates on a CM-5. In their
approach the dataset is distributed in a K-d tree fashion and the
compositing is done in a tree structure. Others
[6,3,11] have used similar load balancing
schemes using static data distribution, for either image compositing
or ray dataflow compositing. Nieh and Levoy [12] have
parallelized an efficient volume ray caster [8] and achieved
very impressive performance on a shared memory DASH machine.
In this paper we concentrate on the parallelization of a simple but fast method for ray casting, called PARC (polygon assisted ray casting) [2]. Our parallel implementation uses a static data decomposition and an image compositing scheme. We have implementations that work on the Intel iPSC/860 and the Intel Paragon. In Section 2 we explain the important issues in designing and writing a parallel ray caster, followed by Section 3, where we study a new method for measuring the work done by a ray caster. In Section 4 we describe our algorithm and its implementation.