The ray casting compositing scheme simply accumulates, for each image pixel, the opacity along a ray defined by a direction and a starting voxel in the 3D dataset. This gives rise to two simple ways to parallelize the basic sequential algorithm: treating each pixel as a separate task, or treating each scan line as a separate task.
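The two decompositions can be sketched as follows, reading them as one task per pixel and one task per scan line (the terms used in Figure 3). This is only an illustration: the function names are hypothetical, and the front-to-back accumulation formula is a common compositing choice assumed here, not necessarily the one used in this work.

```python
def composite_opacity(samples):
    """Accumulate opacity front to back along one ray.

    `samples` holds the opacity (0..1) found at each step along the ray.
    The update rule below is an assumed, commonly used formulation.
    """
    accumulated = 0.0
    for alpha in samples:
        accumulated += (1.0 - accumulated) * alpha
    return accumulated


def pixel_tasks(width, height):
    """Pixel-based decomposition: every task contains exactly one pixel."""
    return [[(x, y)] for y in range(height) for x in range(width)]


def scanline_tasks(width, height):
    """Scan-line decomposition: every task contains one full image line."""
    return [[(x, y) for x in range(width)] for y in range(height)]
```

For a 512 x 512 image, the first scheme produces 262,144 tasks and the second only 512, which gives a sense of how much more task-generation work the pixel-based version requires.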
Because of the longer access times associated with its distributed shared memory, memory partitioning on the BBN TC2000 is particularly important: the delay in accessing non-local memory substantially affects the performance of the algorithms. The 3D dataset has been scattered across shared memory by Z planes, with each processor storing one or more planes (depending on the number of processors being used). The 2D image buffer is also scattered across memory, with each processor keeping one or more lines of the image. All other information, such as the geometrical description of the volume and the color tables, is stored in local memory.
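As a rough illustration of this layout, the sketch below maps Z planes (or, equally, image lines) to processors. The text does not say whether the scattering is interleaved or done in contiguous blocks; a round-robin (interleaved) assignment is assumed here, and the function names are hypothetical.

```python
def owner_of(index, num_procs):
    """Processor that stores plane (or image line) `index`.

    Round-robin placement is an assumption made for this sketch.
    """
    return index % num_procs


def items_stored_by(proc, num_items, num_procs):
    """All plane (or image line) indices stored on processor `proc`."""
    return list(range(proc, num_items, num_procs))


def is_local(index, proc, num_procs):
    """True if processor `proc` can read item `index` from its own memory."""
    return owner_of(index, num_procs) == proc
```

With 8 planes and 3 processors, for example, processor 1 would hold planes 1, 4 and 7 under this assumption, and any access to another plane would pay the non-local memory delay.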
Figure 3: Difference in speedup of scan-line and pixel based ray casting
If all operating system and memory contention overhead were ignored, the two approaches above should give the same performance, but as shown in Figure 3 the results are quite different. Even though volume rendering is very computationally intensive, the overhead of generating a new task for each pixel limits the scalability of the algorithm. The algorithm works by having a single processor, the coordinator, generate the tasks for the others. The coordinator will clearly become a bottleneck if the number of processors is large relative to the computation time of each task. In addition, trilinear interpolation is needed at every step along a ray, and each interpolation requires the eight voxels closest to the sample point, so the scan-line method can take better advantage of locality of access (cache use). The speedups are shown in Figure 3.
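To make the locality argument concrete, here is a minimal sketch of trilinear interpolation at one sample point. It assumes the volume is stored as nested lists indexed `volume[z][y][x]` and that coordinates are non-negative; the function name and the storage layout are illustrative, not taken from the implementation described here.

```python
def trilinear(volume, x, y, z):
    """Interpolate a scalar value at (x, y, z) from the 8 surrounding voxels."""
    x0, y0, z0 = int(x), int(y), int(z)        # lower corner of the cell
    x1 = min(x0 + 1, len(volume[0][0]) - 1)    # upper corner, clamped
    y1 = min(y0 + 1, len(volume[0]) - 1)
    z1 = min(z0 + 1, len(volume) - 1)
    fx, fy, fz = x - x0, y - y0, z - z0        # fractional offsets in the cell

    def lerp(a, b, t):
        return a + (b - a) * t

    # Four interpolations along x, using all eight voxels of the cell.
    c00 = lerp(volume[z0][y0][x0], volume[z0][y0][x1], fx)
    c10 = lerp(volume[z0][y1][x0], volume[z0][y1][x1], fx)
    c01 = lerp(volume[z1][y0][x0], volume[z1][y0][x1], fx)
    c11 = lerp(volume[z1][y1][x0], volume[z1][y1][x1], fx)
    # Two along y, then one along z.
    c0 = lerp(c00, c10, fy)
    c1 = lerp(c01, c11, fy)
    return lerp(c0, c1, fz)
```

Each sample reads two adjacent Z planes, and rays through neighboring pixels of the same scan line tend to read the same or adjacent voxels, which is why a processor that handles a whole line can reuse data it has already fetched.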