The algorithm used is the optimized ray casting described by Levoy [7,8]. To speed up ray casting, octree opacity enumeration, early ray termination, and adaptive ray tracing are used.
The volume dataset is partitioned among the clusters in an interleaved fashion. The workload, i.e. the image plane pixels, is divided into tiles that are assigned in equal parts to the processors. To achieve load balance, the algorithm lets processors steal work from each other. More specifically, the algorithm first loads the dataset in an interleaved fashion among the clusters and then creates a queue of image pixels for each processor. At this point each processor begins its task. When a processor finishes its own work, it is allowed to look for work in the other processors' queues. This approach tends to minimize the amount of time a processor stays idle.
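The tile queues and the stealing step described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation that was measured: Python threads stand in for the processors, the names make_tiles, render_with_stealing and render_tile are hypothetical, and the choice to steal from the back of a victim's queue is an assumption that the text does not specify.

```python
import threading
from collections import deque

def make_tiles(width, height, tile):
    """Split the image plane into disjoint square tiles (x0, y0, x1, y1)."""
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in range(0, height, tile)
            for x in range(0, width, tile)]

def render_with_stealing(tiles, num_procs, render_tile):
    """Give each worker an equal share of tiles; idle workers steal from others."""
    # Round-robin assignment gives each queue (nearly) the same number of tiles.
    queues = [deque(tiles[p::num_procs]) for p in range(num_procs)]
    locks = [threading.Lock() for _ in range(num_procs)]
    done = [[] for _ in range(num_procs)]  # tiles finished by each worker

    def take(p, from_back):
        # Pop one tile from queue p, or return None if that queue is empty.
        with locks[p]:
            if not queues[p]:
                return None
            return queues[p].pop() if from_back else queues[p].popleft()

    def worker(me):
        while True:
            work = take(me, from_back=False)
            if work is None:
                # Own queue is empty: look for work in the other queues.
                # (Stealing from the back is an assumption of this sketch.)
                for other in range(num_procs):
                    if other != me:
                        work = take(other, from_back=True)
                        if work is not None:
                            break
            if work is None:
                return  # every queue is empty, so this worker is finished
            render_tile(work)
            done[me].append(work)

    threads = [threading.Thread(target=worker, args=(p,))
               for p in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Because no tile is ever added back to a queue, a worker that finds every queue empty can stop safely, and each tile is rendered by exactly one worker regardless of how the stealing interleaves.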
This work-stealing scheme leaves the optimized ray casting algorithm unmodified, because it preserves the front-to-back rendering order. Also, because each processor traces the whole ray, it can make use of early ray termination and octree opacity enumeration.
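To illustrate why tracing the whole ray on one processor matters, the following minimal sketch composites the samples of a single ray front to back and stops once the accumulated opacity is high enough. The function name composite_ray, the scalar (grey-level) colors, and the 0.95 threshold are assumptions made for the example and are not taken from the text.

```python
def composite_ray(samples, opacity_threshold=0.95):
    """Composite (color, opacity) samples front to back along one ray.

    Returns (accumulated_color, accumulated_opacity, samples_used).
    The threshold value is an assumption of this sketch.
    """
    color_acc = 0.0
    alpha_acc = 0.0
    used = 0
    for color, alpha in samples:  # samples must be ordered front to back
        # Each sample is attenuated by the opacity accumulated in front of it.
        weight = (1.0 - alpha_acc) * alpha
        color_acc += weight * color
        alpha_acc += weight
        used += 1
        if alpha_acc >= opacity_threshold:
            break  # early ray termination: later samples are barely visible
    return color_acc, alpha_acc, used
```

The termination test depends on the opacity accumulated over all earlier samples of the same ray, which is why the optimization carries over unchanged when one processor handles the ray from start to finish.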
One problem is that adaptive ray tracing could require global synchronization, but this is mostly avoided by assigning disjoint square tile regions of the image to different processors. The performance results follow.