Ray casting is the basis for the parallel algorithm described below. The parallel algorithm has three parts. First it must divide the task among the processors, second each processor will compute its part of the image and then all the processors will send their computed image parts to the host.
One idea used in the algorithm is to divide the hypercube in smaller
hypercube clusters, replicate in each cluster the whole dataset, but
give to each cluster a different part of the image. The reason for
this procedure is to minimize the number of messages inside each
cluster. To illustrate the idea, imagine a situation where all the
processors want to communicate with each other. If one had only one
cluster with n processors, in theory, each processor could be
communicating with each other at any one instance in time, this gives
raise to
messages. If one has c clusters the number
of messages in each cluster
goes down to
, thus giving a real gain in bus bandwidth. The following
description assumes, without lost of generality, only one cluster.
Figure 5: Slice division of the volume dataset on the nCUBE.
The same division is replicated in each cluster.
The data is divided among processors as shown in Figure 5. The division of the object space follows the division of data. After a processor has received its data it will begin the computation by intersecting all the possible pixels from the image plane with its part of the object space, thus determining which rays to trace.
A processor
will only follow a given ray inside its own domain,
if a ray has gone outside its boundary it will get sent to the
processor
that contains that data,
will send the computed
value back when it is done with that ray. One problem with this simple
approach is if processor
has a full buffer when processor
sends the message. That could cause a deadlock as processor
can
not terminate its computation of the pixel value as it may loose
valuable information while its buffer is full. To avoid deadlock, all
messages required acknoledmengts and there is a maximum number of
messages a processor can send to another.
To gather information at the end, a token approach is used. Every processor receives a token in the beginning of the computation, once it is done, it sends its token, together with its computed image to the next processor closer to the image plane, as the process continues the closest processor to the image plane sends its token to the host, that keeps track when the computation is finished. The communication pattern used and the token scheme is illustrated in Figure 6.
Figure 6: Communication pattern for ray tracing on the nCUBE. Worker
W4 is the closest to the image plane, it sends the last token to the host.
The performance of the algorithm just described is claimed to be
very good. The authors claimed to have an efficiency
higher than 0.80 for most processors configurations. The
time of the computation of their algorithm is as low as 4.75 seconds
for datasets smaller than
, when using 128 processors and
clusters of size 2. It is clear that in general, the smaller the
cluster, the faster their algorithm performs.