Some models of performance for parallel computers algorithm have been developed and the one described here is based on Stone [22]. Even though one needs to take into account the architectural overhead imposed by the system, by only counting the time of computation and the time of communication and synchronization, one can analyze the performance of algorithms on a given architecture.
In more specific terms, one has to assume the algorithm will have
several tasks and that it can be parallelized by running these tasks
concurrently. Let R be the number of units of time spent in the
computation of the tasks and let C be the number of units of time
spent in communication and any other overhead imposed by the chosen
algorithm. The ratio
can be used to determine the optimal point
of performance in a given architecture.
The ratio
can be used as a measure of task granularity:
is large, each unit
of computation produces a little communication.
is small, a considerable part of
the time is spent in communication.
A general rule of thumb is that if the ratio
is very small, then
one can use more processors in hopes that the communication is
to be spread over the processors. The ratio remains the same and
one can use a larger number of less powerful processors for the same
task. If on the other hand
is big, the optimum number of
processors will usually be smaller, and more powerful processors can
be used.
Another very important issue in parallel algorithm design is to match
the communication patterns supported by the machine being used. For
instance, in a shared memory machine, one may want to interleave the
data in memory to avoid hot spots
. Also, if the machine supports local memory, one
may want to allocate all the data that is not needed by the other
processor tasks in local memory to minimize bus bandwidth.
For distributed memory machines, one may want to interleave or limit global communication phases so that the network traffic does not hurt the performance. All of these issues are highly architecture and algorithm dependent but the general idea of minimizing and organizing communication to avoid contention for data is very useful in practice.
One important measure of performance for a parallel algorithms is its
speedup. This is probably the most misused measure in
benchmarking. While most authors define speedup to be the ratio
of the running time
of the best sequential algorithm for
the problem to the running time T of the parallel algorithm (
), it is often incorrectly measured as the ratio of the time
it takes for the parallel algorithm to run in one node configuration
to the time to run the algorithm in n nodes of the same machine.
Parallel algorithms for MIMD machines are discussed in [53], for a discussion on Teraflops machines (of the future) see [21], and for general information on high performance systems, see [22,15]. In the rest of this paper, several parallel algorithms will be described together with the specific machine used for the implementation.