In a message passing parallel code each process (or CPU) on which the parallel code is running executes the same binary, but may follow a different path through the code (Single Program Multiple Data - SPMD). For example, the Fortran code:
      program hello
      include 'mpif.h'
      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,my_rank,ierr)
      call mpi_comm_size(mpi_comm_world,np,ierr)
      print *,'My process number is ', my_rank, ' of ',np, 'procs'
      call mpi_finalize(ierr)
      stop
      end

may produce the following output when run on a 4-processor cluster:
[nrcb@vmserve pbs]$ mpif77 -o hello hello.f
[nrcb@vmserve pbs]$ scrun -nodes=4 ./hello
SCORE: Connected (jid=2)
<0:0> SCORE: 4 nodes (4x1) ready.
My process number is 1 of 4procs
My process number is 3 of 4procs
My process number is 2 of 4procs
My process number is 0 of 4procs
[nrcb@vmserve pbs]$

MPI also allows memory to be copied from one process to another, enabling codes running on different machines to exchange data. This is where most of the interest in message passing codes lies. In general a message passing parallel job first partitions the data associated with a particular task and sends a separate portion to each node. Each node works on its own part of the data. In order to solve the global problem each node must communicate some of its own results to the other processes. Finally the global answer is delivered to the front end, where it is printed out or stored in a file. A rather trivial MPI code illustrates this:
      program sum
      include 'mpif.h'
      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,myid,ierr)
      call mpi_comm_size(mpi_comm_world,np,ierr)
      x = myid + 1
      print *,' The value of x on processor ', myid, ' is ', x
      call mpi_allreduce(x,xsum,1,mpi_real,mpi_sum,mpi_comm_world,ierr)
      if ( myid.eq.0) then
         print *,' Sum of all the x''s is ' , xsum
      endif
      call mpi_finalize(ierr)
      stop
      end

and when compiled and run using 4 CPUs:
[nrcb@vmserve pbs]$ mpif77 -o sum sum.f
[nrcb@vmserve pbs]$ scrun -nodes=4 ./sum
SCORE: Connected (jid=5)
<0:0> SCORE: 4 nodes (4x1) ready.
The value of x on processor 1 is 2.
The value of x on processor 3 is 4.
The value of x on processor 2 is 3.
The value of x on processor 0 is 1.
Sum of all the x's is 10.
[nrcb@vmserve pbs]$
By sharing a task in this way, a multi-processor system
can potentially solve problems much faster than a single-processor one.
In addition, because each processor only requires a portion of the
global data for a problem, much larger problems can be tackled
by a parallel machine.
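As a rough sketch of this point (the program name decomp.f and the array size are invented for illustration, and a Fortran 90 capable MPI compiler wrapper is assumed for the allocatable array), each process can allocate storage only for its own 1/np share of a notionally large array:

      program decomp
      include 'mpif.h'
      integer, parameter :: n = 1000000
      real, allocatable :: xlocal(:)

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,myid,ierr)
      call mpi_comm_size(mpi_comm_world,np,ierr)

c     each process stores only its own n/np block of the global array
c     (for simplicity we assume n is exactly divisible by np)
      nlocal = n/np
      istart = myid*nlocal
      allocate (xlocal(nlocal))
      do i = 1, nlocal
         xlocal(i) = real(istart + i)
      enddo
      print *,'process ',myid,' holds ',nlocal,' of ',n,' elements'
      deallocate (xlocal)
      call mpi_finalize(ierr)
      stop
      end

With 4 nodes each process holds only a quarter of the array, so the cluster as a whole can accommodate a problem roughly four times larger than would fit in the memory of a single node.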
Of course, the downside to message passing codes is that they are
considerably more difficult to write than scalar or shared memory
codes. Secondly, the system bus on a modern PC can pass
in excess of 4 Gbit/s between the memory and the CPU, whereas Fast Ethernet can only
pass 200 Mbit/s between machines over a single Ethernet cable.
There is therefore a potential bottleneck when passing data
between compute nodes unless we use a high performance network
solution.
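The effect of the network can be seen with a simple ping-pong test between two processes, written in the same style as the examples above. The file name pingpong.f, the message size and the single-trip timing are illustrative assumptions rather than part of any standard benchmark:

      program pingpong
      include 'mpif.h'
      parameter (n=250000)
      real buf(n)
      integer status(mpi_status_size)
      double precision t1, t2

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,myid,ierr)

c     bounce a 1 Mbyte message (250000 4-byte reals) between
c     ranks 0 and 1; the contents of buf do not matter
      t1 = mpi_wtime()
      if (myid.eq.0) then
         call mpi_send(buf,n,mpi_real,1,0,mpi_comm_world,ierr)
         call mpi_recv(buf,n,mpi_real,1,0,mpi_comm_world,status,ierr)
      else if (myid.eq.1) then
         call mpi_recv(buf,n,mpi_real,0,0,mpi_comm_world,status,ierr)
         call mpi_send(buf,n,mpi_real,0,0,mpi_comm_world,ierr)
      endif
      t2 = mpi_wtime()

c     2 Mbytes cross the network in time t2-t1
      if (myid.eq.0) then
         print *,' round trip time = ', t2-t1, ' s,  bandwidth = ',
     &           2.0*4.0*n/(t2-t1)/1.0e6, ' Mbytes/s'
      endif
      call mpi_finalize(ierr)
      stop
      end

When run across two nodes (for example scrun -nodes=2 ./pingpong), the reported bandwidth is limited by the interconnect rather than the memory bus, which is exactly the bottleneck described above.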
As well as MPI there exist other, equivalent message passing libraries
that are freely available, for instance BSP (Bulk Synchronous Parallel)
and PVM (Parallel Virtual Machine). MPI is still under
active development and continues to receive publicly available bug fixes
and upgrades. It is regarded worldwide as the standard
message passing system. The advantage of using MPI is that
it is by far the most widely used system: most commercial parallel
codes use MPI rather than alternative message passing libraries.
Secondly, many hardware vendors, realising the popularity of MPI, will
only release drivers for their hardware that work under MPI.
For example, both Myricom and Dolphinics produce hardware specially
designed for message passing systems and operating at 2 Gbit/s, and effective
MPI drivers are available for these products.
The availability of high performance networking and the associated software
drivers allows us to build scalable parallel machines. In other words,
if we run an MPI code on such a machine we would expect the execution speed
to increase directly with the number of processors, even up to a large
number of CPUs. For example, using 10 CPUs we would expect the code to run
(approximately) 10 times faster, using 20 CPUs 20 times faster, and so on.
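A simple way to check this behaviour on a given cluster is to time a fixed amount of work shared among the processes with mpi_wtime and compare runs at different node counts. The program below (scaling.f, with a purely synthetic workload, both invented for illustration) is one such sketch:

      program scaling
      include 'mpif.h'
      parameter (nwork=10000000)
      double precision t1, t2, work, total

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,myid,ierr)
      call mpi_comm_size(mpi_comm_world,np,ierr)

c     a fixed total amount of work is shared cyclically among
c     the np processes
      t1 = mpi_wtime()
      work = 0.0d0
      do i = myid+1, nwork, np
         work = work + sqrt(dble(i))
      enddo
      call mpi_allreduce(work,total,1,mpi_double_precision,
     &                   mpi_sum,mpi_comm_world,ierr)
      t2 = mpi_wtime()

c     on a scalable machine the elapsed time should fall roughly
c     in proportion to the number of processors used
      if (myid.eq.0) then
         print *,' result = ', total, '  elapsed time = ', t2-t1
      endif
      call mpi_finalize(ierr)
      stop
      end

Running the same binary with scrun -nodes=1, -nodes=2, -nodes=4 and so on, and dividing the single-node time by each elapsed time, gives the measured speedup; on a scalable machine this should stay close to the number of nodes used.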
A PC cluster networked solely by a single Fast Ethernet, in contrast, is in general not scalable. The execution time of a code on such a machine may reach a minimum for as few as 4 or 8 processors and then start to increase, depending upon the application. In other words, if a machine is not scalable, then doubling the number of CPUs may actually cause the code to perform worse.