In a message passing parallel code each process (or CPU) on which the parallel code is running executes the same binary, but may follow a different path through the code (Single Program Multiple Data - SPMD). For example, the Fortran code:
      program hello
      include 'mpif.h'
      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,my_rank,ierr)
      call mpi_comm_size(mpi_comm_world,np,ierr)
      print *,'My process number is ', my_rank, ' of ',np, 'procs'
      call mpi_finalize(ierr)
      stop
      end

may produce the following output when run on a 4-processor cluster:
[nrcb@vmserve pbs]$ mpif77 -o hello hello.f
[nrcb@vmserve pbs]$ scrun -nodes=4 ./hello
SCORE: Connected (jid=2)
<0:0> SCORE: 4 nodes (4x1) ready.
My process number is 1 of 4procs
My process number is 3 of 4procs
My process number is 2 of 4procs
My process number is 0 of 4procs
[nrcb@vmserve pbs]$

MPI also allows memory to be copied from one process to another, enabling codes running on different machines to exchange data. This is where most of the interest in message passing codes lies. In general a message passing parallel job first partitions the data associated with a particular task and sends a separate portion to each node. Each node works on its own part of the data. In order to solve the global problem each node must communicate some of its own results to the other processes. Finally the global answer is delivered to the front end, where it is printed out or stored in a file. A rather trivial MPI code illustrates this:
      program sum
      include 'mpif.h'
      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,myid,ierr)
      call mpi_comm_size(mpi_comm_world,np,ierr)
      x = myid + 1
      print *,' The value of x on processor ', myid, ' is ', x
      call mpi_allreduce(x,xsum,1,mpi_real,mpi_sum,mpi_comm_world,ierr)
      if ( myid.eq.0) then
         print *,' Sum of all the x''s is ' , xsum
      endif
      call mpi_finalize(ierr)
      stop
      end

and when compiled and run using 4 CPUs:
[nrcb@vmserve pbs]$ mpif77 -o sum sum.f
[nrcb@vmserve pbs]$ scrun -nodes=4 ./sum
SCORE: Connected (jid=5)
<0:0> SCORE: 4 nodes (4x1) ready.
The value of x on processor 1 is 2.
The value of x on processor 3 is 4.
The value of x on processor 2 is 3.
The value of x on processor 0 is 1.
Sum of all the x's is 10.
[nrcb@vmserve pbs]$
By sharing a task in this way, a multi-processor system
can potentially solve problems much faster than a single-processor one.
In addition, because each processor only requires a portion of the
global data for a problem, much larger problems can be tackled
by a parallel machine.
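As a rough sketch of this point (the program name decomp.f and the array size are invented for illustration, and a Fortran 90 capable MPI compiler wrapper is assumed for the allocatable array), each process can allocate storage only for its own 1/np share of a notionally large array:

      program decomp
      include 'mpif.h'
      integer, parameter :: n = 1000000
      real, allocatable :: xlocal(:)

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,myid,ierr)
      call mpi_comm_size(mpi_comm_world,np,ierr)

c     each process stores only its own n/np block of the global array
c     (for simplicity we assume n is exactly divisible by np)
      nlocal = n/np
      istart = myid*nlocal
      allocate (xlocal(nlocal))
      do i = 1, nlocal
         xlocal(i) = real(istart + i)
      enddo
      print *,'process ',myid,' holds ',nlocal,' of ',n,' elements'
      deallocate (xlocal)
      call mpi_finalize(ierr)
      stop
      end

With 4 nodes each process holds only a quarter of the array, so the cluster as a whole can accommodate a problem roughly four times larger than would fit in the memory of a single node.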
Of course, the downside to message passing codes is that they are
considerably more difficult to write than scalar or shared memory
codes. Secondly, the system bus on a modern PC can pass
in excess of 4 Gbit/s between the memory and the CPU, whereas Fast Ethernet can only
pass 200 Mbit/s between machines over a single Ethernet cable.
There is therefore a potential bottleneck when passing data
between compute nodes unless we use a high performance network
solution.
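The effect of the network can be seen with a simple ping-pong test between two processes, written in the same style as the examples above. The file name pingpong.f, the message size and the single-trip timing are illustrative assumptions rather than part of any standard benchmark:

      program pingpong
      include 'mpif.h'
      parameter (n=250000)
      real buf(n)
      integer status(mpi_status_size)
      double precision t1, t2

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,myid,ierr)

c     bounce a 1 Mbyte message (250000 4-byte reals) between
c     ranks 0 and 1; the contents of buf do not matter
      t1 = mpi_wtime()
      if (myid.eq.0) then
         call mpi_send(buf,n,mpi_real,1,0,mpi_comm_world,ierr)
         call mpi_recv(buf,n,mpi_real,1,0,mpi_comm_world,status,ierr)
      else if (myid.eq.1) then
         call mpi_recv(buf,n,mpi_real,0,0,mpi_comm_world,status,ierr)
         call mpi_send(buf,n,mpi_real,0,0,mpi_comm_world,ierr)
      endif
      t2 = mpi_wtime()

c     2 Mbytes cross the network in time t2-t1
      if (myid.eq.0) then
         print *,' round trip time = ', t2-t1, ' s,  bandwidth = ',
     &           2.0*4.0*n/(t2-t1)/1.0e6, ' Mbytes/s'
      endif
      call mpi_finalize(ierr)
      stop
      end

When run across two nodes (for example scrun -nodes=2 ./pingpong), the reported bandwidth is limited by the interconnect rather than the memory bus, which is exactly the bottleneck described above.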
As well as MPI there exist other, equivalent message passing libraries
that are freely available, for instance BSP (Bulk Synchronous Parallel)
and PVM (Parallel Virtual Machine). MPI is still under
active development and continues to receive publicly available bug fixes
and upgrades. It is regarded worldwide as the standard
message passing system. The advantage of using MPI is that
it is by far the most widely used system: most commercial parallel
codes use MPI rather than alternative message passing libraries.
Secondly, many hardware vendors, realising the popularity of MPI, will
only release drivers for their hardware that work under MPI.
For example, both Myricom and Dolphinics produce hardware specially
designed for message passing systems and operating at 2 Gbit/s, and effective
MPI drivers are available for these products.
The availability of high performance networking and the associated software
drivers allows us to build scalable parallel machines. In other words,
if we run an MPI code on such a machine we would expect the execution speed
to increase directly with the number of processors, even up to a large
number of CPUs. For example, using 10 CPUs we would expect the code to run
(approximately) 10 times faster, using 20 CPUs 20 times faster, and so on.
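A simple way to check this behaviour on a given cluster is to time a fixed amount of work shared among the processes with mpi_wtime and compare runs at different node counts. The program below (scaling.f, with a purely synthetic workload, both invented for illustration) is one such sketch:

      program scaling
      include 'mpif.h'
      parameter (nwork=10000000)
      double precision t1, t2, work, total

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,myid,ierr)
      call mpi_comm_size(mpi_comm_world,np,ierr)

c     a fixed total amount of work is shared cyclically among
c     the np processes
      t1 = mpi_wtime()
      work = 0.0d0
      do i = myid+1, nwork, np
         work = work + sqrt(dble(i))
      enddo
      call mpi_allreduce(work,total,1,mpi_double_precision,
     &                   mpi_sum,mpi_comm_world,ierr)
      t2 = mpi_wtime()

c     on a scalable machine the elapsed time should fall roughly
c     in proportion to the number of processors used
      if (myid.eq.0) then
         print *,' result = ', total, '  elapsed time = ', t2-t1
      endif
      call mpi_finalize(ierr)
      stop
      end

Running the same binary with scrun -nodes=1, -nodes=2, -nodes=4 and so on, and dividing the single-node time by each elapsed time, gives the measured speedup; on a scalable machine this should stay close to the number of nodes used.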
A PC cluster networked solely by a single Fast Ethernet, in contrast, is in general not scalable. The execution time of a code on such a machine may reach a minimum for as few as 4 or 8 processors and then start to increase, depending upon the application. In other words, if a machine is not scalable, then doubling the number of CPUs may actually cause the code to perform worse.