Using m_fork for parallel processing

"Sequent Parallel Programming Primitives"

Written by Paul Bourke
January 1998

There are many ways of taking advantage of machines with multiple processors. One characteristic that distinguishes the many methods is the degree to which the developer must modify the code based on his/her knowledge of what the program is doing.

Totally automatic methods may be as simple as using compiler flags (eg: -pca). An example of a semiautomatic method is the use of compiler directives in the code (eg: #pragma parallel). The user locates those portions of the code that may run in parallel and the compiler is given information on how that might be accomplished. A totally manual method is one where the developer actually writes the C code in parallel.

The first totally automatic method described above is certainly the easiest, but it is also the one likely to yield the smallest speed increase. There is no harm in trying totally automatic methods, but the author's experience with -pca in particular has been unsatisfactory; there have been serious bugs in the PCA compiler in the past!

The following describes one method which was used to speed up existing software to take advantage of a 12 processor SGI Power Challenge. The code already existed and was relatively simple, the time-consuming parts consisting of long "for" loops. The operations within these loops were independent, and thus chunks of the outermost loops could readily be split up and run in parallel. The particular approach was to use the routines

   int m_get_numprocs(void);
   int m_set_procs(int numprocs);
   int m_fork(void (*func)(), ...);
   int m_get_myid(void);
   int m_kill_procs(void);

The basic idea is as follows: consider a C function which does something significant inside a large for loop. Note that the code presented here is intentionally trivial in order to illustrate the ideas.

void DoSomething(void)
{
   int i;

   for (i=0;i<N;i++) {
      /* Do something, preferably significant, with N preferably large, */
      /* where the i1'th iteration is independent of any other i2'th.   */
   }
}

This can be modified to work with m_fork() as shown below:
void DoSomething(void)
{
   int i;
   int nproc,theid;

   nproc = m_get_numprocs();
   theid = m_get_myid();

   for (i=theid*N/nproc;i<(theid+1)*N/nproc;i++) {
      /* whatever was in the loop before goes here unmodified */
   }
}

This obviously relies on the fact that the process ids increase by 1 from 0 (the parent).

The code that calls this function might do something like

   flag = m_fork((void (*)())DoSomething);

The for loop is simply split across nproc processes, which on a multiprocessor machine will run on different processors. Interestingly, even on a single processor machine the above can yield significant speed improvements as a result of cache effects.


Any attempt at parallelisation needs to be carefully tested as the relationship between available processors, machine load, and parallel tasks is not easily predictable.

The following shows the speed up factor for a 12 processor machine under "commonly" encountered load conditions. The loop size is increased in powers of 10 from 100, the number of processes increased from 1 to 14. The lines plotted are the ratio of the time taken to the time taken for a single processor.

While the numerical details of this graph are not important, the general characteristics are common.


The code I use for performing timing when evaluating parallel algorithms is

#include <sys/resource.h>
#include <sys/time.h>

double GetRunTime(void)
{
   struct rusage ru;
   double sec = 0;

   getrusage(RUSAGE_SELF,&ru);
   sec += ru.ru_utime.tv_sec;
   sec += ru.ru_utime.tv_usec * 1e-6;
   sec += ru.ru_stime.tv_sec;
   sec += ru.ru_stime.tv_usec * 1e-6;

   return(sec);
}
It returns the user plus system time in seconds; call it before and after a piece of code and take the difference.

m_fork Man Pages

M_FORK(3P)                                                          M_FORK(3P)

NAME
     m_fork, m_kill_procs, m_set_procs, m_get_numprocs, m_get_myid, m_next,
     m_lock, m_unlock, m_park_procs, m_rele_procs, m_sync  - parallel
     programming primitives

SYNOPSIS
     #include <ulocks.h>
     #include <task.h>

     int m_fork (void (*func)(), ...);
     int m_kill_procs (void);
     int m_set_procs (int numprocs);
     int m_get_numprocs (void);
     int m_get_myid (void);
     unsigned m_next (void);
     void m_lock (void);
     void m_unlock (void);
     int m_park_procs (void);
     int m_rele_procs (void);
     void m_sync (void);

DESCRIPTION
     The m_fork routine creates n-1 processes that execute the given func in
     parallel with the calling process.  The processes are created using the
     sproc(2) system call, and share all attributes (virtual address space,
     file descriptors, uid, etc.).  Once the processes are all created, they
     each start executing the subprogram func taking up to six arguments as
     passed to m_fork. The arguments passed must not be larger than pointers
     in size, i.e. floating point numbers must be passed by reference.

     The processes execute the subprogram and wait until they all return, at
     which point the m_fork call returns.  The other threads are left busy-
     wait spinning until the master again calls m_fork.  The overhead of
     creating the additional processes is incurred in the first call to
     m_fork and is not repeated.  m_fork sets a flag (see prctl(2)) so that
     if any of the threads terminate, they will all receive a SIGTERM
     signal.  Certain operations such as profiling require that the various
     threads exit normally (via the exit(2) call).  This can best be done by
     using m_kill_procs to terminate the slave threads, then have the master
     itself exit.

     The number of subtasks n can be set and queried using m_set_procs and
     m_get_numprocs, where the default is the number of processors in the
     system (and hence the maximum number of processes that can truly be run
     in parallel).  Note that although up to 256 tasks may be requested, the
     busy-wait rendezvous mechanism will usually result in a large performance
     loss with any more threads than processors.

     When the processes are created, each is assigned a unique thread
     identifier (its tid).  This identifier can be obtained through
     m_get_myid. Thread id's range from 0 to n-1.

     A global counter and a global lock are provided to simplify
     synchronization between the processes.  On each m_fork call, the counter
     is reset to zero.  The counter value is gotten and post incremented
     through the m_next routine.  The first time m_next is called, it returns
     a zero.  The global lock is set and unset through m_lock and m_unlock.

     The m_park_procs and m_rele_procs routines are provided to suspend and
     resume the
     child processes created by m_fork. This is useful if you have a phase of
     the program where the parent will do setup or reinitialization code and
     you do not want to have the children spinning and wasting resources.
     m_park_procs should not be called when processes are already suspended.

     m_sync is provided to synchronize all threads at some point in the code.
     When m_sync is called by each thread, it waits at that point for all
     other threads to call m_sync. The global counter is reset, and all
     threads resume after the m_sync call.

     m_kill_procs terminates the processes created from the previous m_fork.

     Most errors from m_fork come from either sproc(2) or usinit(3P).  See
     those manual pages for more information on errors.

     These primitives are based on the Sequent Computer Systems parallel
     programming primitives, but may not conform to all Sequent semantics.

     Diagnostic output analyzing the failures in m_lock(3P) and the routines
     that m_fork calls can be gotten using the _utrace mechanism discussed in
     the usinit(3P) manual page.

SEE ALSO
     sproc(2), blockproc(2), prctl(2), barrier(3P), usinit(3P), ussetlock(3P).

DIAGNOSTICS
     m_fork, m_set_procs, m_park_procs, m_rele_procs, and m_kill_procs return
     a 0 when successful, and a -1 with errno set upon failure.
     m_get_numprocs, m_get_myid, and m_next all return integers.  m_lock,
     m_unlock, and m_sync return no value.