OMPi’s runtime-library interface for FOR directives

Abstract

This document presents briefly the code OMPi’s parser produces and the calls the runtime library provides to handle parallel for regions.

1 Introduction

Assume the following general piece of OpenMP code, where without loss of generality we assume that ub ≥ lb and step > 0:

where xxx is one of static, dynamic, guided or runtime. Notice that if the terminating condition was i <= q, OMPi considers the equivalent form of i < ub, where ub = q + 1.¹

Since runtime ends up being (at runtime, not compile time) one the other three schedules, we will forget it for the moment. Also, chunksize may not be given; in such a case it is assumed to have a value of -1. The general transformation done by OMPi is the following:

   /* OMPi-generated code for dynamic and guided schedules */
   {
     int niters, iter, first, last, extra = 0;
     ...
     niters = ort_num_iters(lb, ub, step);  /* Total # iterations */
     get_chunk_func = ort_get_xxx_chunk;    /* Depending on the schedule type */
     while ( (*get_chunk_func)(niters, &first, &last, &extra) )
     {
       for (iter = first; iter < last; iter++)
       {
         i = lb + iter * step;              /* Calculate index variable */
         < LoopBody >
       }
     }
   }

where xxx is one of static, dynamic or guided (‘ort_’ stands for OMPi-Run-Time). The idea is that there are 3 different functions (one for each schedule type) which get the loop’s parameters and return a chunk of consecutive iterations for the thread to execute. This chunk’s first iteration is returned in first and its last one in last. Every thread calls repeatedly the function and each time it gets a new chunk to execute, until the function returns 0 (“false”). The extra variable is actually a dummy (ignored) parameter which is used only in a specific case in the runtime schedule case. We can safely ignore it for now. Also notice that, initially, the total number of iterations is caclulated (niters) and in each iteration, the actual value of the index variable (i) is dervied from the iteration number (item).

[ Actually, the inner loop is implemented as:

for (iter = first, i = lb + first*step; iter < last; iter++, i += step)
< LoopBody >

which is faster; we stick to the “slow” version here for presentation purposes. ]

This workframe fits nicely the dynamic and guided schedule types. The only issue with those is that the chunk a thread gets depends on the chunks already given away—i.e. there is contention for the chunks among threads. This requires some global bookeeping which is done at the parent of the team of threads (this explains the need for the forloop field in the workshare structure of the threads). In particular, the parent of the team must remember the next iteration to be given away. Functions ort_get_dynamic_chunk() and ort_get_guided_chunk() read and modify this (atomically) in order to decide on the next chunk to assign.

Things are a bit different, though, for the static schedule. This is because there is no contention for the chunks. The chunks are statically assigned to every thread and this is where optimizations can be achieved. A thread could calculate all the chunks / iterations it will get right from the beginning, without requiring a runtime call to some get_chunk_func for every chunk. The required calculations are described elsewhere. In conclusion, OMPi does not actually have an ort_get_static_chunk() function and the generated code applies only to the other two schedule types.

This clearly breaks the symmetry of the produced code, requiring different code to be produced for the static case, but it is an optimization worth having. We note though that we will be in need of the produced code’s symmetry later, when we discuss the runtime schedule.

2 Static schedules

There are two cases to consider. The first is the default static, where no chunksize was specified. In this case, each thread gets at most one chunk of iterations to execute. Thus there is no need to surround the for with a while block. The produced code is the simplest possible:

   /* OMPi-generated code for static schedules with no chunksize */
   {
     int niters, iter, first, last;
     ...
     niters = ort_num_iters(lb, ub, step);  /* Total # iterations */
     if ( get_get_static_default_chunk(niters, &first, &last) )
       for (iter = first; iter < last; iter++)
       {
         i = lb + iter * step;              /* Calculate index variable */
         < LoopBody >
       }
   }

Function ort_get_static_default_chunk() is called once and returns 0 (“false”) if no chunk should be executed by the thread.

The second case is more complex in that, depending on the specified chunksize, each thread may execute a series of chunks. The complete details are given in OMPi’s document “Static schedule for OpenMP parallel for”, June 2003, in case somebody wants to go into more depth. For everybody else, it suffices to give the final code (Fig. 1) and explain it a bit. The N threads get chunks in a round-robin fashion (thus the chid += N in the outer for), thread x starting from the xth chunk (hence the chid = omp_get_thread_num()). For each chunk, the first and last iterations are calculated easily and the rest is the same as in the previous cases.

   /* OMPi-generated code for static schedules with given chunksize */
   {
     int niters, iter, first, last;
     int chid /* chunk id */, N = omp_get_num_threads();

     niters = ort_num_iters(lb, ub, incr);
     for (chid = omp_get_thread_num(); ; chid += N)  /* My chunks */
     {
       first = chid * chunksize;        /* first iteration of this chunk */
       if (first >= niters) break;
       last = first + chunksize;        /* last iteration of this chunk */
       if (last > niters) last = niters;     /* take care of last chunk */

       for (iter = first; iter < last; iter++)
       {
         i = lb + iter * step;              /* Calculate index variable */
         < LoopBody >
       }
    }
    ort_leaving_for();

Figure 1: Produced code for the case of static schedules with a specified chunksize.

3 The runtime schedule

Well, we have already mentioned that this type of schedule ends up being one of the other three types. However, the decision is deferred to the run time, not the compile time. This creates the following needs:

   /* OMPi-generated code for the runtime schedule */
   {
     int niters, iter, first, last, extra = 0;
     ...
     niters = ort_num_iters(lb, ub, step);  /* Total # iterations */
     ort_get_runtime_schedule_stuff(&get_chunk_func, &chunksize);
     while ( (*get_chunk_func)(niters, &first, &last, &extra) )
     {
       for (iter = first; iter < last; iter++)
       {
         i = lb + iter * step;              /* Calculate index variable */
         < LoopBody >
       }
     }
   }

There is an ort_get_runtime_schedule_stuff() library call which returns the appropriate get_chunk_func function and the chunksize, as determined by the value of the OMP_SCHEDULE environmental variable.

It is obvious that, for the dynamic and guided cases, there is nothing more to be said or done. For the static schedule, though, we end up where we started from! We have to provide a compliant get_chunk_func for it only because of the runtime schedule.

From the previous sections, it should be obvious that this function will no longer provide for optimized code. Fortunately, this is not always the case. In particular, if no chunksize is given, a thread will get at most one chunk, so that at most two calls to ort_get_runtimestatic_chunk() will be made. On the other hand, if a chunksize is given and its value is very small (1 is the worst case), then every few (one at worst) iterations, a call to ort_get_runtimestatic_chunk() will be made. But even so, we don’t really care that much because the runtime schedule is used mostly for experimentation and for tuning the code. This is a poor excuse to avoid saying that, actually, even if we did care, there is not much we could do.

In order to make ort_get_runtimestatic_chunk() work, there are a few issues to be settled. In particular, this function needs to know which is the current chunk given away to the calling thread so as to assign the next chunk. Remember that the chunks are not subject to contention among threads. Each thread gets particular chunks that are calculated based on the loop’s parameters and its thread id. Since the chunks are given away one by one, ort_get_runtimestatic_chunk() has to know what was the last chunk assigned to each thread.

This information can be stored in two ways. One way is to keep it in the thread’s control structure and have ort_get_runtimestatic_chunk() retrieve it from and alter it there, for each thread. However this is not clean plus it actually requires each thread to keep a stack of such things due to possible nesting of parallel fors. The other way, which is what OMPi does, is to have the parser declare a private variable for each thread and pass it by reference to ort_get_runtimestatic_chunk(). This is the infamous ‘extra’ variable that we (and all other functions) have ignored up to now. This is initialized to 0, for each thread and represents the ‘id’ of the next chunk to be calculated.

  int ort_get_runtimestatic_chunk(int niters, int chunksize,
                                    int *first, int *last, int *chunkid)
{
    if (chunksize == -1) /* No chunksize given */
    {
      if (*chunkid) return (0);     /* Only 1 chunk will be given per thread! */
      *chunkid = 1;
      return ( ort_get_static_default_chunk(niters, fiter, liter) );
    }
    else                 /* chunksize given */
    {
      if (*chunkid == 0)    /* my very first chunk */
        *chunkid = omp_get_thread_num();
      else
        (*chunkid) += omp_get_num_threads();
      *first = chunksize * (*chunkid);
      if (*first >= niters)
        return (0);
      *last = *first + chunksize;
      if (*last > niters)
        *last = niters;
      return (1);
    }
  }

Figure 2: Library code for supporting runtime schedules that have a static value.

Fig. 2 shows the ort_get_runtimestatic_chunk() implementation. If you followed the previous sections closely, you will see that this function simply uses the ‘optimized’ routines mixed with the ‘optimized’ code the parser would produce, to give out the chunks.

¹This, however, breaks down if q = INT_MAX. One should avoid using such loop bounds when the <= operator is used for termination.