Runtime support for workshare regions in OMPi

VVD & the OMPi team

May 2007
December 2006
June 2005

Abstract

We take a closer look at the problems posed by worksharing regions (for, sections, single) and their implementation in the runtime library of OMPi.

1 Introduction

OpenMP includes three worksharing directives that divide the work done by the threads of a team: single, sections and for. At the end of the code region associated with each of these directives there is an implied barrier for synchronization. As a consequence, all threads work within the same region before proceeding to the next one.

However, OpenMP provides a nowait option, whereby the implied barrier is avoided. This way, it is possible for different threads to move to different workshare regions, depending on their relative speed.
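
As a simple illustration (the arrays a, b and the functions f, g below are placeholders), a thread that finishes its share of the first loop below does not wait at its end; it proceeds immediately to the second loop, while slower threads may still be working on the first one:

    #pragma omp parallel
    {
      #pragma omp for nowait
      for (i = 0; i < N; i++)
        a[i] = f(a[i]);   /* no implied barrier here... */

      #pragma omp for
      for (i = 0; i < N; i++)
        b[i] = g(b[i]);   /* ...fast threads enter this loop early */
    }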

A directive will be called blocking if it does not have the nowait clause; otherwise it is non-blocking. Non-blocking directives create bookkeeping problems for the runtime library, because in worksharing regions the library needs to maintain the state of the threads so as to give out work to them. For example, iterations of a loop (especially under the guided and dynamic schedules) are subject to contention and are given away in a first-come, first-served manner; fast threads may get more iterations.

Things would be much simpler if it were guaranteed that all threads work in the same region, as the library would only have to maintain one set of bookkeeping data. However, the presence of non-blocking regions makes this very difficult.

In what follows, we present the problem in more detail and describe the implementation in OMPi’s runtime library.

2 The problem

For all three directives, the runtime library needs some kind of counter to record the number of threads that have passed through their regions. For example, in the sections case we need a counter x so as to assign the xth section to the xth arriving thread. For a single region, all but the first arriving thread should skip the region; a counter (or a flag) is needed for this. The counter (or flag) must also be reset when the last thread passes through, so a counter is again needed in order to discover which thread is the last one.
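
To make this concrete, here is a minimal sketch (not the actual OMPi code; all names are made up, and the structure is assumed to be zero-initialized with its lock ready for use) of how a lock-protected counter can drive a sections region and how a flag can drive a single region:

    #include <pthread.h>

    typedef struct {
      int             nextsection;  /* next section id to give away */
      int             arrived;      /* becomes 1 once a thread arrives */
      pthread_mutex_t lock;
    } wscount_t;

    /* Assign the xth section to the xth arriving thread;
       return -1 when all sections have been handed out. */
    int get_section(wscount_t *ws, int numsections)
    {
      int s = -1;
      pthread_mutex_lock(&ws->lock);
      if (ws->nextsection < numsections)
        s = ws->nextsection++;
      pthread_mutex_unlock(&ws->lock);
      return s;
    }

    /* Return 1 only to the first thread arriving at a single region. */
    int single_should_execute(wscount_t *ws)
    {
      int first;
      pthread_mutex_lock(&ws->lock);
      first = !ws->arrived;
      ws->arrived = 1;
      pthread_mutex_unlock(&ws->lock);
      return first;
    }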

Keeping just one counter to service all workshare regions is problematic (actually, impossible) in the presence of nowait directives. There are three problems:

  1. Inside a parallel region, there may exist non-blocking workshare directives of different types.
  2. Inside a parallel region, there may exist many non-blocking workshare directives of the same type.
  3. Inside a parallel region, one non-blocking directive may be used iteratively, many times.

So bookkeeping with just one counter is not feasible. Problem 1 can actually be solved quite easily if we keep 3 different counters: one for the single regions, one for the sections regions and one for the for regions.

Problem 2 is tougher. A complete solution would keep K counters in total, one for each of the directives inside the parallel region. This, however, requires help from the compiler: the compiler numbers the directives, giving each one a unique id. This id is then used by the library to index the corresponding counter. While it may seem that such help can easily be provided1, this is not the case in reality. For one, the compiler must also consider the orphaned workshare directives that might be called from within the parallel region. Things get more difficult if those orphaned directives appear in external functions (i.e. in other files) to which the compiler might not even have access (such as, e.g., a pre-compiled library).
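
For illustration, the code the compiler would have to emit for, say, the second single region of a parallel region might look as follows; the id argument and both function names are hypothetical:

    if (enter_single(/* region id = */ 1))   /* hypothetical runtime call */
    {
      <code>
    }
    leave_region(/* region id = */ 1);       /* hypothetical runtime call */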

Even if Problem 2 was solved as suggested above, we still have Problem 3 to solve, as illustrated in the following piece of code:

    #pragma omp parallel  
    {  
      for (i = 0; i < 10; i++)  /* This is not a parallel for */  
      {  
        #pragma omp single nowait  
        {  
          <code>  
        }  
        <code>  
      }  
    }

The single region here is utilized repeatedly, 10 times. All threads will encounter the region 10 times and each time only one of them will execute it. Threads proceed at different speeds and, due to the nowait clause, chances are that at any given moment different threads may have encountered the region a different number of times. As a result, even if this region had its own bookkeeping counter, this would not be enough! We need 10 different counters just for this region, so as to know how many threads have encountered the region for the ith time, for each i.2

3 Solutions to the problems

There exist quite simple solutions to the problems, but they are not always efficient. For example, the Omni compiler has adopted a scheme which assigns work to the threads statically.

In contrast, OMPi allows different threads to be in different regions and never assigns work statically. The solution currently adopted is highly optimized and does not even require help from the compiler.

Assume we have a team of N threads and a series of non-blocking workshare regions inside a parallel construct. We will use the term active region to denote a workshare region that at least one thread has entered but which not all N threads have left yet. When all N threads pass through, the region is no longer active (it will have “drained” out).

The problems of the previous section appear because at any given time there may exist many active regions. Notice that the active regions may actually be different regions in the code or the same region encountered multiple times; thus all 3 problems are actually one. If we knew the maximum possible number of active regions, we could use a different counter for each one of them and solve all three problems at once. Of course, a practical issue is that this number cannot be known at runtime, since due to Problem 3 the number of active regions may be completely unknown in advance.

Constructing and maintaining a dynamically allocated structure for the counters (i.e. allocating a new counter each time a new region is activated, as was done in OMPi versions up to 0.8.2) incurs quite an overhead, however. Starting with version 0.8.3, a better solution is adopted in OMPi. There exist in total MAX counters, arranged in an array of MAX positions; thus, at most MAX regions can be active at any moment. If at some point a thread moves to activate another region (i.e. a completely new region or a new encounter of a previously met region) while there are already MAX active regions, the thread blocks, waiting for one of the active regions to be completely drained.
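
Schematically, the idea is the following (a sketch, not the actual code; the counters are recycled in circular-buffer fashion):

    int counter[MAX];    /* one bookkeeping counter per region slot    */
    int headregion = 0;  /* incremented whenever a region is activated */
    int tailregion = 0;  /* incremented whenever a region drains out   */

    /* Slot used by the kth activated region; slots get recycled. */
    #define SLOT(k)  ((k) % MAX)

    /* All MAX slots busy: a thread activating a new region must wait. */
    #define FULL()   (headregion - tailregion == MAX)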

This scheme is relatively simple and, if MAX is not small, it allows a significant number of regions to be simultaneously active. Furthermore, it does not require any help at all from the compiler.

4 Workshare regions without nowait

The case of blocking workshare regions is quite simple. Because of the barrier at the end of such a region, it is guaranteed that while the region is active, all arriving threads will stick to it. Thus at most 1 such region can be active at any time and, consequently, only 1 counter is needed for the whole program.

5 Details of the implementation

The implementation lies in files ort.c and ort_prive.h, which constitute the core of OMPi’s runtime library (ort). There are MAXACTIVEREGIONS (MAX for short) active regions allowed. For each one there is an associated data structure (wsregion_t) that keeps the counters and related data. Thus there are MAX such structures (contained in ort_workshare_t), plus one more for the blocking regions. ort_workshare_t finally contains two counters that determine the number of active regions: headregion, which is incremented whenever a new region gets activated, and tailregion, which is incremented whenever a region is drained out (indexing into the array of structures is done mod MAX). All this bookkeeping information is held at the parent thread of the team. Each member of the team manipulates only two private variables: mynextNWregion, which indexes the next non-blocking region the thread is going to encounter, and a flag recording whether the region it currently works on is a blocking or a non-blocking one, so that the correct structure is used upon leaving the region.
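
In outline, the structures look roughly as follows (a simplified sketch of the declarations in ort_prive.h; field names not mentioned in the text, as well as the lock field and its type, are assumptions):

    typedef struct {     /* bookkeeping for one workshare region */
      int    inited;     /* deferred-initialization flag (Section 6) */
      int    empty;      /* 1 until the first thread enters */
      int    notleft;    /* # of threads that have not left yet */
      lock_t lock;       /* protects the fields above (assumed) */
      /* ... plus data specific to for/sections schedules ... */
    } wsregion_t;

    typedef struct {
      wsregion_t blocking;    /* the one structure for blocking regions */
      wsregion_t nonblocking[MAXACTIVEREGIONS];
      int        headregion;  /* ++ on every region activation */
      int        tailregion;  /* ++ on every region drain-out  */
    } ort_workshare_t;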

When a thread enters a region (see enter_workshare_region() in ort.c), it uses either the single structure reserved for blocking regions (if the region is a blocking one) or, otherwise, the structure of the ith non-blocking region, where i = mynextNWregion. In the latter case, the first thing the thread does is check whether it has to block. This is the case when the number of active regions has reached MAX and the thread is about to activate the head region, i.e. when headregion-tailregion==MAX && mynextNWregion==headregion. When the thread unblocks (or if it did not have to block at all), it increments its mynextNWregion and proceeds.
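
In outline, the entry protocol looks like this (a simplified sketch; ws points to the team’s ort_workshare_t, me to the calling thread’s private data, and blocking is nonzero if the region lacks a nowait clause):

    wsregion_t *r;

    if (blocking)
      r = &ws->blocking;
    else
    {
      r = &ws->nonblocking[me->mynextNWregion % MAXACTIVEREGIONS];

      /* Block while all MAX slots are in use and I am about to
         activate a brand new (head) region. */
      while (ws->headregion - ws->tailregion == MAXACTIVEREGIONS &&
             me->mynextNWregion == ws->headregion)
        ;   /* wait; the actual code does not busy-wait like this */

      me->mynextNWregion++;   /* next time I will use the next slot */
    }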

As can be seen from the definition of wsregion_t, each region structure has only one counter (notleft), which counts the number of threads that have not yet left the region, and a flag (empty) that becomes zero when the first thread enters the region.

The very first thread to enter the region finds empty equal to 1 and sets it to 0. It also increments headregion by 1, if the region is a non-blocking one. Finally, it sets notleft equal to the number of threads in the team and performs a few more initializations specific to the type (single, sections or for) of the workshare directive.

When a thread leaves a region (see leave_workshare_region() in ort.c), it decrements the notleft counter of its region by 1 and checks whether it became equal to 0; if so, the thread is the last one to leave the region and must reset the region’s empty flag to 1, as well as advance tailregion by 1.
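
Sketched as code (simplified; lock()/unlock() stand for whatever locking primitives the library uses, and blocking is as in the previous sketch):

    lock(&r->lock);
    if (--r->notleft == 0)    /* am I the last thread to leave? */
    {
      r->empty = 1;           /* recycle the region structure */
      if (!blocking)
        ws->tailregion++;     /* one fewer active region */
    }
    unlock(&r->lock);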

6 Some more details

Well, for all three workshare constructs, special actions need to be taken only by two threads: the first one to arrive and the last one to leave. This is why we only need a flag (empty) instead of a counter for the number of threads that have entered the workshare region. This flag has a nice side effect, too: entering threads do not have to use locks to check it; they can do a raw, unprotected check. If the flag is 0, a thread proceeds immediately, since it is certainly not the first one to arrive. Otherwise, it has to do an atomic check through locking (whereupon it will possibly discover that it is indeed the first to enter the region).
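
In code, this amounts to the classic test-then-lock pattern (a sketch; team_size is assumed to hold the number of threads in the team):

    if (r->empty)             /* raw, unprotected check first */
    {
      lock(&r->lock);
      if (r->empty)           /* re-check, now safely under the lock */
      {                       /* I am indeed the first thread to enter */
        r->empty   = 0;
        r->notleft = team_size;
        /* ... type-specific initializations (single/sections/for) ... */
      }
      unlock(&r->lock);
    }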

Also, the implementation uses an optimization to boost performance. Normally, when a new team of threads is created, all MAX region structures must be properly initialized (i.e. all their empty flags should be set to 1, the associated locks should be initialized, etc.). However, this is quite an overhead for the parent of the team. To avoid it, the parent initializes only the first of the regions and defers the initialization of the (i + 1)th region to the first thread that enters the ith region. This way, the initialization is effectively done in parallel. This is why an extra field (inited) is needed in wsregion_t.
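
As a sketch, the first thread to enter the ith region additionally executes something like the following; since exactly one thread is “first” for a given region, no locking is needed here (init_lock() is an assumed initializer):

    wsregion_t *next = &ws->nonblocking[(i + 1) % MAXACTIVEREGIONS];

    if (!next->inited)        /* prepare the next slot for future use */
    {
      init_lock(&next->lock);
      next->empty  = 1;
      next->inited = 1;
    }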

7 The maximum number of active regions

One parameter that must be defined wisely is the maximum number of allowable active regions (MAXACTIVEREGIONS, or MAX for short). In OMPi it is by default equal to 50. However, it might be more logical to relate it somehow to the expected number of running threads per team.

In any case, non-blocking regions are not widely employed in common applications, so MAX = 50 has proven to be a more-than-enough number. Through artificial benchmarks we have found that the gains from larger values of MAX are small.

We close by taking a look at the NAS benchmark suite (version 2.3), which includes 8 applications that make use of non-blocking regions.

1 Actually, this solution was adopted in OMPi up to version 0.8.2.

2 OMPi versions up to 0.8.2 used a dynamically allocated linked list of counters for each region.