# OpenMP Optimizations
Who knew parallelisation would be so easy!
I recently worked on a project that involved extensive use of OpenMP. It may be restricted in what it can do, but believe me when I say this: it is one of the easiest and most beginner-friendly frameworks for concurrency and parallel programming.
While the technical details are not important here, OpenMP works primarily through compiler directives. It often outperforms both std::async and std::thread, and it heavily optimises your code. The most obvious place you benefit from parallel code is loops, and loops are where OpenMP's support is strongest (a minimal sketch follows the list below). I was asked multiple times during my project why I preferred it over something native to C++ like std::thread. There are a multitude of reasons.
- std::thread is STL. As with all things, STL implementations suck at anything other than competitive programming (too bad Codeforces does not support threads).
- std::thread asks you to write purely functional … well, everything. It becomes an unnecessary pain to even think about complicated algorithms that absolutely need a non-constant data structure. Sure, functional programming is beautiful, and mastery over it is probably complete command over programmable logic, but past a certain point it just makes everything hard.
- It does not (more like cannot) inline ordinary functions. Functors, yes, maybe, but why should I wrap everything in a class of its own and add more assembly overhead just for the compiler to know what the function signature is?
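To make the loop claim concrete, here is a minimal sketch of the kind of loop OpenMP parallelises with a single directive; the function and array here are illustrative, not from the project:

```cpp
#include <cstdint>
#include <vector>

// Sum of squares over a large array: one pragma splits the iterations
// across threads, and the reduction clause safely combines partial sums.
uint64_t sum_of_squares(const std::vector<uint32_t> &v) {
    uint64_t total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (size_t i = 0; i < v.size(); ++i)
        total += static_cast<uint64_t>(v[i]) * v[i];
    return total;
}
```

Doing the same with std::thread means partitioning the range, spawning and joining the threads, and combining the partial results by hand.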
These are just a few of the many issues with it. I haven't studied std::async in much detail, so I don't think I can comment on it. Some people may also recommend CUDA, but it's just too low level for my job. All this being said, just because OpenMP is easy doesn't mean it doesn't come with its own set of problems. To its credit, though, the manual has a solution for pretty much anything you could possibly have a query about.

## Problems I faced during the project
### Implementation of a thread-safe queue
My original plan was a heap-allocated array with a push pointer and a pop pointer. What I required for BFS was a queue with multiple producers and a single consumer. Since my implementation was never going to run into the problem of popping from an empty queue, I had the great option of making it lock-free.
Lock-free structures are much easier to use and really very efficient; a lock just slows things down.

There was a simple solution I came up with: make the push operation atomic. OpenMP protects the location an atomic operation touches (with hardware atomics where it can, a lock otherwise), and since atomic operations are extremely cheap (really just simple writes, increments and other simple binary operations), the protection had little to no overhead on the program.
My initial push was something along these lines:
```cpp
#include <cstdint>

struct TSQueue {
    uint64_t push_pointer = 0;
    uint64_t pop_pointer = 0;
    static constexpr uint64_t capacity = 1 << 15;
    uint16_t *ts_queue = new uint16_t[capacity];

    TSQueue() {}
    ~TSQueue() { delete[] ts_queue; }

    void push(uint16_t node) {
        #pragma omp atomic write
        this->ts_queue[push_pointer] = node;
        #pragma omp atomic update // update is the default
        push_pointer++;
    }

    uint16_t pop() {
        // single consumer, so no synchronisation needed here
        uint16_t var = this->ts_queue[pop_pointer];
        pop_pointer++;
        return var;
    }
};
```
There is one very obvious problem here, and that is a data race. Suppose we have two threads, A and B, both about to push a node into the queue. A reads push_pointer and sees some value, say 5. At the same time, B also reads push_pointer as 5. A writes to the queue at index 5, then B does the same, which is an issue because we lose a node in the process. Then an arguably worse thing happens: both of them increment push_pointer, leaving a garbage value at ts_queue[6]. This screws up the entire process.
One fix is to go deep into the compiler and change the pragma (or redefine it) so that the lock OpenMP takes is held until I have incremented the pointer as well. (Technically we don't even need the atomic, since OpenMP guarantees that anything declared outside the parallel region lives in shared memory, but the race still stays.)
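For reference, standard OpenMP can express the "hold it until the increment is done" idea without touching the compiler; this is my sketch, not what the project did, using an atomic capture to reserve a slot before writing:

```cpp
// Reserve the slot and bump the pointer in one atomic step, then
// write into the slot this thread now exclusively owns.
void TSQueue::push(uint16_t node) {
    uint64_t slot;
    #pragma omp atomic capture
    slot = push_pointer++;   // fetch-and-increment as a single atomic op
    ts_queue[slot] = node;   // no other thread can get this index
}
```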
Another fix was to inline the push and have the pushes happen in a for loop, with each thread writing to a slot indexed by its thread number. The iteration over the vertices can then be blocked into chunks the size of the number of threads.
As with all things, however, it wasn't that easy, and it came with its own set of problems. It is not guaranteed that every thread will insert a value into the queue, so the slots need a known garbage value we can check for. That adds an extra O(n) loop (which can itself be parallelised), but it doesn't change the complexity of the code. A sketch of the scheme follows.
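Here is a minimal sketch of that scheme under my assumptions about the loop structure; the names, the sentinel, and the frontier arrays are all illustrative:

```cpp
#include <omp.h>
#include <cstdint>

constexpr uint16_t SENTINEL = UINT16_MAX; // the known garbage value

// Walk the vertices in chunks of num_threads; within a chunk, thread t
// owns slot (base + chunk + t), so no two threads write the same index.
// A thread with nothing to push leaves the sentinel for the consumer to skip.
void push_frontier(uint16_t *ts_queue, uint64_t base,
                   const uint16_t *vertices, const bool *discovered,
                   uint64_t n_vertices) {
    uint64_t nt = omp_get_max_threads();
    for (uint64_t chunk = 0; chunk < n_vertices; chunk += nt) {
        #pragma omp parallel
        {
            uint64_t i = chunk + omp_get_thread_num();
            if (i < n_vertices)
                ts_queue[base + i] = discovered[i] ? vertices[i] : SENTINEL;
        }
    }
}
```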
I went with the second approach. Now, as I learned later after hours of debugging, the way the loops are structured and nested demands that the first t·n² values in the array pointed to by ts_queue hold a known garbage value. Filling them changes the complexity of the APSP problem from O(n³) to O(n⁴). The first was bad enough as a time complexity already; the second is very obviously much worse.
So I did what, apparently, no sane person prefers to do: use static table generation and fill the entirety of ts_queue with a compile-time value of -1.
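A minimal sketch of what that can look like, assuming C++17 and the fixed capacity from earlier (the function and variable names are mine, not the project's):

```cpp
#include <array>
#include <cstdint>

static constexpr uint64_t capacity = 1 << 15;

// Build the pre-filled table at compile time: every slot holds the
// sentinel -1 (0xFFFF for uint16_t), so the table ships in the binary
// already filled and no runtime fill loop is needed.
static constexpr auto make_filled_table() {
    std::array<uint16_t, capacity> table{};
    for (auto &slot : table)
        slot = static_cast<uint16_t>(-1);
    return table;
}

static constexpr auto filled_table = make_filled_table();
```

The point is that the sentinel values are computed once by the compiler and baked into the binary, instead of being written out in O(t·n²) loops at runtime.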
This was a really good exercise for me in general and made me think in ways programmers usually don't have to. There was definitely a lot to learn here.