
I have a hash table data structure that uses linear probing and is designed to be lock-free using CAS.

The hash table:

constexpr uint64_t HASH_EMPTY = 0xffffffffffffffff;

struct OnceLock {

    static const unsigned LOCK_FRESH   = 0;
    static const unsigned LOCK_WORKING = 1;
    static const unsigned LOCK_DONE    = 2;

    volatile unsigned lock;

    __device__ void init() {
        lock = LOCK_FRESH;
    } 

    // Returns true only for the single thread that wins the CAS and must fill the bucket.
    __device__ bool enter() {
        unsigned lockState = atomicCAS ( (unsigned*) &lock, LOCK_FRESH, LOCK_WORKING );
        return lockState == LOCK_FRESH;
    }

    // Publish the filled bucket: fence, then mark the lock as done.
    __device__ void done() {
        __threadfence();
        lock = LOCK_DONE;
        __threadfence();
    }

    // Spin until the winning thread has published the bucket.
    __device__ void wait() {
        while ( lock != LOCK_DONE );
    }
};
template <typename T>
struct agg_ht {
    OnceLock lock;
    uint64_t hash;
    T payload;
};


template <typename T>
__global__ void initAggHT ( agg_ht<T>* ht, int32_t num ) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num; i += blockDim.x * gridDim.x) {
        ht[i].lock.init();
        ht[i].hash = HASH_EMPTY;
    }
}


// returns candidate bucket
template <typename T>
__device__ int hashAggregateGetBucket ( agg_ht<T>* ht, int32_t ht_size, uint64_t grouphash, int& numLookups, T* payl ) {
    int location=-1;
    bool done=false;
    while ( !done ) {
        location = ( grouphash + numLookups ) % ht_size;
        agg_ht<T>& entry = ht [ location ];
        numLookups++;
        if ( entry.lock.enter() ) {
            entry.payload = *payl;
            entry.hash = grouphash;
            entry.lock.done();
        }
        entry.lock.wait();
        done = (entry.hash == grouphash);
        if ( numLookups == ht_size ) {
            printf ( "agg_ht hash table full at threadIdx %d & blockIdx %d \n", threadIdx.x, blockIdx.x );
            break;
        }
    }
    return location;
}

Then I have a minimal kernel, along with a main function, just to exercise the hash table. One important detail: the hash table is annotated with __shared__, so it is allocated in an SM's shared memory for fast access. (I did not add any input data with cudaMalloc, to keep the example minimal.)

#include <cstdint>
#include <cstdio>

/**hash table implementation**/

constexpr int HT_SIZE = 1024;

__global__ void kernel() { 
    __shared__ agg_ht<int> aht2[HT_SIZE]; 
    {
        int ht_index;
        unsigned loopVar = threadIdx.x;
        unsigned step = blockDim.x;
        while(loopVar < HT_SIZE) {
            ht_index = loopVar;
            aht2[ht_index].lock.init();
            aht2[ht_index].hash = HASH_EMPTY;
            loopVar += step;
        }
    }

    int key = 1;
    int value = threadIdx.x;

    __syncthreads();

    int bucket = -1;
    int bucketFound = 0;
    int numLookups = 0;
    while(!(bucketFound)) {
        bucket = hashAggregateGetBucket ( aht2, HT_SIZE, key, numLookups, &(value));
        int probepayl = aht2[bucket].payload;
        bucketFound = 1;
        bucketFound &= ((value == probepayl));
    }
}

int main() {
    kernel<<<1, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}

The standard way to compile it, if the file is called test.cu:

$ nvcc -G test.cu -o test

I have to say, this hash table has always given me the correct answer during concurrent insertions, even with huge inputs. However, when I ran racecheck on it, I saw errors everywhere:

$ compute-sanitizer --tool racecheck ./test
========= COMPUTE-SANITIZER
========= Error: Race reported between Write access at 0xd20 in /tmp/test.cu:61:int hashAggregateGetBucket<int>(agg_ht<T1> *, int, unsigned long, int &, T1 *)
=========     and Read access at 0xe50 in /tmp/test.cu:65:int hashAggregateGetBucket<int>(agg_ht<T1> *, int, unsigned long, int &, T1 *) [1016 hazards]
========= 
========= Error: Race reported between Write access at 0x180 in /tmp/test.cu:25:OnceLock::done()
=========     and Read access at 0xd0 in /tmp/test.cu:30:OnceLock::wait() [992 hazards]
========= 
========= Error: Race reported between Write access at 0xcb0 in /tmp/test.cu:60:int hashAggregateGetBucket<int>(agg_ht<T1> *, int, unsigned long, int &, T1 *)
=========     and Read access at 0x1070 in /tmp/test.cu:103:kernel() [508 hazards]
========= 
========= RACECHECK SUMMARY: 3 hazards displayed (3 errors, 0 warnings)

I am confused: I believe this linear-probing hash table passes my unit tests, yet it has data race hazards everywhere. I suppose those hazards are irrelevant to correctness. (?)

After a while of debugging, I still could not make the hazard errors go away. I strongly suspect the volatile is the cause. I was hoping someone might be able to shed some light on it and give me a hand to fix those annoying hazards.

I also hope this question can surface some design ideas on the topic of data structures in shared memory. While searching StackOverflow, all I found were plain raw arrays in shared memory.


1 Answer


I suppose those hazards are irrelevant to correctness. (?)

I wouldn't try to certify the "correctness" of your application or algorithm. If that is what you are looking for, please just disregard my answer.

I was hoping someone might be able to shed some light on it

A shared memory race condition occurs when one thread writes to a location in shared memory, and another thread reads from that location, and there is no intervening synchronization in the code to ensure that the write happens before the read (or perhaps, more correctly, that the written value is visible to the reading thread). This is not a careful, exhaustive definition, but it suffices for what we are dealing with here.
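As a standalone illustration of that definition (a hypothetical kernel of my own, not taken from your code), the write-read pattern below is exactly the kind of thing racecheck flags when the barrier is missing:

__global__ void raceSketch(int* out) {
    __shared__ int flag;
    if (threadIdx.x == 0)
        flag = 42;           // one thread writes a shared location
    __syncthreads();         // without this barrier, racecheck reports a write-read hazard
    out[threadIdx.x] = flag; // every thread reads that location
}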

In so far as that definition goes, you certainly have that activity in your code. One specific case that is being flagged is one thread writing here:

        entry.hash = grouphash;

and another thread reading the same location here:

    done = (entry.hash == grouphash);

Inspecting your code, we can see that there is no __syncthreads() statement between those two code positions. Furthermore, because of the loop that encompasses this activity, there is more than one hazard associated with it (there are two).

The other interaction being flagged is one thread writing to lock here:

        entry.lock.done();

and another thread reading the same lock location here:

    entry.lock.wait();

The hazards here are actually reported against other lines of code because both of these are function calls. Again, there is no intervening synchronization.

I acknowledge that, due to the looping nature of your application, I'm not sure it's necessary for "correctness" that either of these thread-to-thread communication paths gets picked up at the earliest opportunity. However, I have not studied your application carefully, nor do I intend to state anything about correctness.

and give me a hand to fix those annoying hazards.

As it happens, both of these interactions are in a small section of your code, so we can cause these 3 hazards to go away with the following additions, according to my testing:

    __syncthreads();  // add this line
    entry.lock.wait();
    done = (entry.hash == grouphash);
    __syncthreads();  // add this line

The first sync intersects the obvious write-read connections between the lines I have already indicated. The second sync is needed due to the looping nature of the code at this point.
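For context, here is the probing loop of hashAggregateGetBucket with those two additions in place; only the two marked lines are new, everything else is your original code:

    while ( !done ) {
        location = ( grouphash + numLookups ) % ht_size;
        agg_ht<T>& entry = ht [ location ];
        numLookups++;
        if ( entry.lock.enter() ) {
            entry.payload = *payl;
            entry.hash = grouphash;
            entry.lock.done();
        }
        __syncthreads();  // added: the writes above become visible before any thread reads below
        entry.lock.wait();
        done = (entry.hash == grouphash);
        __syncthreads();  // added: no thread starts the next iteration's writes until all reads are done
        if ( numLookups == ht_size ) {
            printf ( "agg_ht hash table full at threadIdx %d & blockIdx %d \n", threadIdx.x, blockIdx.x );
            break;
        }
    }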

Also note that proper usage of __syncthreads() requires that all threads in the threadblock can reach that sync point. A quick perusal of what you have here didn't suggest to me that the above additions would need any special handling, but you should confirm that and be aware of it for general application/usage. It may be that the while(!bucketFound) loop creates a situation that should be handled differently; however, compute-sanitizer --tool synccheck did not report any issues, running on a V100, with the additions I suggested here.
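(For reference, that synccheck run is invoked the same way as the racecheck command above, assuming the same ./test binary:)

$ compute-sanitizer --tool synccheck ./test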


2 Comments

Hi Robert. Thanks for the info. Since it is a hash table, we can't assume all threads in a block follow the same execution path and reach the same __syncthreads(). If the kernel is more complex and the hash table insertion is only a fraction of it, this assumption will not always hold. In that more complex case, do you also have a tip for a valid solution?
Yes, the fix is rather mechanical. 1. Don't allow any threads to exit. 2. Appropriately condition the regions where only some threads are active. 3. Don't condition the __syncthreads() statements. That way, only the threads you select do the work, but all threads participate in the __syncthreads() statements. This gives the general idea: it allows your code to have conditional character while also satisfying the requirements for __syncthreads().
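To make that comment concrete, here is a generic sketch of the pattern (the kernel and names are purely illustrative, not part of the code above): every thread stays in the loop and hits every barrier, but only the threads that still have work do any.

__global__ void conditionalWorkSketch(int* out) {
    __shared__ int notDone;                  // how many threads in the block still have work
    if (threadIdx.x == 0) notDone = blockDim.x;
    __syncthreads();

    int remaining = threadIdx.x % 4;         // stand-in for "work left for this thread"
    bool finished = false;
    while (true) {
        if (!finished) {                     // 2. condition the work, never the barrier
            if (remaining > 0) {
                --remaining;                 // ... the actual per-thread work goes here ...
            } else {
                finished = true;             // 1. don't exit the loop; just go inactive
                atomicSub(&notDone, 1);
            }
        }
        __syncthreads();                     // 3. every thread reaches every __syncthreads()
        bool allDone = (notDone == 0);       // every thread sees the same value here
        __syncthreads();                     // nobody decrements again until everyone has read
        if (allDone) break;                  // the whole block leaves the loop together
    }
    out[threadIdx.x] = remaining;            // keep the work observable
}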
