Multi-GPU programming with multiple CPU threads using ManagedCuda

Jul 4, 2015 at 1:32 AM
Hi,

I am using .NET Framework 4.5 with ManagedCuda 6.
I have two identical GPUs in my PC, and I want to parallelize across them by assigning one CPU thread to each GPU.
The code looks like this:
Task task = Task.Run(() =>
{
    int GPU_id = 0;
    _manager.TrainBLSTM(trainDir, isMFCC, PNCCmethod, learningRate, momentum, LSTMLayer, LSTMCell, TotalWeights, IsBackwardStateOn, GPU_id);
});

Task task2 = Task.Run(() =>
{
    int GPU_id = 1;
    _manager.TrainBLSTM(trainDir, isMFCC, PNCCmethod, learningRate, momentum, LSTMLayer, LSTMCell, TotalWeights, IsBackwardStateOn, GPU_id);
});
_manager.TrainBLSTM is a function that runs on the GPU. With the code above, I create two CPU threads and call this function in each thread, passing a different GPU_id (0 or 1).
Inside _manager.TrainBLSTM, I create the CUDA context from the GPU_id as follows:
CudaContext cntxt = new CudaContext(GPU_id);
But I get the following error.

ErrorInvalidHandle: This indicates that a resource handle passed to the API call was not valid. Resource handles are opaque types like CUstream and CUevent.

It seems that it is not enough to just create a CUDA context for each CPU thread with its own GPU_id; I may also need to bind the context to the current thread, or tie the context of the kernels inside _manager.TrainBLSTM to the current thread on each kernel call.

How can I solve this problem?
Coordinator
Jul 4, 2015 at 10:35 AM
Hi,

I’ll first give you some basics about CUDA, contexts and threads to make sure we’re all on the same page: a context binds a GPU device to the currently active thread (the thread which calls cuCtxCreate(...)). Only this thread has access to the context, and any handles or memory allocations created for this context are also accessible only from that initial thread. If you use handles or memory from another thread not bound to the context, you get an error like the one you mentioned.

You can change the host thread a context is bound to by calling cntxt.SetCurrent() from within any other thread of the same process. Any further CUDA interaction must then occur from this new thread. Only one thread can be bound to a context at a time, which means for two GPUs you need two separate host threads.
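To illustrate the re-binding, here is a minimal sketch (class and variable names are illustrative, not from the thread): a context created on one thread is handed over to a worker thread via SetCurrent(), after which all CUDA calls must come from that worker.

```csharp
using System.Threading;
using ManagedCuda;

class ContextHandoff
{
    static void Main()
    {
        // The context is initially bound to the creating (main) thread.
        CudaContext ctx = new CudaContext(0);

        var worker = new Thread(() =>
        {
            ctx.SetCurrent();   // re-bind the context to this worker thread
            // ... all further CUDA work (allocations, module loading,
            // kernel launches) must now happen from this thread ...
        });
        worker.Start();
        worker.Join();

        ctx.Dispose();
    }
}
```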

Now back to your C# code. You do create two contexts for your two GPUs, but you do not explicitly create two threads. Instead you’re using tasks, and tasks are scheduled on threads provided by a thread pool. The number of threads and the assignment of tasks to threads is handled automatically by .NET; it can even happen, e.g. on a single-core CPU, that the thread pool has only one thread and all tasks run consecutively on it. Long story short: by using tasks you have no control over which thread is used; the thread can change at any time, and with it the bound context. Hence the error message...

To fix this, either explicitly create two threads, one per context, which do all the work for you; this gives you the best control. Or, every time you call a CUDA function, call cntxt.SetCurrent() first (and lock the code area so that no other thread can steal the current context). As the latter method cannot guarantee that both GPUs are busy in parallel at the same time, I’d go for the first solution: use threads and not tasks.
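The first approach could be sketched like this (TrainOnGpu is a hypothetical stand-in for the per-GPU work): each dedicated host thread creates its own context and keeps all CUDA interaction inside that thread.

```csharp
using System.Threading;
using ManagedCuda;

class PerGpuThreads
{
    static void TrainOnGpu(int gpuId)
    {
        // Create the context inside the thread, so it is bound
        // to this thread from the start; no SetCurrent() needed.
        using (CudaContext ctx = new CudaContext(gpuId))
        {
            // ... load the module, allocate device memory and
            // launch kernels: all from this same thread ...
        }
    }

    static void Main()
    {
        var t0 = new Thread(() => TrainOnGpu(0));
        var t1 = new Thread(() => TrainOnGpu(1));
        t0.Start();
        t1.Start();
        t0.Join();
        t1.Join();
    }
}
```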

Cheers,
Michael
Marked as answer by saeedm on 7/7/2015 at 2:56 PM
Jul 7, 2015 at 10:57 PM
Edited Jul 15, 2015 at 3:21 AM
Thanks Michael.
Jul 15, 2015 at 3:22 AM
Edited Jul 15, 2015 at 10:22 PM
I have tried what you said. I created two separate threads and called my cuda functions two times by using two separate threads. But it still gives me the same error.
Here is my simple test.

My kernel:
#include "cuda_runtime.h"
#include <stdio.h>

extern "C"
{
    __global__ void kernel(int a, int b, int *c)
    {
        *c = a * b;
    }
}

int main()
{
    return 0;
}
my code is:
using System;
using System.Threading.Tasks;
using ManagedCuda;
using ManagedCuda.BasicTypes;



class Program
{
    static CudaKernel multiplyWithCuda;

    static int  MultiplyWithCuda(int GPU_id,int a, int b)
    {
        CudaContext cntxt = new CudaContext(GPU_id);
        CUmodule cumodule = cntxt.LoadModule(@"D:\test\kernel.cu\kernel.cu\Debug\kernel.ptx");
        multiplyWithCuda = new CudaKernel("kernel", cumodule, cntxt);


        CudaDeviceVariable<int> d_result = new CudaDeviceVariable<int>(1);
        int result_host = 0;
        //cntxt.SetCurrent();
        multiplyWithCuda.Run(a, b, d_result.DevicePointer);
        d_result.CopyToHost(ref result_host);
        Console.WriteLine(result_host);
        Console.ReadKey();
        return result_host;
    }

    

    static void Main(string[] args)
    {
        var result1 = 0;
        var result2 = 0;
        int gpu_id1 = 0;
        int gpu_id2 = 1;
        var thread1 = new System.Threading.Thread(() => result1 = MultiplyWithCuda(gpu_id1, 3, 10));
        thread1.Start();
        var thread2 = new System.Threading.Thread(() => result2 =  MultiplyWithCuda(gpu_id2, 8, 9));
        thread2.Start();
        
    }
}
I see that this sometimes works, but most of the time it gives me the same error.
I also tried cntxt.SetCurrent() (commented out above), but it did not help.

How can I solve this? Can you give me an example?

Thanks
Coordinator
Jul 15, 2015 at 12:02 PM
You share the same CudaKernel in both threads. Don't define
static CudaKernel multiplyWithCuda; 
as a static member shared by all threads; define it locally inside MultiplyWithCuda().

Michael
Marked as answer by saeedm on 7/15/2015 at 2:21 PM
Jul 15, 2015 at 10:21 PM
Edited Jul 16, 2015 at 8:52 PM
Thanks so much Michael. It solved my problem. You are great.
Jul 16, 2015 at 8:52 PM
Edited Jul 17, 2015 at 12:51 AM
I have one more question. So far I have been able to use two GPUs for parallel processing.
Now I am thinking of creating more than two threads (depending on the number of available CPU cores).
For example, if I have 4 CPU cores available, I create 4 separate threads on the host.
Then I want to run 2 threads on one GPU and the other 2 threads on the other GPU,
so the 2 threads on each GPU would run sequentially.

Here what I did for my simple test.
using System;
using System.Threading.Tasks;
using ManagedCuda;
using ManagedCuda.BasicTypes;



class Program
{
   

    static int  AddWithCuda(int GPU_id,int a, int b)
    {
        CudaKernel addWithCuda;
        CudaContext cntxt = new CudaContext(GPU_id);
        CUmodule cumodule = cntxt.LoadModule(@"D:\test\kernel.cu\kernel.cu\Debug\kernel.ptx");
        addWithCuda = new CudaKernel("kernel", cumodule, cntxt);


        CudaDeviceVariable<int> d_result = new CudaDeviceVariable<int>(1);
        int result_host = 0;
        //cntxt.SetCurrent();
        addWithCuda.Run(a, b, d_result.DevicePointer);
        d_result.CopyToHost(ref result_host);
        Console.WriteLine(result_host);
        Console.ReadKey();
        return result_host;
    }

    

    static void Main(string[] args)
    {
        var result1 = 0;
        var result2 = 0;
        int gpu_id1 = 0;
        int gpu_id2 = 1;
        var thread1 = new System.Threading.Thread(() => result1 = AddWithCuda(gpu_id1, 3, 8));
        thread1.Start();
        var thread2 = new System.Threading.Thread(() => result2 =  AddWithCuda(gpu_id2, 8, 9));
        thread2.Start();
        var thread3 = new System.Threading.Thread(() => result2 = AddWithCuda(gpu_id2, 18, 2));
        thread3.Start();
        var thread4 = new System.Threading.Thread(() => result1 = AddWithCuda(gpu_id1, 13,6));
        thread4.Start();
        
    }
}
As you can see in my code, I did not use cntxt.SetCurrent(), and it seems to work with no error.
Do you think this is okay, and that there would not be any problem in a real project if I do multi-GPU processing as simply as that?
I know you said I should use cntxt.SetCurrent() and also lock the code area so that no other thread can steal the current context, but it seems to work without that. Am I missing something?
Also, I am not sure how I can lock the code area so that no other thread can steal the current context.
(Please give me an example if possible.)

Please also note that this is not my real project, just a simple test. I am training a neural network
on a huge data set, so the data transferred to the GPU can be huge (I am not sure whether this needs
extra consideration for the multi-GPU processing or not).

Thanks,
Saeed
Coordinator
Jul 16, 2015 at 11:31 PM
You can of course create multiple contexts for one GPU (just think of multiple CUDA applications running on the same computer). You can also create multiple contexts from within the same host thread, but only one context can be “active” on a thread at a time. If you have multiple host threads, each with a context of its own, then depending on the GPU's capabilities, multiple contexts can also execute code at the same time on the same GPU.

So what you’re currently doing is totally fine. There are just a few things to think about, and I have no thorough knowledge of all the details, but:
  • Is there an upper limit on the number of contexts per GPU? I’m not sure about that…
  • Do multiple contexts per GPU slow each other down? I guess so…
  • Can’t you break your problem down to a cleaner solution, i.e. the task per GPU is well defined and does not overlap with other tasks on the CPU, etc.? The best way to go is definitely one context per GPU, independent of the number of host threads doing other work...
As in your example each context is defined only in the scope of one thread (no other thread can actually steal it), you also need no locking. To answer your question: there are usually multiple ways to make a code segment thread safe. Probably the easiest is C#’s lock keyword, using the context as the locking object.
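The locking idea could look like the following sketch (the wrapper class and kernel are illustrative; this is only needed when a single context is actually shared between several host threads, which is not the case in the test above).

```csharp
using ManagedCuda;

class SharedContextWorker
{
    // One context and kernel shared by several host threads.
    readonly CudaContext _ctx;
    readonly CudaKernel _kernel;

    public SharedContextWorker(CudaContext ctx, CudaKernel kernel)
    {
        _ctx = ctx;
        _kernel = kernel;
    }

    public int Run(int a, int b)
    {
        // Use the context itself as the locking object, so only one
        // thread at a time can bind it and issue CUDA calls.
        lock (_ctx)
        {
            _ctx.SetCurrent();   // bind the context to the calling thread
            var dResult = new CudaDeviceVariable<int>(1);
            _kernel.Run(a, b, dResult.DevicePointer);
            int result = 0;
            dResult.CopyToHost(ref result);
            dResult.Dispose();
            return result;
        }
    }
}
```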
Marked as answer by saeedm on 7/16/2015 at 4:21 PM
Jul 17, 2015 at 12:22 AM
Thanks for your quick reply. You are awesome Michael.