Q on passing C# type arrays to CUDA kernels via Managed CUDA

Feb 12, 2014 at 3:39 PM
Edited Feb 12, 2014 at 5:59 PM

I'm having great difficulty testing Managed CUDA. C and C++ code work flawlessly, but I cannot say the same for Managed CUDA and C#, and the sheer lack of examples makes this very frustrating. Consider this example...

I have the following 2 kernel code functions:
__global__ void squareIntKernel(const int a, const int b, int *c)
{
    *c = (a + b)*(a + b);
}

__global__ void squareIntArrayKernel(const int* a, const int* b, int* c)
{
    int i = threadIdx.x;
    c[i] = (a[i] + b[i])*(a[i] + b[i]);
}
In both cases I am simply attempting to compute (1 + 2)^2, which should output 9; that's it. In the first case I pass C# Int32 values; in the second case I pass C# Int32 arrays of one element each. The goal is to get the exact same value from both functions, since the inputs are identical. My C# arrays are as follows:
int[] a = new int[] {1}; 
int[] b = new int[] {a[0] + 1};
My first C# function that calls the "squareIntKernel" function is:
        static public Func<int, int, int> cudaSquareInt = (a, b) =>
        {
            if (squareIntWithCuda == null)
                InitKernels();

            // init output parameters
            CudaDeviceVariable<int> result_dev = 0;
            int result_host = 0;

            // run CUDA method
            squareIntWithCuda.Run(a, b, result_dev.DevicePointer);

            // copy return to host
            result_dev.CopyToHost(ref result_host);

            return result_host;
        };
In my little test app, I use a button to execute the function call:
        private void brnIntTest_Click(object sender, EventArgs e)
        {
            int[] val1 = new int[] {Convert.ToInt32(mTxtIntTest.Text)};
            int[] val2 = new int[] {(val1[0] + 1)};

            AppendLine(this.txtOutput, String.Format("{0} using cudaSquareInt - INPUT:  ({1} + {2})^2", this.lblIntTest.Text, val1[0], val2[0]));

            int[] val3 = new int[] {CudaClass.cudaSquareInt(val1[0], val2[0])}; 

            AppendLine(this.txtOutput, String.Format("Result:  {0}", val3[0]));
       }

...which results in the following output lines:
<int> Squares Test using cudaSquareInt - INPUT:  (1 + 2)^2
Result:  9
All good so far, but I need to run real parallel processing with much larger data sets, which means using arrays; hence my second C# function, which calls the "squareIntArrayKernel" function:
        static public Func<int[], int[], int[]> cudaSquareIntArray = (a, b) =>
        {
            if (squareIntArrayWithCuda == null)
                InitKernels();

            // init output parameters
            CudaDeviceVariable<int> result_dev = new CudaDeviceVariable<int>(a.Length);
            int[] result_host = new int[a.Length];
            
            //Set grid and block dimensions                       
            squareIntArrayWithCuda.GridDimensions = new dim3(8, 1, 1);
            squareIntArrayWithCuda.BlockDimensions = new dim3(512, 1, 1);

            // run CUDA method
            squareIntArrayWithCuda.Run(a, b, result_dev.DevicePointer);

            // copy return to host
            result_dev.CopyToHost(result_host);

            return result_host;
        };
The above compiles fine but fails to execute, generating the following error in the CudaEvent.Synchronize() method:
ErrorUnknown: This indicates that an unknown internal error has occurred.
Following another example, I modified my C# function code to create and use device variables as follows:
        static public Func<int[], int[], int[]> cudaSquareIntArray = (a, b) =>
        {
            if (squareIntArrayWithCuda == null)
                InitKernels();

            //- Copy input parameters to device
            CudaDeviceVariable<int> a_dev = new CudaDeviceVariable<int>(a.Length);
            a_dev.CopyToDevice(a);
            CudaDeviceVariable<int> b_dev = new CudaDeviceVariable<int>(b.Length);
            a_dev.CopyToDevice(b);

            // init output parameters
            CudaDeviceVariable<int> result_dev = new CudaDeviceVariable<int>(a.Length);
            int[] result_host = new int[a.Length];
            
            //Set grid and block dimensions                       
            squareIntArrayWithCuda.GridDimensions = new dim3(8, 1, 1);
            squareIntArrayWithCuda.BlockDimensions = new dim3(512, 1, 1);

            // run CUDA method
            squareIntArrayWithCuda.Run(a_dev.DevicePointer, b_dev.DevicePointer, result_dev.DevicePointer);

            // copy return to host
            result_dev.CopyToHost(result_host);

            return result_host;
        };
This code also compiles, and this time executes without error -- a step in the right direction. I then run the test again from a button click event, this time calling the "cudaSquareIntArray" function:
        private void brnIntTest_Click(object sender, EventArgs e)
        {
            int[] val1 = new int[] {Convert.ToInt32(mTxtIntTest.Text)};
            int[] val2 = new int[] {(val1[0] + 1)};

            AppendLine(this.txtOutput, String.Format("{0} using cudaSquareIntArray - INPUT:  ({1} + {2})^2", this.lblIntTest.Text, val1[0], val2[0]));

            int[] val3 = CudaClass.cudaSquareIntArray(val1, val2); 

            AppendLine(this.txtOutput, String.Format("Result:  {0}", val3[0]));
        }
Only now I get bad results. My first run produces this output:
<int> Squares Test  using cudaSquareIntArray - INPUT:  (1 + 2)^2
Result:  4
...my second press of the button yields this:
<int> Squares Test  using cudaSquareIntArray - INPUT:  (1 + 2)^2
Result:  1052676
...and my third press of the button yields:
<int> Squares Test  using cudaSquareIntArray - INPUT:  (1 + 2)^2
Result:  252446724
At this point, I really do not understand what Managed CUDA is doing and how to utilize it within a C# program. Can anyone tell me what I am doing wrong here, AND provide a functional example of how to pass a C# type array into a CUDA kernel for execution using Managed CUDA?
Feb 12, 2014 at 5:45 PM
Edited Feb 12, 2014 at 6:10 PM
Additionally, in case this has some bearing on my issue, I am developing on a 64-bit Windows 7 OS.

The following is the debug log from the first run that gives a bad result of 4:
2/12/2014 2:09:52 PM, cuDeviceGetCount: ErrorNotInitialized
2/12/2014 2:09:52 PM, cuInit: Success
2/12/2014 2:09:52 PM, cuDeviceGetCount: Success
2/12/2014 2:09:52 PM, cuDeviceGet: Success
2/12/2014 2:09:52 PM, cuCtxCreate: Success
2/12/2014 2:09:52 PM, cuModuleLoad: Success
2/12/2014 2:09:52 PM, cuModuleGetFunction: Success, Kernel: _Z15squareIntKerneliiPi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z15squareIntKerneliiPi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z15squareIntKerneliiPi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z15squareIntKerneliiPi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z15squareIntKerneliiPi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z15squareIntKerneliiPi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z15squareIntKerneliiPi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z15squareIntKerneliiPi
2/12/2014 2:09:52 PM, cuModuleGetFunction: Success, Kernel: _Z20squareIntArrayKernelPKiS0_Pi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z20squareIntArrayKernelPKiS0_Pi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z20squareIntArrayKernelPKiS0_Pi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z20squareIntArrayKernelPKiS0_Pi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z20squareIntArrayKernelPKiS0_Pi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z20squareIntArrayKernelPKiS0_Pi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z20squareIntArrayKernelPKiS0_Pi
2/12/2014 2:09:52 PM, cuFuncGetAttribute: Success, Kernel: _Z20squareIntArrayKernelPKiS0_Pi
2/12/2014 2:09:52 PM, cuMemAlloc: Success
2/12/2014 2:09:52 PM, cuMemcpyHtoD: Success
2/12/2014 2:09:52 PM, cuMemAlloc: Success
2/12/2014 2:09:52 PM, cuMemcpyHtoD: Success
2/12/2014 2:09:52 PM, cuMemAlloc: Success
2/12/2014 2:09:52 PM, cuCtxSynchronize: Success, Kernel: _Z20squareIntArrayKernelPKiS0_Pi
2/12/2014 2:09:52 PM, cuEventCreate: Success
2/12/2014 2:09:52 PM, cuEventCreate: Success
2/12/2014 2:09:52 PM, cuEventRecord: Success
2/12/2014 2:09:52 PM, cuLaunchKernel: Success, Kernel: _Z20squareIntArrayKernelPKiS0_Pi
2/12/2014 2:09:52 PM, cuEventRecord: Success
2/12/2014 2:09:52 PM, cuEventSynchronize: Success
2/12/2014 2:09:52 PM, cuEventElapsedTime: Success
2/12/2014 2:09:52 PM, cuEventDestroy: Success
2/12/2014 2:09:52 PM, cuEventDestroy: Success
2/12/2014 2:09:52 PM, cuMemcpyDtoH: Success
Coordinator
Feb 12, 2014 at 9:21 PM
Hi Krazee,

first of all, have you seen the samples package in the download area? (Downloads). These are ports to C# from some samples in the original CUDA SDK and should give a basic idea on how things work.

In your code I can find two things going wrong:
1) You pass host pointers (a C# int[]) to the kernel, hence the error message (internal error...). Copy your data to a CudaDeviceVariable<int> first and pass the DevicePointer.
2) You launch the kernel with a block size of 512 and use threadIdx.x as the array index, but your array has only one element -> you read/write outside of the array bounds. I'm actually surprised this runs at all. And since your grid is of size 8, you perform the operation 8 times, hence the wrong values.

If you fix these two issues, everything should run as expected.

Cheers,
Michael
Feb 14, 2014 at 6:31 PM
Michael,

Thanks much for the response to my question, and especially for the link to the samples package. Sorry about the late reply, but a snow storm here on the eastern U.S. seaboard had me house-bound, and apparently a Remote Desktop connection over VPN does not count as a CUDA-capable device on the host system; hence I wasn't able to test much yesterday.

Regarding your comments...

1) Glad to know I was on the right path by adjusting my code to use the CudaDeviceVariable class; considering the results I was getting it is indeed a relief to know that this is the correct approach.
2) If possible, could you explain further how the block and grid sizes could be causing my issue? Read on first, because I did get my code to work, but the grid/block sizes had zero impact -- that is, once I found the actual bug, the code executed successfully regardless of the block/grid settings (I tried everything from 1/1 up to 512/8). I ask only to gain more perspective on CUDA processing.

Here was the solution:

After much testing the fix was two-fold:

First, my code had a bug which was:
            //- Original Code
            CudaDeviceVariable<int> a_dev = new CudaDeviceVariable<int>(a.Length);
            a_dev.CopyToDevice(a);
            CudaDeviceVariable<int> b_dev = new CudaDeviceVariable<int>(b.Length);
            a_dev.CopyToDevice(b);
The b_dev variable was not getting initialized at all.
            //- Corrected Code
            CudaDeviceVariable<int> a_dev = new CudaDeviceVariable<int>(a.Length);
            a_dev.CopyToDevice(a);
            CudaDeviceVariable<int> b_dev = new CudaDeviceVariable<int>(b.Length);
            b_dev.CopyToDevice(b);  //<--- initialize the b_dev variable
Second, once the above code was working, I always got the same result on all consecutive button presses, regardless of whether I changed the input values between presses; that is, the initial result was the only result I could ever obtain. To fix this, I had to add the following lines to my "cudaSquareIntArray" function:
            // copy return to host
            result_dev.CopyToHost(result_host);

            a_dev.Dispose();  //<-- Must dispose a_dev before returning 
            b_dev.Dispose();  //<-- Must dispose b_dev before returning

            return result_host;
I'm going to review nVidia's CUDA documentation to understand a bit more about the block size and grid size settings as I'm unclear how they would possibly have an impact (or in fact, a total lack of impact) on my test code.

Again, thanks for your timely response... it was certainly very helpful and appreciated.

Regards,
~Kevin
Coordinator
Feb 14, 2014 at 8:31 PM
Hi Kevin,

I won't try to explain the idea of grids and blocks in Cuda, many people already did this in much more detail and probably better than what I could write here in a few words. Just two links to some material from Nvidia that should give an idea: An Easy Introduction to CUDA C and C++ / Introduction to CUDA C.

What is missing in your code is the adaptation of your grid and block sizes to the size of your data arrays. How and why you should do this in a specific way is described in the links above (and many others). For an array of size 1, the block and grid sizes should both be (1,1,1), i.e. one thread in one block.

Regarding your second thought using
a_dev.Dispose();  //<-- Must dispose a_dev before returning 
b_dev.Dispose();  //<-- Must dispose b_dev before returning
This is half right and half wrong. Assuming your array still has size one, if your block size is larger than one thread, you read and write outside of allocated memory, which results in unpredictable behavior. Calling a_dev.Dispose() frees the device memory and may also clean up the mess the kernel made before, which is why it then seems to run fine. But you are curing symptoms rather than fixing the cause. (Cuda always allocates larger data chunks in order to get properly aligned starting addresses. If you allocate 4 bytes (= one Int32), Cuda will actually reserve something like 512 bytes (I'm not sure about the exact number), which is why your kernel doesn't throw an exception.)
While almost all managedCuda classes implement the IDisposable interface, it is the programmer's duty to free unmanaged resources by calling Dispose() at the right time. That is, if you allocate device memory by creating a CudaDeviceVariable instance, you must free that memory manually by calling Dispose(), just like malloc() and free() in plain old C; device memory is not garbage collected in managedCuda!
So you were right to call Dispose() (and don't forget to also free result_dev), but this is not the reason your sample does or does not work.

Finally, GPUs are parallel hardware, so don't try to understand things with an array of size 1; play with larger chunks of data, like 10,000 elements. Doing so, you will see how blocks and grids have an impact on your results...

Best
Michael