This project is read-only.

Can't seem to get a basic vector addition kernel to work.

Dec 24, 2013 at 3:41 PM
Hi,

I've just been having a play around with managed cuda, and can't seem to get it to work for me. I have the following kernel (taken pretty much straight of a blog somewhere
#define _SIZE_T_DEFINED
#ifndef __CUDACC__
#define __CUDACC__
#endif
#ifndef __cplusplus
#define __cplusplus
#endif


#include <cuda.h>
#include <device_launch_parameters.h>
#include <texture_fetch_functions.h>
#include "float.h"
#include <builtin_types.h>
#include <vector_functions.h>

extern "C"
{
    __global__ void add_vectors(int* a, int* b, int* out, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        printf("i = %d", i);
        if (i < N)
            out[i] = a[i] + b[i];
    }
}
and the following code that executes it
        static CudaKernel _addWithCuda;
        const int VectorSize = 5120;
        const int ThreadsPerBlock = 256;

            var cntxt = new CudaContext();
            var cumodule =
                cntxt.LoadModule(@"kernel.ptx");
            _addWithCuda = new CudaKernel("add_vectors", cumodule, cntxt)
                {
                    BlockDimensions = ThreadsPerBlock,
                    GridDimensions = VectorSize/ThreadsPerBlock + 1
                };

            var vectorA = Enumerable.Range(1, VectorSize).ToArray();
            var vectorB = Enumerable.Range(1, VectorSize).ToArray();

            // init parameters
            var vectorDeviceA = new CudaDeviceVariable<int>(VectorSize);
            vectorDeviceA.CopyToDevice(vectorA);
            var vectorDeviceB = new CudaDeviceVariable<int>(VectorSize);
            vectorDeviceB.CopyToDevice(vectorB);
            var vectorDeviceOut = new CudaDeviceVariable<int>(VectorSize);

            // run cuda method
            _addWithCuda.Run(vectorDeviceA.DevicePointer, vectorDeviceB.DevicePointer, vectorDeviceOut.DevicePointer, VectorSize);
            
            // copy return to host
            var output = new int[VectorSize];
            vectorDeviceOut.CopyToHost(output);

            for (var i = 0; i < VectorSize; i++)
            {
                Console.WriteLine("{0}+{1}={2}", vectorA[i], vectorB[i], output[i]);
            }
            cntxt.FreeMemory(vectorDeviceA.DevicePointer);
            cntxt.FreeMemory(vectorDeviceB.DevicePointer);
            cntxt.FreeMemory(vectorDeviceOut.DevicePointer);

            cntxt.UnloadKernel(_addWithCuda);
            cntxt.UnloadModule(cumodule);

            Console.ReadKey();
Yet when I run it, it complains when it hits the _addWithCuda.Run() command.

The error is as follows:
ErrorLaunchFailed: An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.

When I run it with NSight, I get the following output:
CUDA context created : 0222c950
CUDA context created : 02a1c950
CUDA context created : 02e4c950
CUDA context created : 030cc950
CUDA context created : 032fc950
CUDA context created : 0280c950
CUDA module loaded:   03741ba0 C:\Users\Paul\Documents\Visual Studio 2013\Projects\TestCuda\ManagedCudaKernels\Debug\kernel.ptx
CUDA Debugger detected HW exception on 2 warps.  First warp:
blockIdx = {0,0,0}
threadIdx = {192,0,0}
Exception = Out of range Address
PC = 0x0004e2b0
FunctionRelativePC = 0x00000130


Nsight Debug
Changes to the Memory Checker will take effect when you restart the program being debugged.
CUDA grid launch failed: CUcontext: 41994576 CUmodule: 57940896 Function: _Z6kernelPiS_S_i
CUDA context created : 01f0c950
CUDA module loaded:   03820c70 C:\Users\Paul\Documents\Visual Studio 2013\Projects\TestCuda\ManagedCudaKernels\Debug\kernel.ptx
================================================================================
CUDA Memory Checker detected 512 threads caused an access violation:
Launch Parameters
    CUcontext    = 01f0c950
    CUstream     = 036c8e50
    CUmodule     = 03820c70
    CUfunction   = 037f1ea0
    FunctionName = _Z6kernelPiS_S_i
    GridId       = 1
    gridDim      = {21,1,1}
    blockDim     = {256,1,1}
    sharedSize   = 256
    Parameters:
        a = 0x00440000  ???
        b = 0x00445000  ???
        out = 0x0044a000  ???
        N = 5120
    Parameters (raw):
         0x00440000 0x00445000 0x0044a000 0x00001400
GPU State:
   Address  Size      Type  Mem       Block  Thread         blockIdx  threadIdx      PC  Source
-----------------------------------------------------------------------------------------------
  00440000     4    adr ld    g           0       0          {0,0,0}    {0,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440004     4    adr ld    g           0       1          {0,0,0}    {1,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440008     4    adr ld    g           0       2          {0,0,0}    {2,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  0044000c     4    adr ld    g           0       3          {0,0,0}    {3,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440010     4    adr ld    g           0       4          {0,0,0}    {4,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440014     4    adr ld    g           0       5          {0,0,0}    {5,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440018     4    adr ld    g           0       6          {0,0,0}    {6,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  0044001c     4    adr ld    g           0       7          {0,0,0}    {7,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440020     4    adr ld    g           0       8          {0,0,0}    {8,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440024     4    adr ld    g           0       9          {0,0,0}    {9,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440028     4    adr ld    g           0      10          {0,0,0}   {10,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  0044002c     4    adr ld    g           0      11          {0,0,0}   {11,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440030     4    adr ld    g           0      12          {0,0,0}   {12,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440034     4    adr ld    g           0      13          {0,0,0}   {13,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440038     4    adr ld    g           0      14          {0,0,0}   {14,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  0044003c     4    adr ld    g           0      15          {0,0,0}   {15,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440040     4    adr ld    g           0      16          {0,0,0}   {16,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440044     4    adr ld    g           0      17          {0,0,0}   {17,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440048     4    adr ld    g           0      18          {0,0,0}   {18,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  0044004c     4    adr ld    g           0      19          {0,0,0}   {19,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440050     4    adr ld    g           0      20          {0,0,0}   {20,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440054     4    adr ld    g           0      21          {0,0,0}   {21,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1
  00440058     4    adr ld    g           0      22          {0,0,0}   {22,0,0}  000188  c:\users\paul\documents\visual studio 2013\projects\testcuda\managedcudakernels\kernel.cu:1

and so on.
Any help would be appreciated.
Cheers
Dec 25, 2013 at 6:43 PM
Hi,

I don't have a GPU here with me to test, but I would remove the printf inside the kernel for all 5376 threads. Not sure what happens if too many threads use printf at the same time...

And by the way, this has nothing to do with your current problem but don't use
cntxt.FreeMemory(vectorDeviceA.DevicePointer);
cntxt.FreeMemory(vectorDeviceB.DevicePointer);
cntxt.FreeMemory(vectorDeviceOut.DevicePointer);
for managed objects, call their corresponding Dispose()-method, to make sure the GC knows that these objects are now invalid. The FreeMemory-methods in CudaContext are only ment to be used for pure CUdeviceptr allocated outside the managed world.

Best,
Michael