This project is read-only.

About managedCuda

ManagedCuda provides an intuitive access to the Cuda driver API for any .net language. It is kind of an equivalent to the runtime API (= a comfortable wrapper of the driver API for C/C++) but written entirely in C# for .net. In contrast to the runtime API, managedCUDA takes a different approach to represent CUDA specifics: managedCuda is object oriented. In general you can find C# classes for each Cuda handle in the driver API. For example, instead of a handle CUContext, managedCUDA provides a CudaContext class. This design allows an intuitive and simple access to all API calls by providing correspondent methods per class. A good example for this wrapping approach is a device variable. In the original Cuda driver API those are given by standard C pointers. In managedCuda these are represented by the class Cuda[Pitched]DeviceVariable<T>. It is a generic class allowing type safe and object oriented access to the Cuda driver API. As a CudaDeviceVariable instance knows about its wrapped data type, array sizes, dimensions and eventually a memory alignment pitch, a simple call to CopyToHost(“hostArray”) is enough. The user doesn’t need to handle the entire C like function arguments, this is all done automatically. Further managedCuda provides specific exceptions in case something goes wrong, i.e. you don’t need to check API call return values, you only need to catch the CudaException just as any other exception.

But still, as a developer using managedCuda you need to know Cuda. You must know how to use contexts, set kernel launch grid configurations etc.

I will shortly describe in the following the main classes used to implement a fully functional Cuda application in C#:

The CudaContext class: This is one of the three main classes and represents a Cuda context. From Cuda 4.0 on, the Cuda API demands (at least) one context per process per device. So for each device you want to use, you need to create a CudaContext instance. In the different constructors you can define several properties, e.g. the deviceID to use. As nearly all managedCuda classes, CudaContext implements IDisposable and the wrapped Cuda context is valid until Dispose() is called. Further CudaContext defines a bunch of static methods to retrieve general information about (possible) Cuda devices. Important for multi threaded applications: In order to use any cuda object related to a context, you must activate the cudaContext by calling the SetCurrent() method from the current thread. This holds for all thread switches. (See the Cuda programming guide for more information).

CudaKernel: Cuda kernels are load from cubin or ptx files. You can load a kernel using the LoadKernel…() methods of a CudaContext using a byte array representation of the kernel file (e.g. an embedded resource) or by specifying the file name where the kernel is stored. Further you need the kernel name as defined in the source *.cu file. The LoadKernel methods return a CudaKernel object bound to the given context. CudaKernel does not implement IDisposable, as the kernels are automatically destroyed as soon as the corresponding context is destroyed.

CudaDeviceVariable and its variations: A CudaDeviceVariable object represents allocated memory on the device. The class knows about the exact memory layout (as array length, array dimension, memory pitch, etc.). As the class is a generic, it also knows about its type and type size. All this simplifies dramatically any data copying as no size parameters are needed. Only the source or destination array must be defined (either a default C# host array or another device variable). Device memory is freed as soon as the CudaDeviceVariable object is disposed.

With these three main classes one can create an entire Cuda accelerated application in C# using only very few code lines.

Other managedCuda classes:

CudaPagelockedHostMemory: In order to use asynchron copy methods (host to device or device to host) the host array must be allocated as pinned or page-locked memory. To realize this, CudaPagelockedHostMemory[2D,3D] allocates the memory using cuda’s cuMemHostAlloc. To simplify access per element, the class provides an index property to get or set single values. When implementing large datasets you must know that each single per element access trespasses the managed/unmanaged memory barrier and must be marshaled. Access is therefore not really fast. To handle large amount of data, a copy of a managed array to the unmanaged memory in one block would be faster.

CudaPagelockedHostMemory_[Type]: As the previous approach using generics and marshalling was not satisfying in terms of speed and direct pointer arithmetic with generics is not possible in C#, I tried something new, what I would call "templates with C#" using T4: A T4 template creates all possible variants like 'float', 'int4', etc. which then access memory directly via pointers. The achieved performance of this approach is close to native arrays. In case you want to use CudaPagelockedHostMemory with your own datatypes, simply copy the tt-file to your project and modify the list of types to process (but be aware of the license: managedCUDA is LGPL!).

CudaManagedMemory_[Type]: Using the same approach as for page locked memory, CudaManagedMemory gives access to the full feature set of managed memory introduced with Cuda 6.5 in .net.

CudaRegisteredHostMemory: In C++, registered host memory is normally allocated memory but with registration it gets usable for asynchron copies. But in the .net world this doesn’t work as expected: Also CudaRegisteredHostMemory is part of ManagedCUDA it shouldn’t be used. Use CudaPagelockedHostMemory instead.

CudaArray[1D,2D,3D]: Represents a CUArray. Either you specify an already existing CUArray as storage location, e.g. from graphics interop, or a new CUArray is created internally. Only if the inner CUArray was allocated by the constructor, it will be freed while disposing.

CudaTextureFoo: Represents a Cuda texture reference. The device memory to bind this texture to can either be created internally by the constructor or passed as an argument. Only if memory is allocated by the constructor it will be freed while disposing.

GraphicsInterop: Several graphics interop resource classes exist, one for every graphics API (DirectX or OpenGL). All these resources must be registered and can be mapped to cuda variables, cuda textures or cuda arrays, depending on their type. For efficient mapping, all resources can be grouped in a CudaGraphicsInteropResourceCollection, so that one single Map() call is enough to finish the task. Have a look at the sample applications to see how to use the collection.


Additional libraries:

  • CudaFFT: Managed access to cufft*.dll
  • CudaRand: Managed access to curand*.dll
  • CudaSparse: Managed access to cusparse*.dll
  • CudaBlas: Managed access to cublas*.dll
  • CudaSolve: Managed access to cusolve*.dll
  • NPP: Managed access to npp*.dll
  • NVRTC: Managed access to nvrtc*.dll

All libraries have in common that they compile either to 32 or 64 bit in order to handle different wrapped dll names for 32 or 64 bit. They include a basic representation called *NativeMethods to call directly the API functions and wrap handles with C# classes.

CudaBitmapSource is a simple try to use Cuda device memory as a BitmapSource in WPF. It is more like a proof of concept than a ready to use library, especially the fact that BitmapSource is a sealed class makes a proper implementation difficult. If you have ideas for improvements or a better design, please let me know ;-)

How To: Setup a C# Cuda project using Visual Studio 2010 (Solution 1):

(My Visual Studio is a German edition, some “translated” menu entries might therefor differ slightly from the original English menu entries.)

You need: Microsoft Visual Studio 201x, Nvidia Cuda Toolkit 7.0, Nvidia Parallel Nsight 4.0 for debugging and of course managedCuda.

  • Create a normal C# project (library, WinForms, WPF, etc.).
  • Add a new CudaRuntime 7.0 project to the solution.
  • Delete the Cuda sample code. To enable proper IntelliSense functionality you need to include the following header files to your *.cu file (from toolkit-include folder):
    #include <cuda.h>
    #include <device_launch_parameters.h>
    #include <texture_fetch_functions.h>
    #include <builtin_types.h>
    #include <vector_functions.h>
    #include “float.h”
  • Also add the following defines:
    #define _SIZE_T_DEFINED
    #ifndef __CUDACC__
    #define __CUDACC__
    #ifndef __cplusplus
    #define __cplusplus
  • Write your kernel code in an “extern C{}” scope:
    //Includes for IntelliSense 
    #define _SIZE_T_DEFINED
    #ifndef __CUDACC__
    #define __CUDACC__
    #ifndef __cplusplus
    #define __cplusplus

    #include <cuda.h> #include <device_launch_parameters.h> #include <texture_fetch_functions.h>
    #include "float.h" #include <builtin_types.h>
    #include <vector_functions.h>

    // Texture reference
    texture<float2, 2> texref;
    extern "C"  
    { //kernel code __global__ void kernel(/* parameters */) { } }
  • You can also omit ‘extern “C”’ in order to use templated kernels. But then kernel names get mangled (“_Z18GMMReductionKernelILi4ELb1EEviPfiPK6uchar4iPhiiiPj” instead of  “GMMReductionKernel”, to look up the right mangled name open the compiled ptx file with a text editor). To load a kernel you need the full mangled name.
  • Change the following project properties of the CudaRuntime 7.0 project:
    • General:
      Output directory: Set it to the source file directory of the C# project
      Application type: help application. This avoids a call to the VisualC++ compiler, no C++ output will be created.
    • CUDA C/C++:
      Compiler Output: $(OutDir)%(FileName).ptx or .cubin
      NVCC Compilation Type: “Generate .ptx file (-ptx)” or “Generate .cubin file (-cubin)” respectively
    • You need to set these properties for all possible targets and configurations (x86/x64, Debug/Release). To handle mixed mode platform kernels, give a different kernel name for x86 and x64, for example $(OutDir)%(FileName)_x86.ptx and $(OutDir)%(FileName)_x64.ptx.
  • Delete the post build event: We don’t need the CUDA runtime libraries copied.
  • Build the Cuda project once for each platform.
  • In the C# project, add the newly build kernel files in the C# project source directory to the project. Set the file properties either to embedded resource (access files by stream (byte[]) when loading kernel images) or set “copy to output directory” to “always” and load the kernel image from file.
  • Add a reference to the managedCuda assembly.


How To: Setup a C# Cuda project using Visual Studio 2010 (Solution 2 from Brian Jimdar)

Using pre-build events:

In the project properties-page of your C# project, add the following pre-build event:

call "%VS100COMNTOOLS%vsvars32.bat"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_11 -m 64 -o "$(ProjectDir)PTX\%%~na_64.ptx" "$(ProjectDir)Kernels\"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_11 -m 32 -o "$(ProjectDir)PTX\%%~na.ptx" "$(ProjectDir)Kernels\"

This builds a x86 and x64 version of each file in the .\Kernels directory, outputs it to the .\PTX directory.

Last edited Apr 6, 2015 at 6:10 PM by kunzmi, version 12


No comments yet.