Reduction - Interface with thrust?

Oct 7, 2013 at 8:38 AM
I need to perform an efficient reduction. Is this possible using managed cuda?

My intuition was to use thrust, but as thrust cannot be called from inside a kernel, I cannot see how it can be done using managed cuda.

If thrust is not available using managed cuda, are there any other ways to perform an efficient reduction? I know I can just implement a kernel myself, but as reduction is a very general problem, I would rather rely on well-tested library code than write my own implementation.
Oct 7, 2013 at 11:38 AM

No, you can't use thrust with managedCuda, as it is a header-only library for C++. If you implement your algorithm in C++, you could try to somehow extract the generated kernels and use them, but I'm not sure how well this would work. On the other hand, you could use the basic reduction routines (min, max, mean, etc.) from the NPP library, which is entirely available from within C#. If you need more sophisticated ones, you're currently left to write your own kernels.

Oct 7, 2013 at 12:07 PM

That was what I feared. The NPP implementations look nice. However, what I have is a huge array of floats (~250,000,000 elements) which I need to bin (~2,500 bins) and then calculate the average per bin in real time. As I cannot afford to copy the array, I need to be able to feed the algorithm a start pointer and an end pointer into the original array (or a start/end index).

I can't see how to do this using NPP. Do you know if it is possible?
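
For reference, the requirement described above can be sketched on the CPU in plain C++. This is only a host-side baseline, not a GPU or NPP solution, and it assumes that each bin is a contiguous index range of the original array. The names mean_range, bin_means and bounds are hypothetical and chosen for illustration. Each bin is read in place through a start and an end pointer, so nothing is copied.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean of the half-open range [first, last); reads in place, no copy.
// Accumulates in double to limit rounding error on long ranges.
double mean_range(const float* first, const float* last) {
    assert(first <= last);
    if (first == last) return 0.0;  // assumption: an empty bin reports 0
    double sum = 0.0;
    for (const float* p = first; p != last; ++p) sum += *p;
    return sum / static_cast<double>(last - first);
}

// bounds holds (number of bins + 1) ascending indices;
// bin i covers the elements [bounds[i], bounds[i + 1]).
std::vector<double> bin_means(const std::vector<float>& data,
                              const std::vector<std::size_t>& bounds) {
    std::vector<double> means;
    for (std::size_t i = 0; i + 1 < bounds.size(); ++i) {
        assert(bounds[i] <= bounds[i + 1] && bounds[i + 1] <= data.size());
        means.push_back(mean_range(data.data() + bounds[i],
                                   data.data() + bounds[i + 1]));
    }
    return means;
}
```

A baseline like this can be handy for checking GPU results on a small sample before running on the full array.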

Oct 7, 2013 at 12:43 PM

Use the following constructor of CudaDeviceVariable:
/// <summary>
/// Creates a new CudaDeviceVariable from an existing CUdeviceptr.
/// devPtr won't be freed while disposing.
/// </summary>
/// <param name="devPtr"></param>
/// <param name="size">Size in Bytes</param>
public CudaDeviceVariable(CUdeviceptr devPtr, SizeT size)
    : this (devPtr, false, size)
This gives you a "view" of your existing variable if you add an offset (in bytes) to devPtr and pass the size of your bin (again in bytes). Doing so only creates a new lightweight wrapper instance around the "real device memory", without any copies. Then you call the NPPs mean function on this new CudaDeviceVariable.
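
The offset and size arithmetic is easy to get wrong because both values are in bytes, not elements. The following is a minimal host-side C++ analogue of the idea, not managedCuda code: FloatView, make_view and view_of_bin are hypothetical names, and an ordinary host pointer stands in for the device pointer. It shows how an element range [start, end) maps to a byte offset and a byte size, and that the resulting view points into the original buffer without copying.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning view: it never frees the memory it points to.
struct FloatView {
    const float* data;
    std::size_t count;
};

// Build a view from a base pointer plus an offset and a size, both in BYTES,
// mirroring the units that the constructor quoted above expects.
FloatView make_view(const float* base, std::size_t offsetBytes,
                    std::size_t sizeBytes) {
    assert(offsetBytes % sizeof(float) == 0);
    assert(sizeBytes % sizeof(float) == 0);
    return FloatView{base + offsetBytes / sizeof(float),
                     sizeBytes / sizeof(float)};
}

// Convert an element range [start, end) into a byte offset and a byte size.
FloatView view_of_bin(const float* base, std::size_t start, std::size_t end) {
    assert(start <= end);
    return make_view(base, start * sizeof(float),
                     (end - start) * sizeof(float));
}
```

For example, the bin covering elements 2 to 4 of a float array starts at byte offset 2 * sizeof(float) and has a size of 3 * sizeof(float) bytes.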

Marked as answer by emher on 10/7/2013 at 10:09 AM
Oct 7, 2013 at 3:58 PM
Take a look at CUB (CUDA Unbound).
It provides highly optimized device primitives for common tasks like sort, scan, reduction, histogram, etc., which is pretty much exactly what you need.
Oct 7, 2013 at 5:08 PM
Edited Oct 7, 2013 at 5:09 PM
@kunzmi - That was exactly what I was looking for. It works exactly as intended.

@RoBiK - That looks promising too. A more complex solution, but it might be faster too. Unlike thrust, the CUB templates seem callable from kernels and hence usable with managed cuda.