Block/Grid query

Jul 7, 2015 at 2:31 PM
Hi Michael,

I've been having some trouble getting a kernel of mine working as expected. So to track down the problem, I threw in a few printf statements. Eventually, I narrowed the code down to the following kernel:
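// Presumably editor (IntelliSense) workarounds; nvcc defines __CUDACC__ and
// __cplusplus itself, so the guarded defines below are no-ops when compiling.
#define _SIZE_T_DEFINED
#ifndef __CUDACC__
#define __CUDACC__
#endif
#ifndef __cplusplus
#define __cplusplus
#endif

#include <cstdio>
#include <cuda.h>
#include <device_launch_parameters.h>
#include <texture_fetch_functions.h>
#include <float.h>
#include <builtin_types.h>
#include <vector_functions.h>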

extern "C"
{
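    // Prints one line per thread. All six values are unsigned int, hence %u;
    // the last three are grid dimensions (gridDim), not block dimensions.
    __global__ void TestKernel()
    {
        printf("b.x, b.y, b.z, gd.x, gd.y, gd.z = %u, %u, %u, %u, %u, %u\n",
            blockIdx.x, blockIdx.y, blockIdx.z,
            gridDim.x, gridDim.y, gridDim.z
            );
        __syncthreads();
    }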
}
and the calling code:
using ManagedCuda;
using ManagedCuda.VectorTypes;

namespace TestCuda
{
    class Program
    {
        static void Main(string[] args)
        {
            var cntxt = new CudaContext();
            var cumodule =
                cntxt.LoadModule(@".\kernel.ptx");
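            // 5 x 28 x 2 = 280 blocks of 27 x 2 x 1 = 54 threads,
            // i.e. 15,120 threads, each making one printf call.
            var kernel = new CudaKernel("TestKernel", cumodule, cntxt)
            {
                GridDimensions = new dim3(5, 28, 2),
                BlockDimensions = new dim3(27, 2),
            };

            // Launch the kernel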
            kernel.Run();

            cntxt.UnloadKernel(kernel);
        }
    }
}
That is enough to demonstrate my problem.

When I run that kernel, the output contains no lines for blocks with blockIdx.y less than 12.

I'm not sure why this happens; presumably there is something about grid dimensions that I don't understand. Could you spare the time to set me straight, please?
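A quick way to tell whether the y < 12 blocks are missing from the launch or only from the printed output is to count threads per blockIdx.y value instead of printing. Here is a minimal sketch using the CUDA runtime API directly rather than ManagedCuda; the names CountBlocksPerY and EveryYIndexRan are hypothetical. With the launch above, each of the 28 y values covers 5 x 2 = 10 blocks of 27 x 2 = 54 threads, so every counter should read 540, and the original kernel issues 280 x 54 = 15,120 printf calls in total.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical diagnostic kernel: every thread adds 1 to the counter
// belonging to its block's y index instead of calling printf.
__global__ void CountBlocksPerY(unsigned int *countsPerY)
{
    atomicAdd(&countsPerY[blockIdx.y], 1u);
}

constexpr unsigned int kGridY = 28;

// Launches the post's configuration and returns true if every
// blockIdx.y value in [0, 28) was reached by the expected number of threads.
bool EveryYIndexRan()
{
    const dim3 grid(5, kGridY, 2);
    const dim3 block(27, 2);  // z defaults to 1

    unsigned int *dCounts = nullptr;
    if (cudaMalloc(&dCounts, kGridY * sizeof(unsigned int)) != cudaSuccess)
        return false;
    cudaMemset(dCounts, 0, kGridY * sizeof(unsigned int));

    CountBlocksPerY<<<grid, block>>>(dCounts);
    if (cudaGetLastError() != cudaSuccess)  // catches launch-configuration errors
    {
        cudaFree(dCounts);
        return false;
    }

    // cudaMemcpy waits for the kernel on the default stream to finish.
    unsigned int hCounts[kGridY] = {};
    const cudaError_t err =
        cudaMemcpy(hCounts, dCounts, sizeof(hCounts), cudaMemcpyDeviceToHost);
    cudaFree(dCounts);
    if (err != cudaSuccess)
        return false;

    // Threads per y value: grid.x * grid.z blocks of block.x * block.y * block.z threads.
    const unsigned int expected = grid.x * grid.z * block.x * block.y * block.z;
    bool ok = true;
    for (unsigned int y = 0; y < kGridY; ++y)
    {
        if (hCounts[y] != expected)
        {
            printf("blockIdx.y = %u: %u threads (expected %u)\n", y, hCounts[y], expected);
            ok = false;
        }
    }
    return ok;
}

int main()
{
    const bool ok = EveryYIndexRan();
    puts(ok ? "All 28 y indices ran." : "Some y indices are missing.");
    return ok ? 0 : 1;
}
```

If all 28 counters come back as 540, the blocks did run, and the gaps are in the printf output rather than in the grid.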

Cheers,
Paul
Jul 7, 2015 at 4:43 PM
Edited Jul 7, 2015 at 4:51 PM
OK, I think I've spotted it. Device-side printf writes into a fixed-size buffer (1 MB by default), and once that buffer is full, older output is overwritten. That explains some of the output behaviour I'm seeing. So, next question: how do I change the buffer size? I've put the following in, but it doesn't seem to have made any difference:
            var cntxt = new CudaContext();
            // Must be set before any kernel that calls printf is launched in this context.
            cntxt.SetLimit(CULimit.PrintfFIFOSize, new SizeT(1024 * 1024 * 10));
EDIT: OK, it seems the change never compiled properly, because some other process was holding a lock on the .exe. All good now.
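For anyone doing the same thing with the CUDA runtime API instead of ManagedCuda, a minimal sketch of the equivalent is cudaDeviceSetLimit with cudaLimitPrintfFifoSize. ManagedCuda's SetLimit with CULimit.PrintfFIFOSize presumably wraps the driver API's cuCtxSetLimit with CU_LIMIT_PRINTF_FIFO_SIZE; that mapping is an assumption here. The helper name RaisePrintfFifo is hypothetical. According to the CUDA driver API documentation, this limit has to be set before launching any kernel that uses printf, which is why it comes first in main.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The kernel from the post, with %u for the unsigned index fields.
__global__ void TestKernel()
{
    printf("b.x, b.y, b.z, gd.x, gd.y, gd.z = %u, %u, %u, %u, %u, %u\n",
           blockIdx.x, blockIdx.y, blockIdx.z,
           gridDim.x, gridDim.y, gridDim.z);
}

constexpr size_t kPrintfFifoBytes = 10u * 1024u * 1024u;  // 10 MB, as in the post

// Hypothetical helper: requests a larger device printf buffer and returns the
// size actually in effect (0 on failure). Call it before any printf kernel runs.
size_t RaisePrintfFifo(size_t bytes)
{
    if (cudaDeviceSetLimit(cudaLimitPrintfFifoSize, bytes) != cudaSuccess)
        return 0;
    size_t actual = 0;
    if (cudaDeviceGetLimit(&actual, cudaLimitPrintfFifoSize) != cudaSuccess)
        return 0;
    return actual;
}

int main()
{
    const size_t actual = RaisePrintfFifo(kPrintfFifoBytes);
    if (actual == 0)
    {
        fprintf(stderr, "Could not change the printf FIFO size\n");
        return 1;
    }
    printf("printf FIFO size: %zu bytes\n", actual);

    TestKernel<<<dim3(5, 28, 2), dim3(27, 2)>>>();

    // Buffered device printf output is written to stdout at synchronisation
    // points such as this one.
    const cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Whether a given size is enough depends on how much each printf call stores in the buffer. Printing from only one thread per block (for example, when threadIdx.x == 0 && threadIdx.y == 0) would cut this launch from 15,120 lines to 280.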