
Extreme increase of memory while running a kernel

Jun 28, 2014 at 8:09 PM
I notice an extreme increase in GPU memory while running a kernel: on the order of GBs, while my data are less than 1 MB. Something is going terribly wrong. Any ideas why this is happening?
Jun 28, 2014 at 9:04 PM
Edited Jun 28, 2014 at 9:05 PM

My guess is that you use the '='-operator of CudaDeviceVariable in a loop like:
CudaDeviceVariable<int> devvar;
int[] hostvar = new int[100];
for (int i = 0; i < 1000000; i++)
{
      devvar = hostvar;
      //... call kernel, do some stuff etc.
}
Note that here the previous devvar object is not disposed before being replaced by a new object created by the '='-operator. So either call devvar.Dispose() before overwriting or, instead of allocating and freeing memory each time, use devvar.CopyToDevice(...).
The '='-operator is not meant for copying, it is for allocation.
But without a code example, this is just wild guessing...
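To make the suggested fix concrete, here is a minimal sketch (assuming managedCuda, where the implicit '='-conversion allocates a new device buffer each time, while CopyToDevice copies into an already allocated one):

```csharp
// Allocate device memory once, outside the loop.
CudaDeviceVariable<int> devvar = new CudaDeviceVariable<int>(100);
int[] hostvar = new int[100];

for (int i = 0; i < 1000000; i++)
{
    // Reuses the existing allocation; no new device memory is created.
    devvar.CopyToDevice(hostvar);
    //... call kernel, do some stuff etc.
}

// Free the device memory once, at the end.
devvar.Dispose();
```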

Jun 28, 2014 at 9:12 PM
I have solved it; the problem was the PTX compilation of the kernel.
No worries, I am a CUDA expert, and in the C language I do not make such mistakes as not disposing memory. :-)
Now on to optimizing the kernels, having solved all other problems apart from that... I just wanted to isolate everything down to the very existence of the kernels... Now let's see...

Jun 28, 2014 at 9:54 PM
Let me show you a bit of the code:
public class GPUParametricFunction
{
        private CudaDeviceVariable<Function> dNodes;
        private PlanNode[] plan;
        private CudaDeviceVariable<int> dIndexMap;
        private CudaDeviceVariable<double> dDerivativeValues;
        private CudaKernel kernSmall;
        private CudaKernel kernLarge;
        private CudaContext ctx;

        public GPUParametricFunction(CudaDeviceVariable<Function> devNodes, PlanNode[] plan,
                                     CudaDeviceVariable<int> devIndexMap, CudaDeviceVariable<double> devDerivativeValues,
                                     CudaContext ctx, CudaKernel kernelSmall, CudaKernel kernelLarge)
        {
            this.ctx = ctx;
            dNodes = devNodes;
            this.plan = plan;
            dIndexMap = devIndexMap;
            dDerivativeValues = devDerivativeValues;
            kernSmall = kernelSmall;
            kernLarge = kernelLarge;
        }

        public void Dispose()
        {
            if (dNodes != null) dNodes.Dispose();
            if (dIndexMap != null) dIndexMap.Dispose();
            if (dDerivativeValues != null) dDerivativeValues.Dispose();
            if (ctx != null) ctx.Dispose();
        }

        public Tuple<double[], double> Differentiate(double[] args, double[] parameters)
        {
            CudaDeviceVariable<double> dArgs = args;
            CudaDeviceVariable<double> dParameters = parameters;
            for (int i = 0; i < plan.Length; i++)
            {
                if (plan[i].smallKernel == 1)
                {
                    kernSmall.BlockDimensions = new dim3(1024, 1, 1);
                    kernSmall.GridDimensions = new dim3(1, args.Length, 1);
                    kernSmall.Run(dNodes.DevicePointer, plan[i].numberOfElements, plan[i].startIndex, args.Length, dIndexMap.DevicePointer,
                                  dDerivativeValues.DevicePointer, plan[i].startDepth, plan[i].endDepth, dArgs.DevicePointer, dParameters.DevicePointer);
                }
                else
                {
                    kernLarge.BlockDimensions = new dim3(1024, 1, 1);
                    kernLarge.GridDimensions = new dim3((int)Math.Ceiling(1.0 * plan[i].numberOfElements / 1024), args.Length, 1);
                    kernLarge.Run(dNodes.DevicePointer, plan[i].numberOfElements, plan[i].startIndex, args.Length, dIndexMap.DevicePointer,
                                  dDerivativeValues.DevicePointer, dArgs.DevicePointer, dParameters.DevicePointer);
                }
            }
            double[] diff = new double[args.Length];
            dDerivativeValues.CopyToHost(diff, 0, 0, args.Length * sizeof(double));
            Function nodeEval = dNodes[0];
            Tuple<double[], double> result = new Tuple<double[], double>(diff, nodeEval.function_value);
            return result;
        }
}
Differentiate is called around 2000 times. As you can see, I am disposing the memory. Though in the GPU meter, as the kernels run, I am seeing a weird increase in memory: it should be less than 100 MB, yet I am seeing an increase of around 300 MB... This is not right. I am certain that the class is constructed only once. I have tried removing the for loop in Differentiate, and then the memory consumption was close to zero, as it should be. So what is going on is beyond me. Any insights?

Jun 28, 2014 at 11:22 PM
Well, allocating and freeing some memory 2000 times of course fragments the available space a lot; I doubt that the CUDA runtime handles this without any loss. I don't know if this explains the 300 MB, but it certainly could...
On the other hand, you also lose quite some time doing all these allocations. Have you tried allocating once and then just reusing the memory by only copying the data to the device?
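Applied to the posted Differentiate, that would mean hoisting the two implicit allocations into fields allocated once (a sketch; the field names, the sizing, and the assumption that the array lengths stay fixed across calls are mine, not the original code):

```csharp
// Allocated once, e.g. in the constructor (sizes assumed fixed across calls):
// dArgs = new CudaDeviceVariable<double>(args.Length);
// dParameters = new CudaDeviceVariable<double>(parameters.Length);

public Tuple<double[], double> Differentiate(double[] args, double[] parameters)
{
    dArgs.CopyToDevice(args);             // copy only; no per-call allocation
    dParameters.CopyToDevice(parameters); // ditto
    // ... launch the kernels over the plan as before ...
    double[] diff = new double[args.Length];
    dDerivativeValues.CopyToHost(diff, 0, 0, args.Length * sizeof(double));
    return new Tuple<double[], double>(diff, dNodes[0].function_value);
}
```

The buffers would then be freed a single time in Dispose(), alongside the other device variables.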

Jun 29, 2014 at 12:45 AM
Yes, I have thought of this also; will try it. You are right, I need to preallocate these as well. Frankly, it is now down to kernel optimization; the integration into a very big code base has finished successfully. Tree-structure concepts are hard for CUDA, as I am seeing, especially in this case. It does not matter; it is the first attempt to deal with it this way. For a first attempt, overall I am satisfied.