Why doesn't performance improve?

Jan 26, 2014 at 3:53 PM
Please help me!

I need to use parallel computation in my C# project, so I am trying CUDA.
I installed the latest version of the CUDA toolkit and tested the following code:
using System;
using System.Diagnostics;

using ManagedCuda;
using ManagedCuda.BasicTypes;

namespace ConsoleApplication1
{
    public partial class PredictSP500
    {
        const int VECTOR_SIZE = 34000000;

        static CudaKernel addTwoVectorWithCuda;

        static void InitKernels()
        {
            CudaContext cntxt = new CudaContext();
            CUmodule cumodule = cntxt.LoadModule(@"C:\Projects\MatrixCalc\MatrixCalc\Debug\kernel.ptx");
            addTwoVectorWithCuda = new CudaKernel("_Z6kernelPdS_S_i", cumodule, cntxt);
            addTwoVectorWithCuda.BlockDimensions = 256;
            addTwoVectorWithCuda.GridDimensions = (VECTOR_SIZE + 255) / 256;   // enough 256-thread blocks to cover all VECTOR_SIZE elements
        }

        static Func<Double[], Double[], int, Double[]> addVectors = (a, b, size) =>
        {
            // init parameters
            CudaDeviceVariable<Double> vector_hostA = a;
            CudaDeviceVariable<Double> vector_hostB = b;
            CudaDeviceVariable<Double> vector_hostOut = new CudaDeviceVariable<Double>(size);
            // run cuda method
            addTwoVectorWithCuda.Run(vector_hostA.DevicePointer, vector_hostB.DevicePointer, vector_hostOut.DevicePointer, size);
            // copy return to host
            Double[] output = new Double[size];
            vector_hostOut.CopyToHost(output);
            return output;
        };

        static void Main(string[] args)
        {
            InitKernels();
            Random rand = new Random(DateTime.Now.Millisecond);

            Double[] vectorA = new Double[VECTOR_SIZE];
            Double[] vectorB = new Double[VECTOR_SIZE];
            Double[] vector = new Double[VECTOR_SIZE];

            for (int i = 0; i < VECTOR_SIZE; i++)
            {
                vectorA[i] = rand.NextDouble();
                vectorB[i] = rand.NextDouble();
            }

            Stopwatch stp = new Stopwatch();    // Test performance of GPU
            stp.Start();

            vector = addVectors(vectorA, vectorB, VECTOR_SIZE);

            stp.Stop();
            Console.WriteLine("time gpu: " + stp.Elapsed.TotalMilliseconds);

            Stopwatch stp2 = new Stopwatch();   // Test performance of CPU
            stp2.Start();

            for (int i = 0; i < VECTOR_SIZE; i++)
            {
                vector[i] = vectorA[i] * vectorB[i] * i;
            }

            stp2.Stop();
            Console.WriteLine("time cpu: " + stp2.Elapsed.TotalMilliseconds);

            Console.ReadKey();
        }

    }
}
And here is the code for the video card:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void kernel(double* a, double* b, double* out, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        out[i] = a[i] * b[i];
}
 
int main()
{
    // Number of CUDA devices
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("CUDA Device Query...\n");
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through devices
    for (int i = 0; i < devCount; ++i)
    {
        // Get device properties
        printf("\nCUDA Device #%d\n", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
    }
    return 0;
}
I must be doing something wrong, because this is the result:

time gpu: 270
time cpu: 120

What do you think the reason could be? Please help!
Coordinator
Jan 26, 2014 at 4:37 PM
Edited Jan 26, 2014 at 5:03 PM
You are comparing two different things. Here is what you do during the GPU time measurement:
  • twice: allocate memory on the device and copy data host -> device
CudaDeviceVariable<Double> vector_hostA = a;
CudaDeviceVariable<Double> vector_hostB = b;
  • allocate memory on the device
CudaDeviceVariable<Double> vector_hostOut = new CudaDeviceVariable<Double>(size);
  • launch kernel
addTwoVectorWithCuda.Run(vector_hostA.DevicePointer, vector_hostB.DevicePointer, vector_hostOut.DevicePointer, size);
  • allocate memory on host and copy data device -> host
Double[] output = new Double[size];
vector_hostOut.CopyToHost(output);
On the CPU side, by contrast, you only do the multiplication.

So to compare times, you should either also count the allocation times on the CPU side, or use previously allocated CudaDeviceVariables on the GPU side. Also, most managedCuda classes implement the IDisposable interface, meaning you are responsible for cleaning up unmanaged resources. This means that before you exit your "addVectors", you should call vector_hostA.Dispose() (and the same for the others). By the way, why are they called vector_HOST? They are allocated on the DEVICE...
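As a rough sketch of the second option, the device buffers could be allocated once, outside the timed region, and released with Dispose() when they are no longer needed. The class name DeviceVectors and its members are hypothetical; only CudaDeviceVariable, CopyToDevice, CopyToHost and Dispose come from managedCuda, and a CudaContext is assumed to exist already (InitKernels in the question creates one).

```csharp
using System;
using ManagedCuda;

// Hypothetical helper: owns three device buffers that are allocated once
// and can be reused for every kernel launch.
public sealed class DeviceVectors : IDisposable
{
    public readonly CudaDeviceVariable<double> A;
    public readonly CudaDeviceVariable<double> B;
    public readonly CudaDeviceVariable<double> Out;

    // Assumes a CudaContext has already been created on this thread.
    public DeviceVectors(int size)
    {
        A = new CudaDeviceVariable<double>(size);
        B = new CudaDeviceVariable<double>(size);
        Out = new CudaDeviceVariable<double>(size);
    }

    // Host -> device copy of both inputs.
    public void Upload(double[] a, double[] b)
    {
        A.CopyToDevice(a);
        B.CopyToDevice(b);
    }

    // Device -> host copy of the result into an existing array.
    public void Download(double[] result)
    {
        Out.CopyToHost(result);
    }

    // Frees the unmanaged device memory held by the three buffers.
    public void Dispose()
    {
        A.Dispose();
        B.Dispose();
        Out.Dispose();
    }
}
```

Wrapped in a using statement, such an object would be created and filled with Upload before the Stopwatch starts, so that only the kernel launch (and, if desired, Download) falls inside the measured region.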

If you only want to know the time spent on pure computation on the GPU, take the return value of addTwoVectorWithCuda.Run(): that is the time needed to execute the kernel, in milliseconds.
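A minimal sketch of reading that value, assuming the kernel and the device buffers are already set up as in the question; the names KernelTiming and TimeKernel are hypothetical. The Stopwatch is included only to show the wall-clock time of the same call next to the kernel time that Run() returns.

```csharp
using System;
using System.Diagnostics;
using ManagedCuda;

public static class KernelTiming
{
    // Hypothetical helper: launches the kernel once and prints both timings.
    // Returns the kernel execution time in milliseconds, as reported by Run().
    public static float TimeKernel(CudaKernel kernel,
                                   CudaDeviceVariable<double> a,
                                   CudaDeviceVariable<double> b,
                                   CudaDeviceVariable<double> result,
                                   int size)
    {
        Stopwatch wall = Stopwatch.StartNew();
        float kernelMs = kernel.Run(a.DevicePointer, b.DevicePointer,
                                    result.DevicePointer, size);
        wall.Stop();

        Console.WriteLine("kernel time (ms): " + kernelMs);
        Console.WriteLine("wall-clock time of Run (ms): " + wall.Elapsed.TotalMilliseconds);
        return kernelMs;
    }
}
```

No memory is allocated or copied inside this method, so neither number includes transfer costs.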

Best,
Michael
Jan 26, 2014 at 5:20 PM
Thank you very much! You're right! I measured the performance of the operations excluding the memory transfers, and the GPU really is about 6 times faster than the CPU. But how can I work with the GPU without copying back and forth? I was hoping the gain would be large enough to cover those costs, but alas. Now I have to rewrite the code so that the data is stored in the card's memory right away, or something like that.
By the way, why are they called vector_HOST?
I took that example from some site and used it without major changes :)
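One way to avoid copying back and forth, sketched here under the assumption that several kernel launches work on the same data: upload the inputs once, let each launch read the previous launch's output directly from device memory, and copy to the host only at the end. The names DeviceResident, MultiplyRepeatedly and passes are hypothetical; the kernel is assumed to be the one from the question (out[i] = a[i] * b[i]), with its grid already sized to cover the data.

```csharp
using System;
using ManagedCuda;

public static class DeviceResident
{
    // Hypothetical helper: computes a[i] * b[i]^passes on the GPU with one
    // upload per input and one download, however many launches are made.
    public static double[] MultiplyRepeatedly(CudaKernel kernel, double[] a, double[] b, int passes)
    {
        int size = a.Length;
        using (CudaDeviceVariable<double> devB = b)          // allocate + copy host -> device
        {
            CudaDeviceVariable<double> current = a;          // allocate + copy host -> device
            CudaDeviceVariable<double> next = new CudaDeviceVariable<double>(size);
            try
            {
                for (int pass = 0; pass < passes; pass++)
                {
                    // next = current * b, computed entirely in device memory.
                    kernel.Run(current.DevicePointer, devB.DevicePointer, next.DevicePointer, size);

                    // Swap roles so the next pass reads this pass's output.
                    CudaDeviceVariable<double> tmp = current;
                    current = next;
                    next = tmp;
                }

                double[] result = new double[size];
                current.CopyToHost(result);                  // single device -> host copy
                return result;
            }
            finally
            {
                current.Dispose();
                next.Dispose();
            }
        }
    }
}
```

The transfer cost is then paid once per data set rather than once per operation, so it shrinks relative to the compute time as more work is done between the upload and the download.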