This project is read-only.

Performance problems with two GPUs

Feb 16, 2015 at 3:31 PM

I'm using CUDA 5.5 and the matching ManagedCUDA version. I'm running into performance problems when trying to use multiple GPUs, and I'd be happy to hear your advice.
My setup is like this:
  1. I have two GPUs, and a context for each.
  2. I am using multiple streams with a different CPU thread for each stream.
  3. Every time I call the CUDA API in a different CPU thread, I use the SetCurrent() and Push/PopContext() functions.
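To make the setup concrete, here is a minimal sketch of it in ManagedCUDA (the kernel launch is left as a comment, since the actual kernels and arguments are specific to my application):

```csharp
using System.Threading;
using ManagedCuda;

class MultiGpuSetup
{
    static void Main()
    {
        // One context per device (point 1).
        var ctx0 = new CudaContext(0);
        var ctx1 = new CudaContext(1);

        // One CPU thread per stream (point 2); each thread binds the
        // device's context to itself before any CUDA call (point 3).
        var worker = new Thread(() =>
        {
            ctx0.SetCurrent();            // bind device 0's context to this thread
            var stream = new CudaStream();
            // ... enqueue async kernel launches / copies on 'stream' ...
            stream.Synchronize();
        });
        worker.Start();
        worker.Join();
    }
}
```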
What I see (using the NSight performance profiler):
  1. When using only one GPU with multi-streams: There's great concurrency; you can see the kernel launches divided into streams, occurring simultaneously and the GPU is maxed out. You can literally see the blocks interleaved perfectly.
  2. When using both GPUs with multi-streams: The concurrency is very bad. For a short while some kernels occupy all of the GPU resources, but then only one stream remains active while the rest look "dormant". I mean, on NSight you can see only one "line" of kernel launches, matching the single active stream, while the rest are empty.
Feb 16, 2015 at 10:53 PM
To check if I understood your question: You have two GPUs, two contexts and on each device multiple streams running concurrently with one CPU-thread per stream?

First of all, you shouldn’t mix SetCurrent() and the Push/PopContext() functions; it’s one or the other, where SetCurrent() is the newer method for switching contexts and probably the better one. (Also see the programming guide for more info on that.) Then I don’t see why you need multiple CPU threads for multiple streams. Streams are asynchronous, so one CPU thread per device should be sufficient, and that way you avoid the context switches entirely. In larger applications I usually have one worker thread per context, and that thread is the only one used for communication with its device.
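The one-thread-per-device pattern I mean could be sketched like this (kernel names and launch arguments are placeholders, not part of your code):

```csharp
using System.Threading;
using ManagedCuda;

class DeviceWorkers
{
    // One worker thread owns one context; all streams for that device
    // live on this thread, so no SetCurrent()/Push/PopContext()
    // switching is ever needed after the initial bind.
    static void DeviceWorker(int deviceId, int streamCount)
    {
        var ctx = new CudaContext(deviceId);
        ctx.SetCurrent();                 // called once, on this thread only

        var streams = new CudaStream[streamCount];
        for (int i = 0; i < streamCount; i++)
            streams[i] = new CudaStream();

        // Enqueue each processing chain on its own stream; launches are
        // asynchronous, so a single CPU thread keeps all streams busy:
        // kernel.RunAsync(streams[i].Stream, args...);   // hypothetical kernel

        foreach (var s in streams)
            s.Synchronize();
        ctx.Dispose();
    }

    static void Main()
    {
        var t0 = new Thread(() => DeviceWorker(0, 4));
        var t1 = new Thread(() => DeviceWorker(1, 4));
        t0.Start(); t1.Start();
        t0.Join(); t1.Join();
    }
}
```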

Another possibility is that NSight is simply not capable of capturing all devices at the same time; I never tried that.

Finally, when using multiple devices I switched to MPI, i.e. executing multiple processes (one per device) instead of driving multiple devices from one process. If one is also dealing with libraries like CUFFT or NPP, multithreading simply gets too ugly, with random crashes… (I’m mostly using this in a native C++ application.)
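A sketch of that one-process-per-device idea, assuming the device index is passed on the command line (an MPI rank would serve the same purpose in a native setup):

```csharp
using ManagedCuda;

// Launch one process per device, e.g.:
//   myapp.exe 0
//   myapp.exe 1
class Program
{
    static void Main(string[] args)
    {
        int deviceId = int.Parse(args[0]);
        var ctx = new CudaContext(deviceId);   // the only context in this process
        // ... run the full processing chain for this device ...
        ctx.Dispose();
    }
}
```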
Feb 17, 2015 at 1:13 PM
Hi, thanks for the quick reply.

You understood correctly.
It looks like NSight does support multiple-device sampling, because you can see it recognize both of my devices.
I also switched to only using SetCurrent(), still no change :(

My processing chain is serial, and what I'm doing is running multiple processing chains simultaneously, so I need a different thread for each stream. I want to add another GPU to improve my performance.

I duplicated my processing chains so that every device has its own independent process, with a context referring to a different device, and tested the following scenarios:
  1. I profiled each of the processes separately, each one of them worked properly alone.
  2. I profiled only the first process, referencing Device 0, and let it work for 15 seconds. Afterwards, I launched the second process (without profiling), which referred to Device 1, let it work for a few seconds, and then stopped profiling.
In the results of the latter scenario, I could see that during the first 15 seconds (in which only one device was working), all of the streams worked as expected. After I launched the second process, the streams again didn't work properly, and the API call times increased significantly (from ~9ms to almost ~90ms!!).

I am starting to suspect a hardware problem (though I can't be sure): both of my GPUs are connected to my server with "OneStop" (external hardware connected via PCIe).

Thanks again.