Async driver calls running only on stream 0

Jan 5, 2014 at 11:01 AM
I am trying to run async methods (such as RunAsync, CopyToDeviceAsync, etc.).
I've created a few CudaStream instances on thread 0 and passed them to methods running on different threads (one thread per stream). The profiler results for the application show that all driver API calls were executed on stream 0 only.
Any idea what could cause this?

Thanks in advance.
Jan 5, 2014 at 12:35 PM
Hi pizo,

Can you give a little more information about your problem? Are you using a single CudaContext for all threads or one per thread? Are you using pinned host memory? A small code snippet might help to reproduce your case.
Have you also tried profiling a native C/CUDA application that uses async methods, e.g. the concurrentKernels sample (in ProgramData\NVIDIA Corporation\CUDA Samples\v5.5\6_Advanced\concurrentKernels)? That would show whether this is a managedCUDA issue or a problem with your machine.

Jan 6, 2014 at 11:02 AM
Edited Jan 6, 2014 at 11:03 AM
I am using one CudaContext for all of the threads. I did try profiling the CUDA streams example, and I clearly saw that it works properly. Unfortunately, I can't show you a code snippet, but the general case is:
  • A CUDA context is created in the main thread.
  • The CudaStreams (A and B) are created in the main thread.
  • Two different threads (A and B) run the same code, which calls the async driver methods in this order (each thread uses a different CudaStream instance):
    • AsyncCopyToDevice
    • RunAsync
    • AsyncCopyToHost
I chose not to use page-locked memory. I checked the same setup in CUDA C and it works (the profile shows two streams).
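
For anyone trying to reproduce this, the setup described above can be sketched in plain CUDA C++ roughly as follows. This is only an illustration using the runtime API, not the original managedCuda code: the kernel `scale`, the helper `worker`, and the buffer size are made up for the example. It uses pageable host memory, as in the description.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs on its own host thread and issues all work on the stream it was given:
// copy to device, kernel launch, copy back to host.
void worker(cudaStream_t stream, float* host, float* dev, int n) {
    size_t bytes = n * sizeof(float);
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // wait for this stream only
}

int main() {
    const int n = 1 << 20;
    cudaStream_t streams[2];
    float* dev[2];
    std::vector<float> host[2];  // pageable (not page-locked) host memory

    // Streams and buffers are created on the main thread.
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&dev[s], n * sizeof(float));
        host[s].assign(n, 1.0f);
    }

    // One host thread per stream (threads A and B).
    std::thread a(worker, streams[0], host[0].data(), dev[0], n);
    std::thread b(worker, streams[1], host[1].data(), dev[1], n);
    a.join();
    b.join();

    printf("host[0][0] = %f, host[1][0] = %f\n", host[0][0], host[1][0]);

    for (int s = 0; s < 2; ++s) {
        cudaFree(dev[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

Building this should only need something like `nvcc -std=c++11` (plus a pthread flag on some Linux setups). Running it under the profiler from the very start of the process should show two non-default streams.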

My device properties are: Tesla K20c (compute capability 3.5)

Jan 6, 2014 at 12:20 PM
I'm still trying to reproduce the problem, but without any luck. Nsight always reports the right number of streams with the right stream IDs (on a GeForce TITAN). Could you please check two more things? First, run the simpleStreams sample from the managedCuda samples in your profiler: do the streams appear correctly there? That way we test the same code and may spot a difference. Second, can you test with page-locked memory? Without page-locked memory the streams get serialized for sure, but I don't know if and how this affects the stream IDs in the profiler.
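
As a rough illustration of the page-locked variant in native CUDA (runtime API, not managedCuda), a minimal sketch could look like this. The buffer size and variable names are assumptions made for the example. The only real change from a pageable version is that the host buffer comes from `cudaMallocHost` instead of `malloc` or `new`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* pinned = nullptr;  // page-locked host buffer
    float* dev = nullptr;
    cudaStream_t stream;

    // cudaMallocHost returns page-locked (pinned) host memory.
    if (cudaMallocHost(&pinned, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed\n");
        return 1;
    }
    cudaMalloc(&dev, bytes);
    cudaStreamCreate(&stream);

    for (int i = 0; i < n; ++i) pinned[i] = 1.0f;

    // Both copies are queued on the non-default stream.
    cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(pinned, dev, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("pinned[0] = %f\n", pinned[0]);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(pinned);  // pinned memory must be released with cudaFreeHost
    return 0;
}
```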

Feb 16, 2015 at 2:36 PM
Hi Michael

The problem was that I started the Nsight profile after the streams had been created, so the profiler didn't recognize that I was using streams.

Thank you and sorry for the very long delay!