cuEventSynchronize API call performance

May 23, 2014 at 12:55 PM
Edited May 23, 2014 at 1:00 PM
Hi, I need to launch small kernels in rapid succession. A single kernel takes anywhere from tens or hundreds of µs to a few ms to complete. nvprof reports the following:

Time(%)  Time      Calls  Avg       Min       Max       Name
65.95%   2.94368s  454    6.4839ms  1.1734ms  13.511ms  Kernel1
31.14%   1.38990s  681    2.0410ms  14.304us  6.5109ms  Kernel2
 2.31%   103.31ms  227    455.12us  448.61us  513.06us  Kernel3
a few more...

==3032== API calls:
Time(%)  Time      Calls  Avg       Min       Max       Name
89.45%   4.61551s  2043   2.2592ms  51.957us  13.757ms  cuEventSynchronize

If you add up all the kernel launches, they come to 2043 calls, so cuEventSynchronize is called after every kernel launch, and it takes slightly more time than the kernel execution itself. I have read somewhere that in unmanaged code the CPU can launch something like 300 000 000 empty kernels per second.

The question is: is this a problem with managedCuda (and how can I turn it off)?
May 23, 2014 at 3:20 PM
Edited May 23, 2014 at 11:04 PM

managedCuda explicitly calls cuEventSynchronize after each kernel call, which on standard drivers without TCC enabled can slow overall performance somewhat for small kernels. To avoid this, you can try several things: use a different context scheduling flag, e.g. CUCtxFlags.SchedSpin, or remove the explicit sync in the kernel call if it is not needed. You can also try the asynchronous kernel call methods (kernel.RunAsync(...)).

EDIT: I re-read your post and now I get your point. The thing is that you can't sum the numbers you show: kernel execution and cuEventSynchronize run in parallel. cuEventSynchronize keeps the host in sync with the device, so the time spent there must be more or less equal to the overall kernel runtime. As I said before, you can try to avoid the sync and gain a few µs to ms, depending on your system. But if you have some implicit synchronization after each kernel call, e.g. a copy from device to host, you would see the time spent in that copy call instead of in cuEventSynchronize.
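
As a rough check, the figures from the question can be put side by side in a few lines of Python. This is only a sketch: it uses the three kernels that are actually listed (the "a few more..." entries are not in the post, so they are left out), and the variable names are made up for illustration.

```python
# Figures copied from the nvprof output in the question (times in seconds).
# Only the three listed kernels are included; the "a few more..." entries
# are not shown in the post, so they are left out here.
kernels = {
    "Kernel1": (2.94368, 454),
    "Kernel2": (1.38990, 681),
    "Kernel3": (0.10331, 227),
}
sync_total_s = 4.61551   # cuEventSynchronize, total time
sync_calls = 2043        # cuEventSynchronize, number of calls

listed_kernel_s = sum(t for t, _ in kernels.values())
listed_calls = sum(n for _, n in kernels.values())

# If the host waits for every kernel, the wait time tracks the kernel time.
ratio = sync_total_s / listed_kernel_s

# Upper bound on the extra host time per sync call: the unlisted kernels
# also ran during this gap, so the real overhead is smaller.
slack_per_call_s = (sync_total_s - listed_kernel_s) / sync_calls

print(f"listed kernel time : {listed_kernel_s:.5f} s over {listed_calls} launches")
print(f"sync / kernel time : {ratio:.3f}")
print(f"slack per sync call: {slack_per_call_s * 1e6:.1f} us (upper bound)")
```

The ratio comes out at about 1.04, so the time in cuEventSynchronize is essentially the kernel runtime seen from the host side, not an extra 4.6 s on top of it. The gap left over is under 90 µs per call, and part of that belongs to the kernels that are not listed.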

Long story short, I don't think that there is an actual problem here, just some misreading of numbers.

Marked as answer by Gwynbleidd on 6/13/2014 at 6:08 AM
Jun 13, 2014 at 1:07 PM
Hi Michael,

I didn't realize at the time that the RunAsync method still serializes kernels that are in the same stream, which is exactly what I needed.

Thank you, Martin.
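
For later readers, the in-order behaviour described above can be pictured with a toy model in plain Python. This is not managedCuda or CUDA code: ToyStream, run_async, synchronize and make_kernel are made-up names, and a single worker thread stands in for the device. It only illustrates the idea that launches queued on one stream return immediately but still execute one after another, so a single wait at the end is enough.

```python
import queue
import threading
import time


class ToyStream:
    """Toy stand-in for a CUDA stream: work runs in submission order."""

    def __init__(self):
        self._tasks = queue.Queue()
        # A single worker plays the role of the device draining the stream.
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        while True:
            task = self._tasks.get()
            task()
            self._tasks.task_done()

    def run_async(self, task):
        # Enqueue and return at once; the caller does not wait for the task.
        self._tasks.put(task)

    def synchronize(self):
        # Block the caller until everything queued so far has finished.
        self._tasks.join()


def make_kernel(name, log, seconds):
    def kernel():
        time.sleep(seconds)  # pretend to compute
        log.append(name)
    return kernel


log = []
stream = ToyStream()
stream.run_async(make_kernel("Kernel1", log, 0.02))
stream.run_async(make_kernel("Kernel2", log, 0.01))
stream.run_async(make_kernel("Kernel3", log, 0.0))
stream.synchronize()  # one wait at the end instead of one per launch
print(log)  # ['Kernel1', 'Kernel2', 'Kernel3']
```

Even though Kernel1 is the slowest task, the log keeps the submission order, because the single queue never starts a task before the previous one has finished.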