
Profiler counters for GPUs

Profiler counters for GPUs with compute capability 2.0

# branch : Number of branches taken by threads executing a kernel. This counter will be incremented by one if at least one thread in a warp takes the branch.
# divergent branch : Number of divergent branches within a warp. This counter will be incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a data dependent conditional branch. The counter will be incremented by one at each point of divergence in a warp.
# sm cta launched : Number of thread blocks launched on a multiprocessor.
# local load : Number of executed local load instructions per warp on a multiprocessor.
# local store : Number of executed local store instructions per warp on a multiprocessor.
# gld request : Number of executed global load instructions per warp on a multiprocessor.
# gst request : Number of executed global store instructions per warp on a multiprocessor.
# shared load : Number of executed shared load instructions per warp on a multiprocessor.
# shared store : Number of executed shared store instructions per warp on a multiprocessor.
# instructions issued : Number of instructions issued, including replays.
# instructions executed : Number of instructions executed, not including replays.
# warps launched : Number of warps launched on a multiprocessor.
# threads launched : Number of threads launched on a multiprocessor.
# active cycles : Number of cycles a multiprocessor has at least one active warp.
# active warps : Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 48.
# l1 global load hit : Number of global load hits in L1 cache.
# l1 global load miss : Number of global load misses in L1 cache.
# l1 local load hit : Number of local load hits in L1 cache.
# l1 local load miss : Number of local load misses in L1 cache.
# l1 local store hit : Number of local store hits in L1 cache.
# l1 local store miss : Number of local store misses in L1 cache.
# l1 shared bank conflicts : Number of shared bank conflicts.
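
The raw counters above are easier to read once they are combined into ratios. Below is a rough sketch of how that might look in plain C++, not something taken from the profiler itself. The struct `ProfilerCounters`, its field names and the three helper functions are hypothetical names invented for this example; the only number taken from the list is the upper bound of 48 active warps per cycle. It assumes all values come from the same kernel run on the same multiprocessor.

```cpp
#include <cstdint>

// Hypothetical container: raw counter values copied by hand from the
// profiler output for one kernel run (names are made up for this sketch).
struct ProfilerCounters {
    std::uint64_t active_cycles;        // "active cycles"
    std::uint64_t active_warps;         // "active warps" (accumulated per cycle)
    std::uint64_t inst_issued;          // "instructions issued" (with replays)
    std::uint64_t inst_executed;        // "instructions executed" (no replays)
    std::uint64_t l1_global_load_hit;   // "l1 global load hit"
    std::uint64_t l1_global_load_miss;  // "l1 global load miss"
};

// Upper bound of active warps per cycle, as stated in the counter list.
constexpr double kMaxActiveWarps = 48.0;

// Division that returns 0 instead of dividing by zero.
inline double safe_ratio(double num, double den) {
    return den > 0.0 ? num / den : 0.0;
}

// Average number of active warps per active cycle, relative to the maximum.
inline double achieved_occupancy(const ProfilerCounters& c) {
    const double avg_warps = safe_ratio(static_cast<double>(c.active_warps),
                                        static_cast<double>(c.active_cycles));
    return avg_warps / kMaxActiveWarps;
}

// Share of issued instructions that were replays (issued minus executed).
inline double replay_fraction(const ProfilerCounters& c) {
    if (c.inst_issued < c.inst_executed) {
        return 0.0;  // inconsistent input, treat as "no replays"
    }
    return safe_ratio(static_cast<double>(c.inst_issued - c.inst_executed),
                      static_cast<double>(c.inst_issued));
}

// Share of global loads that hit in the L1 cache.
inline double l1_global_load_hit_rate(const ProfilerCounters& c) {
    const double total = static_cast<double>(c.l1_global_load_hit) +
                         static_cast<double>(c.l1_global_load_miss);
    return safe_ratio(static_cast<double>(c.l1_global_load_hit), total);
}
```

For example, 24000 accumulated active warps over 1000 active cycles averages 24 warps per cycle, which is half of the 48-warp maximum.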

And at run time, these are the only counters that show up.

However, with the current desktop driver (the 197 series) none of this actually works; the samples complain roughly like this:

d:/bld_sdk10_x64.pl/devtools/SDK10/Compute_3.1/SDK10/Compute/C/src/bandwidthTest/bandwidthTest.cu(602) : cudaSafeCall() Runtime API error : CUDA driver version is insufficient for CUDA runtime version.

In other words, 256+ drivers are needed (255+ on Linux). One could probably try to force the Tesla drivers (which are available) onto the card; people write that this works, but that means installing a second video card and all that...
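
The "driver version is insufficient" error above comes down to comparing two version numbers. In the CUDA runtime API these can be read with `cudaDriverGetVersion` and `cudaRuntimeGetVersion`, which return integers encoded as 1000 * major + 10 * minor (so CUDA 3.1 is 3010). Here is a minimal sketch of that comparison in plain C++. It takes the two integers as arguments so that it compiles without the toolkit; the struct and function names are hypothetical, and the values in the usage note are illustrative, not measured on a real 197 driver.

```cpp
// Hypothetical helper type: a CUDA version split into its two parts.
struct CudaVersion {
    int major_ver;
    int minor_ver;
};

// Decode the integer form used by the CUDA version queries:
// encoded = 1000 * major + 10 * minor, e.g. 3010 -> 3.1.
inline CudaVersion decode_cuda_version(int encoded) {
    CudaVersion v;
    v.major_ver = encoded / 1000;
    v.minor_ver = (encoded % 1000) / 10;
    return v;
}

// The runtime can only be used if the installed driver supports at least
// the same CUDA version. A driver version of 0 means no driver was found.
inline bool driver_is_sufficient(int driver_version, int runtime_version) {
    return driver_version != 0 && driver_version >= runtime_version;
}
```

With a driver that reports 3000 and a runtime that reports 3010, `driver_is_sufficient(3000, 3010)` returns false, which is the situation the error message describes.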