I had modified the supplied SDK template code in a minimal way in order to measure CPU vs GPU performance and found that for the simple test code (1 float multiplication) that the E8400 CPU with a claimed 24 Gflops was handily out performing a GPU with a theoretical max 504 Gflops. Where was all the time going? Was the kernel the culprit, the memory copy or something else? I started out by trying to reuse the
cutilCheckError( cutStartTimer( timer));
timing method already in the template. Looking into the CUDA source in SDK\common\src\stopwatch_win.cpp showed that on Windows it was using the QueryPerformanceFrequency method which uses the highest possible resolution hardware timer … on the CPU. Using it to measure GPU performance is problematic because timing the GPU using a CPU timer requires the GPU and the CPU to be synchronised with:
cudaThreadSynchronize();
and ruins the timing information. To measure times on the GPU I needed to use GPU based timing on stream 0 using events:
cudaEventRecord(start, 0);
So I created an array of start and stop events, broke the GPU processes into 5 steps and timed everything. The 5 GPU processes were:
1) Alloc: Host to Device – The allocation of memory on the device for the input array which needed to be copied over from the host.
2) Copy: Host to Device – Copying the input array from the host onto the device. Data size divided by time taken here would give bandwidth.
3) Alloc: Device to Host – The allocation of memory on the device for the output array where the result would be stored before being copied back to the host.
4) Compute – Running the actual kernel, reading from the input array, processing and writing results to the output array.
5) Copy: Device to Host – Copying the output array back to the host.
I also retained my CPU timing to measure the amount of time it took for the GPU to do everything and get the answer back onto the host – that way I’d have a 1:1 comparison against the CPU version. That gives one more thing to measure, how does the sum of the GPU times compare to the overall CPU time?
6) Sync with CPU – CPU time minus sum of GPU times indicates how long it takes to sync the two.
Set up 5 GPU timers to get a breakdown of where the GPU was spending time and keep the 2 CPU timers for the original comparison:
// GPU timers - used to time GPU streams
int cGpuTimer = 5;
cudaEvent_t* rgGpuTimer_start = (cudaEvent_t*) malloc (sizeof(cudaEvent_t)*cGpuTimer);
cudaEvent_t* rgGpuTimer_stop = (cudaEvent_t*) malloc (sizeof(cudaEvent_t)*cGpuTimer);
for (int i=0;i<cGpuTimer;i++)
{
cutilSafeCall( cudaEventCreate( &rgGpuTimer_start[i] ) );
cutilSafeCall( cudaEventCreate( &rgGpuTimer_stop[i] ) );
}
and wrap all the GPU calls with timing calls:
cutilCheckError( cutStartTimer( rgTimer[0]));
// Alloc: Host to Device
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[0], 0 ) );
float* d_idata;
cutilSafeCall( cudaMalloc( (void**) &d_idata, global_mem_size));
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[0], 0 ) );
// Copy: Host to Device
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[1], 0 ) );
cutilSafeCall( cudaMemcpy( d_idata, h_idata, global_mem_size, cudaMemcpyHostToDevice) );
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[1], 0 ) );
// Alloc: Device to Host
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[2], 0 ) );
float* d_odata;
cutilSafeCall( cudaMalloc( (void**) &d_odata, global_mem_size)); // The pad won't be read back
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[2], 0 ) );
// Compute
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[3], 0 ) );
dim3 gridDim ( cBlocksPerGridx, cBlocksPerGridy, 1);
dim3 blockDim( cThreadsPerBlock, 1, 1);
testKernel<<< gridDim, blockDim, shared_mem_size >>>( d_idata, d_odata, cData);
cutilCheckMsg("Kernel execution failed");
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[3], 0 ) );
// Copy: Device to Host
cutilSafeCall( cudaEventRecord( rgGpuTimer_start[4], 0 ) );
cutilSafeCall( cudaMemcpy( h_odata, d_odata, global_mem_size, cudaMemcpyDeviceToHost) );
cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[4], 0 ) );
cudaThreadSynchronize(); // Block until memory copy is done to ensure accurate timing
cutilCheckError( cutStopTimer( rgTimer[0]));
With this code in place I was ready to find out where the extra 80ms that the GPU took compared to the CPU was coming from and how much time each of the GPU tasks took. First a baseline comparison to verify that the code was still the same and gave the same numbers.
So here’s the graph from before on the left, and here’s the new graph, which should be identical, on the right:
Wow! What’s happened here? All the CPU times are the same, as expected, but the GPU has suddenly closed the gap and now takes only a few ms extra – the 80ms gap has vanished. A diff of the two versions shows that the only change to the code is the addition of GPU timing – and that turns out to be why the GPU suddenly sped up. Directly after setting the device, sending a wakeup call to the GPU like this:
if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
cutilDeviceInit(argc, argv);
else
cudaSetDevice( cutGetMaxGflopsDeviceId() );
{
cudaEvent_t wakeGPU;
cutilSafeCall( cudaEventCreate( &wakeGPU) );
}
means that 80ms vanishes from the timed loop later in the code. Note that the variable is scoped so it isn’t used. Is the GeForce like a person – goes faster when it knows it is being watched?! Or is this some wakeup from a power saving mode, I’m not sure. This is the only extra code needed to cut 80ms from the timing which shows how tricky it is to time accurately on the ms scale, the slightest change can have a significant effect. It is always advisable to run tests on large volumes of data with a lot of loops to drown out one-off costs like this where possible. While on the topic of getting accurate performance readings note that all timing should be done on release code, particularly timing breakdowns as the SDK/common/cutil_readme.txt file states:
“These macros are compiled out in release builds and so they will not affect performance. Note that in debug mode they call cudaThreadSynchronize() to ensure that kernel execution has completed, which can affect performance.”
Well now that the extra 80ms has been eliminated what does our new GPU timing code show us about how the GPU spends its time? Here’s a chart showing the breakdown for a 16MB sample:
The majority of the time, and this holds for the other data sizes, is taken copying data back and forth. So experimentally it seems that the overhead in moving the data back and forth is quite significant. Of the 24.8ms required in total to process 16MB, 21.9ms were spent copying data. The actual processing takes almost no time. Running a variety of input sizes and timing each one can tell us what kind of bandwidth we are typically getting as shown in the table below where times are in ms:
| Copy: Host to Device | MB/s | Copy: Device to Host | MB/s | |
|---|---|---|---|---|
| 16MB | 9.0 | 1771.9 | 11.8 | 1359.3 |
| 32MB | 16.3 | 1966.0 | 22.2 | 1442.8 |
| 64MB | 30.6 | 2093.9 | 49.8 | 1285.4 |
| 128MB | 58.2 | 2198.2 | 83.9 | 1526.4 |
| 256MB | 114.9 | 2228.7 | 171.4 | 1493.4 |
We wanted to find how where the GPU was spending its time and now discovered that most of the time is in moving data back and forth. Can we now answer the question of where the GPU outperforms the CPU? Is 2GB/s the expected throughput? Well Nvidia provides a tool in the SDK to answer that – the “Bandwidth Test”. Running it through the provided GUI tool yields the following results:
Running on......
device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2152.6
Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1919.2
Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 48507.8
So we can see for 32MB, performance is roughly in line with the template results so that’s case closed … or is it? Two things give cause for concern:
1) PCIe 2.0 is theoretically capable of 500 MB/s per lane and with a x16 slot there are 16 lanes. So throughput should be up around 8GB/s, not the 2GB/s observed.
2) What exactly does "Host to Device Bandwidth for Pageable memory" in the bandwidth test results mean? Pageable memory?
So I found out that the bulk of the time was in data copying, first confirmed that the speeds observed were similar to those given in the Nvidia test suite and then raised new questions about whether we were getting everything out of the hardware given 2GB/s observed and 8GB/s theoretical. So now I need to confirm that my hardware really is PCIe 2.0 x16 and figure out what pageable memory is.
Read the rest of the series:
Getting Started with CUDA (1/3) – SDK template
Getting Started with CUDA (3/3) – Pageable and pinned memory
4 Comments
You need to be careful when measuring the execution time of the kernel in release mode, because kernel execution is asynchronous and the CPU will continue executing the application even if the kernel has not completed. You really should add a call to cudaThreadSynchronize() between the kernel launch and the timer call to ensure that you are measuring the actual execution time. It may not matter much here since the kernel you are using is so simple and should finish relatively quickly, but it can have a large impact on performance measurements for real applications.
I used both GPU and CPU timers. The GPU timer is synchronized to the kernel threads and does measure the actual execution time. The CPU timer measures only wall clock or total elapsed time in this example.
You’re right. I should have read your code more carefully. I think I saw the “cutStartTimer” at the beginning of the code and blindly assumed that you used that throughout.
Hello, I followed your codes and found it very useful! I add corresponding “cutilSafeCall( cudaEventRecord( rgGpuTimer_start[i], 0 ) );” and “cutilSafeCall( cudaEventRecord( rgGpuTimer_stop[i], 0 ) );” to each function call in the simpleCUFFT example of NV’s SDK. I see that cutilCheckError( cutStopTimer( rgTimer[0])); printf(”GPU Processing time: %f (ms)\n”, cutGetTimerValue( rgTimer[0])); gets the total time spent by all the threads. My question is that can I use cutGetTimerValue to get the time spent by each thread? Thanks!
One Trackback
[...] Cheesy Code Programming in C, C++ and C# Skip to content AboutPrivacy Loading… [-] Getting Started with CUDA (2/3) – How is the GPU spending its time? » [...]