I’d added GPU based timing to my template code and found out that most of the time was spent copying data back and forth between the host and the device. The “Bandwidth Test” in the SDK gave roughly similar results although it mentioned something about pageable memory. But the big problem was the theoretical performance of PCIe 2.0 x16 far exceeded what I was seeing. So the first step was to confirm that both my graphics card and my motherboard supported and were using PCIe 2.0 x16. To do this I used CPU-Z and GPU-Z, with the following results:
So after confirming the hardware should have been capable of better speeds I took another look at the BandwidthTest. Running with the –help switch reveals several options:
C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win64\Release>bandwidthTest.exe --help
Usage: bandwidthTest [OPTION]...
Test the bandwidth for device to host, host to device, and device to device transfers
Example: measure the bandwidth of device to host pinned memory copies in the range 1024 Bytes
to 102400 Bytes in 1024 Byte increments
./bandwidthTest --memory=pinned --mode=range --start=1024 --end=102400 --increment=1024 --dtoh
Options:
--help Display this help menu
--csv Print results as a CSV
--device=[deviceno] Specify the device device to be used
all - compute cumulative bandwidth on all the devices
0,1,2,...,n - Specify any particular device to be used
--memory=[MEMMODE] Specify which memory mode to use
pageable - pageable memory
pinned - non-pageable system memory
--mode=[MODE] Specify the mode to use
quick - performs a quick measurement
range - measures a user-specified range of values
shmoo - performs an intense shmoo of a large range of values
--htod Measure host to device transfers
--dtoh Measure device to host transfers
--dtod Measure device to device transfers
--wc Allocate pinned memory as write-combined
--cputiming Force CPU-based timing always
Range mode options
--start=[SIZE] Starting transfer size in bytes
--end=[SIZE] Ending transfer size in bytes
--increment=[SIZE] Increment size in bytes
Particularly of interest is the “pinned” memory mode. Let’s try that:
C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win64\Release>bandwidthTest.exe --memory=pinned Running on...... device 0:GeForce 9800 GT Quick Mode Host to Device Bandwidth for Pinned memory . Transfer Size (Bytes) Bandwidth(MB/s) 33554432 5256.9 Quick Mode Device to Host Bandwidth for Pinned memory . Transfer Size (Bytes) Bandwidth(MB/s) 33554432 4891.6 Quick Mode Device to Device Bandwidth . Transfer Size (Bytes) Bandwidth(MB/s) 33554432 48498.6
and we see that this mode vastly improves the maximum throughput. Not sure why Nvidia didn’t make it the default option. Speeds are now up to 5GB/s. A short investigation of the code reveals that the timing isn’t quite analogous to the testing we are doing in the template code:
bandwidthTest.cu
56: // defines, project 57: #define MEMCOPY_ITERATIONS 10
as the bandwidthTest copies the same memory 10 times in a row as compared to the single copy we are doing. So we expect our performance to lag slightly behind this 5GB/s. Conveniently, all the code needed to use pinned memory is provided in the bandwidthTest, so putting it into a few wrapper functions called freeHost, mallocHost and memCpy yields:
////////////////////////////////////////////////////////////////////////////////
// Memory functions to switch between pinned and pageable memory as required
////////////////////////////////////////////////////////////////////////////////
cudaError
freeHost(void* h_mem, memoryMode memMode)
{
if( PINNED == memMode ) {
return cudaFreeHost(h_mem);
}
else {
free(h_mem);
}
return cudaSuccess;
}
cudaError
mallocHost(void** h_mem ,uint memSize, memoryMode memMode, bool wc)
{
if( PINNED == memMode ) {
#if CUDART_VERSION >= 2020
return cudaHostAlloc( h_mem, memSize, (wc) ? cudaHostAllocWriteCombined : 0 );
#else
if (wc) {printf("Write-Combined unavailable on CUDART_VERSION less than 2020, running is: %d", CUDART_VERSION);
return cudaMallocHost( h_mem, memSize );
#endif
}
else { // PAGEABLE memory mode
*h_mem = malloc( memSize );
}
return cudaSuccess;
}
cudaError
memCpy(void* sink, void* source, uint memSize, cudaMemcpyKind direction, memoryMode memMode)
{
if( PINNED == memMode ) {
return cudaMemcpyAsync( sink, source, memSize, direction, 0);
}
else {
return cudaMemcpy( sink, source, memSize, direction);
}
}
These functions take the same parameters as the existing functions with the addition of memory mode and for mallocHost whether or not the memory is Write Combined. Changing the allocation, copying and freeing over to these new functions allow use of pinned memory. Running the same test set shows that now the time is much more evenly spread between tasks:
and running the new numbers on the throughput we get:
| Copy: Host to Device | MB/s | Copy: Device to Host | MB/s | |
|---|---|---|---|---|
| 16MB | 3.2 | 5026.7 | 3.3 | 4878.0 |
| 32MB | 6.1 | 5242.5 | 6.5 | 4891.5 |
| 64MB | 12.2 | 5251.1 | 13.1 | 4871.7 |
| 128MB | 24.4 | 5247.6 | 26.2 | 4894.1 |
| 256MB | 48.9 | 5239.0 | 52.3 | 4894.7 |
So now the throughput approaches the theoretical limit and matches the best the bandwidthTest provides. The total times are down significantly and the GPU is faster on all tested sizes. The 256MB trial runs in 30% less time down from 340ms to 236ms.
The next challenge is to find where else time is lost. The pie charts show that most of the time is still spent in allocation and copying with very little in compute time so there’s no need to look at the kernel. We’ve already probably cut most of the time we can from the copying so that leaves allocation. A good idea would probably be to allocate the memory once and then use it over and over for multiple kernel executions, an intensive process like the kind Nvidia suggests are best suited for CUDA. But what if the code needs to be as shown, one kernel being run on one large set of data and then returning to another application? This is the kind of flow seen in Matlab MEX files where CUDA is used – Matlab passes the data through the C/C++ MEX file, which runs up a CUDA program gets the result and then returns to Matlab. Could parallel memory copies and allocations speed things up in this situation?
So we’ve switched the code over to use pinned memory in preference to pageable and attained the desired speedup in memory operations from 2GB/s to about 5GB/s. Theoretically PCIe 2.0 x16 should be able to hit 8GB/s and I don’t know why we aren’t able to achieve speeds closer to this number. If anyone knows please leave a comment or e-mail me. From here the next thing to investigate to get more throughput in the single kernel scenario is parallel allocations and copies.
Read the rest of the series:
Getting Started with CUDA (1/3) – SDK template
Getting Started with CUDA (2/3) – How is the GPU spending its time?
One Comment
My guess as to why NVIDIA chose to make paged memory the default option: allocating pinned memory is significantly slower than allocated paged memory. Thus, using pinned memory only makes sense for large transfers. Exactly how large is large enough most likely varies from system to system; I have observed a cutoff point of 16 MB. See here for more information:
http://www.cs.virginia.edu/~mwb7w/cuda_support/pinned_tradeoff.html