What are the capabilities of Nvidia’s CUDA running on the GPU and how does it compare to CPU performance? I bought a GeForce 9800GT and set about finding out, starting off by installing the CUDA drivers, toolkit and SDK from the Cuda Zone.
The first thing I noticed was that on my Vista64 machine the sample projects had been installed to:
C:\ProgramData\NVIDIA Corporation\NVIDIA CUDA SDK\projects
which is read only. Rather than fight with Vista’s UAC I copied everything into the C:\CUDA directory. To build the solution in VS2008 on my Vista 64 machine all I needed to do was switch the platform to x64, ignore the warning:
Command line warning D9035 : option 'Wp64' has been deprecated and will be removed in a future release
and everything was fine. The SDK’s sample template conveniently included both a gold (CPU) implementation of a function and a GPU implementation. An initial run of the template project showed that only the GPU section was timed. Since the reason to use CUDA is performance and I wanted a comparison, the first modification I made was to put a timer around the CPU implementation:
cutilCheckError( cutStartTimer( timer));
computeGold( reference, h_idata, num_threads); // reference solution
cutilCheckError( cutStopTimer( timer));
and raced them – but the results weren’t too inspiring:
GPU Processing time: 84.362747 (ms)
CPU Processing time: 0.001257 (ms)
The CPU solution wasn’t even threaded. I remembered the question of a student at the Stanford CUDA lecture on YouTube:
Q: Since there’s overhead in moving the data to the GPU how do you decide when it’s worthwhile?
A: Generally speaking it makes the most sense for large problems with high data intensity where you have to do multiple calculations per data element.
Hmm, the template code only processed 128 bytes with 32 threads so I had paid the setup costs and then not sent enough data to the GPU – no wonder the CPU was faster. So I needed to increase the data set, but there’s a problem with that since the provided kernel code assumes the entire data set will fit in shared memory and binds the size of the data to the thread count. There needed to be some changes. But you can’t just increase the number of threads or you’ll get:
cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <template.cu>, line 88 : invalid configuration argument.
First step was to find out what resources were available on the GPU, then I’d need to work out how to get at those resources. Running the SDK Device Query told me how much global and shared memory was available as well as how many threads I could use:
Device 0: "GeForce 9800 GT"
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 1073741824 bytes
Number of multiprocessors: 14
Number of cores: 112
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.50 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)
Some interesting numbers there, since the GeForce can perform both a FMUL (2 flops) and a FADD (1 flop) per clock, per processor, we can calculate the maximum theoretical Gflops attainable is 1.5 GHz * 112 * (2 + 1) = 504 Gflops. By way of comparison, the E8400 in my test machine has a peak of 24 Gflops according to Intel’s data sheet:
But back to the problem of pushing more data through. A few problems:
1) The data size needs to be uncoupled from the thread count which means a change to the GRID count from this:
// setup execution parameters
dim3 grid( 1, 1, 1);
dim3 threads( num_threads, 1, 1);
to something more like this:
cThreadsPerBlock = 64;
cBlocksPerGridx = 1024;
cBlocksPerGridy = 1024;
cData = cThreadsPerBlock * cBlocksPerGridx * cBlocksPerGridy;
dim3 grid ( cBlocksPerGridx, cBlocksPerGridy, 1);
dim3 block( cThreadsPerBlock, 1, 1);
where the counts of Blocks Per Grid in the x and y directions would need to be data derived. To simplify the example I’ve done it backwards and set the data size based on thread and block breakdown. These grid and block variables are then be passed to GPU using the triple angle bracket <<< >>> notation:
testKernel<<< grid, block, shared_mem_size >>>( d_idata, d_odata);
which is the same as:
testKernel<<< grid, 64, shared_mem_size >>> ( d_idata, d_odata);
because the passed argument is converted to a CUDA dim3 type which “is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.” from the programming guide.
Specifying a shared_mem_size on the kernel call as above allows you to specify the size at runtime. You can then pick up a reference to the memory in the kernel code with:
extern __shared__ float sdata[];
Alternatively if you know the size at compilation time you can also declare the shared memory inside the kernel like this:
__shared__ float sdata[256];
Which would mean the kernel call would be just be:
testKernel<<< grid, 64 >>> ( d_idata, d_odata);
2) The kernel code must loop through the grid. Calculate the thread id, block id and then global id to figure where in the global data we are up to. Pass the size of the data(int len) since num_threads is no longer coupled with the data length. The __umul24 in the code provides increased performance but comes with a warning: “Throughput of 32-bit integer multiplication is 2 operations per clock cycle, but __mul24 and __umul24 provide signed and unsigned 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures however, __[u]mul24 will be slower than 32-bit integer multiplication”.
__global__ void
testKernel( float* g_idata, float* g_odata, int len)
{
// shared memory
// the size is determined by the host application
extern __shared__ float sdata[];
// thread id
const unsigned int tid = threadIdx.x;
// block id
const unsigned int bid = __umul24(gridDim.x, blockIdx.y) + blockIdx.x ;
// global memory id
const unsigned int gid = tid + __umul24(blockDim.x, bid);
const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y),blockDim.z);
3) The kernel needs to read from global memory and then synchronise across threads, this causes the threads across warps to sync and thus presents a consistent shared memory picture. So now thread 0 can read from SDATA(1) and will see the data which thread 1 loaded. A call to __syncthreads() is only needed when the count of threads per block exceed the warpSize because as mentioned in the performance optimisation whitepaper: "Instructions are SIMD synchronous within a warp". Of course every call has a cost and the programming guide states that “throughput for __syncthreads is 8 operations per clock cycle in the case where no thread has to wait for any other threads."
None of this is important in the sample template code because there is no communication between threads, thus no need for shared memory or thread syncing – a situation in which registers would normally be used but in this case shared memory has presumably been used by Nvidia for example purposes.
const unsigned int cThreadsPerBlock = __umul24(__umul24(blockDim.x, blockDim.y),blockDim.z);
SDATA(tid) = g_idata[tid];
if (cThreadsPerBlock > warpSize) __syncthreads();
At this point I had revised the template to time the CPU for comparison, remove the size restrictions to allow a decent amount of data to be pushed through and was ready to attempt to answer the question – given the overhead of pushing the data to the GPU, when it is worth doing so? Running the code gave some unexpected answers. Keeping the thread count constant I varied the cBlocksPerGridy to yield various data sizes:
The GPU and CPU seemed to take the same amount of time with different data loads but the GPU was hampered by a constant overhead of 80ms, the exact same difference I noted when only 128 bytes were trialled in the very first instance before any modification. Where was the time going? Some sort of setup cost? Also how much was being taken in the kernel and how much in the data transfer? I needed more fine grained data to see what was going on.