Are GPUs For You?
Graphics processors (GPUs) are optimized for high throughput, offering
hundreds of hardware processing units (with performance up to 1 Tflop), deep
multithreading in each processing element (which allows the GPU to continue
doing useful work even when some threads are stalled), and high memory
bandwidth. These hardware capabilities were originally developed to accelerate
rendering of 3D models, but as programmers demanded greater programmability,
GPUs have adopted general-purpose instruction sets and can be programmed in a
fairly conventional C, C++, and Fortran style.
Today's high-end GPUs offer performance in a single card that would require a large cluster
if using only conventional multicore CPUs. Speedups of 100X
or more are commonly reported when accelerating an application with a GPU.
Speedups of 25X or more are common even compared to parallelization over a
quad-core CPU. However, the GPU architecture is better suited to some types of
applications than to others.
Some brief highlights of the programming model:
- Work is offloaded to the GPU by applying or "mapping" a function (called a kernel)
in parallel over a coordinate space (e.g. every point in a matrix).
- The kernel is expressed as the work of a single thread at each point in
the coordinate space. A thread is launched at every point in the
coordinate space.
- Because the GPU is a coprocessor on a separate PCI-Express card, data must first be explicitly copied from the system memory to the
memory on the GPU board.
- The GPU is organized as multiple SIMD groups. Within one SIMD group or
"warp" (32 threads with NVIDIA CUDA), all the processing elements execute the
same instruction in lockstep. Branching is allowed, but if threads within a
single warp follow divergent execution paths, there may be some performance
loss.
- Memory interfaces are wide and achieve highest bandwidth when that
access width is fully utilized. For applications that are memory
bound, this means that all threads in a warp should access adjacent data
elements when possible. For example, neighboring threads in a warp
should access neighboring elements in an array. This may require some
rearrangement of data layout or data access patterns.
- The GPU offers multiple memory spaces that can be used to exploit common
data-access patterns: in addition to the global memory, there are constant
memory (read-only, cached), texture memory (read-only, cached, optimized for
neighboring regions of an array), and per-block shared memory (a fast memory
space within each warp processor, managed explicitly by the programmer).
- Two main platforms exist for GPU programming: OpenCL, a new
industry-wide standard whose programming style resembles OpenGL; and CUDA
for C/C++/Fortran, which is specific to NVIDIA GPUs.
- The OpenCL/CUDA compilers do not magically transform sequential C code into
parallel CUDA code. The most important job for the programmer is to select
an algorithm and data structures that are amenable to parallel processing.
For example, mergesort or radix sort will be more effective than heapsort or
quicksort. Some programming effort is also required to write the necessary
CUDA kernel(s), as well as to add code that transfers data to the GPU, launches
the kernel(s), and then reads the results back from the GPU (a minimal sketch
of this structure follows this list).
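To make this structure concrete, here is a minimal CUDA sketch (a hypothetical
example, not drawn from any particular application): a scale-and-add kernel
mapped over a one-dimensional coordinate space, showing the explicit copy of
inputs to the GPU, the launch of one thread per element, and the copy of the
result back to the CPU. Note that neighboring threads access neighboring array
elements, which is the access pattern the wide memory interface rewards.

    // saxpy.cu -- hypothetical example; compile with: nvcc -o saxpy saxpy.cu
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // The kernel describes the work of ONE thread at one point in the
    // coordinate space: y[i] = a*x[i] + y[i].
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's coordinate
        if (i < n)                                      // guard the array bound
            y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host (CPU) data.
        float *h_x = (float *)malloc(bytes);
        float *h_y = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        // Data must be copied explicitly to the memory on the GPU board.
        float *d_x, *d_y;
        cudaMalloc((void **)&d_x, bytes);
        cudaMalloc((void **)&d_y, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

        // Launch one thread per element, grouped into blocks of 256 threads.
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        saxpy<<<blocks, threadsPerBlock>>>(n, 3.0f, d_x, d_y);

        // Read the results back from the GPU.
        cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", h_y[0]);   // expect 5.0

        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }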
What works well:
- Applications with kernels that have large numbers of parallel threads
(think thousands or more!).
- Applications in which data exchange among threads can be localized to
threads that are nearby in the kernel's coordinate space, allowing use of
the per-block shared memory.
- Applications with "data parallelism," in which many threads do similar
work across many data points. Loops are a common source of data
parallelism.
- Applications using hardware-supported transcendentals, such as
reciprocals and reciprocal square root. Note that you must turn on the
"fast-math" option, which allows use of these hardware-supported functions
(they are not precisely compliant with the IEEE floating-point standard); see
the example after this list.
- Applications which require a large amount of computation per data element
and/or make full use of the wide memory interface.
- Applications with infrequent synchronization.
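As an illustration of the fast-math point above (the file and kernel names
here are hypothetical), nvcc's --use_fast_math option substitutes fast hardware
approximations for single-precision operations such as sqrtf, sinf, expf, and
division:

    // rnorm.cu -- hypothetical kernel; compile with: nvcc --use_fast_math -c rnorm.cu
    // Computes the reciprocal length of 2D vectors.  With --use_fast_math, the
    // sqrtf call and the division below map to fast hardware approximations
    // (rsqrtf could also be called directly).
    __global__ void rnorm(int n, const float *x, const float *y, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 1.0f / sqrtf(x[i] * x[i] + y[i] * y[i]);
    }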
What doesn't work well:
- Applications with limited concurrency. Usually hundreds to
thousands of threads are required to utilize enough of the GPU's capacity to
see speedups.
- Irregular task parallelism--even with many threads, if all the threads
are doing different work, the GPU will not be efficiently utilized.
However, depending on the nature of the work and how often the threads must
coordinate, speedups are still possible.
- Frequent global synchronization. This requires a global barrier,
which is expensive.
- Applications with high degrees of point-to-point synchronization among
random threads. GPUs do not support this, generally requiring a global
barrier at each potential synchronization point. Restructuring of your
algorithm can often avoid this, however.
- Applications requiring frequent communication between the GPU and CPU.
This incurs significant overhead, more so if data transfer is involved.
Restructuring of your algorithm can often avoid this, however.
- Applications which require a small amount of computation (relative to
the amount of data transferred) before returning to the CPU, because any
improvement to the performance of the computation itself will be offset by the
overhead of transferring the data to the GPU's memory. For example, computing
the sum of two vectors, despite being an embarrassingly parallel operation,
will (except for extremely large vectors) be significantly slower when
offloaded to the GPU than when executed on the CPU, because of the memory
transfer overhead.
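As a rough, back-of-the-envelope illustration (the figures here are assumed,
order-of-magnitude numbers, not measurements): summing two vectors of 10
million single-precision floats moves roughly 120 MB across the PCI-Express
bus (two inputs in, one result out). At a sustained transfer rate of a few
GB/s, those transfers alone take on the order of 30 ms, whereas the additions
themselves are limited only by memory bandwidth and take roughly a millisecond
on the GPU and perhaps ten on the CPU, so the time spent moving the data swamps
any speedup in the computation itself.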
Other points to note:
- The GPU memory consistency model is relaxed (i.e. not sequentially
consistent). A fast hardware barrier within a warp processor allows threads on
a single warp processor to share data efficiently, especially if using the
per-block shared memory (a small sketch appears after this list). However, a
global fence is required before one warp processor can safely use results from
another warp processor.
- Peak computational throughput is achieved when all threads have roughly
equal execution time. A thread block cannot complete and release the
hardware for the next chunk of work until all threads in the thread block
are done.
- Peak memory bandwidth is achieved when threads in a warp access
neighboring regions of data. This is important because memory bandwidth can
be a bottleneck for applications that do very small amounts of work per data
element.
- For some examples of CUDA programs, see the
Rodinia or
CUDA Zone
websites. An analysis of potential speedups and performance
optimization considerations for a diverse set of application types can be
found
in this JPDC paper.
- For other programming tips, see the NVIDIA
CUDA and
OpenCL websites.
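As a small sketch of the per-block synchronization point above (the kernel
name is hypothetical, and a block size of 256 threads is assumed), threads
within one thread block can share data through per-block shared memory and
synchronize with the fast __syncthreads() barrier; combining the per-block
results across blocks would require a global fence or a second kernel launch:

    // Hypothetical per-block partial sum, illustrating shared memory and the
    // fast intra-block barrier.  Launch with 256 threads per block.
    __global__ void block_sums(const float *in, float *block_out, int n)
    {
        __shared__ float partial[256];              // per-block shared memory
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                            // all loads visible to the block

        // Tree reduction in shared memory; each step needs a barrier so that
        // every write is visible before the next round of reads.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            block_out[blockIdx.x] = partial[0];     // one partial sum per block
    }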
When is a GPU preferable to other ways of parallelizing (e.g. a cluster)?
- If you do not yet have a cluster: the cost of buying a GPU (or even, if
necessary, a new PC to host a high-end GPU) is an order of magnitude
lower than the cost of building a cluster. Setup, maintenance, and
upgrade costs are also drastically lower.
- A GPU system can be deployed at each user's desktop or benchtop.
- A cluster can be constructed with a GPU at each node.
For more information, please contact
Kevin Skadron.
Last updated 18 Oct. 2009