Are GPUs For You?
Graphics processors (GPUs) are optimized for high throughput, offering
hundreds of hardware processing units (with performance up to 1 Tflop), deep
multithreading in each processing element (which allows the GPU to continue
doing useful work even when some threads are stalled), and high memory
bandwidth. These hardware capabilities were originally developed to accelerate
rendering of 3D models, but as programmers demanded greater programmability,
GPUs have adopted general-purpose instruction sets and can be programmed in a
fairly conventional C, C++, and Fortran style.
Today's high-end GPUs offer performance in a single card that would require a large cluster
if using only conventional multicore CPUs. Speedups of 100X
or more are commonly reported when accelerating an application with a GPU.
Speedups of 25X or more are common even compared to parallelization over a
quad-core CPU. However, the GPU architecture is better suited to some types of
applications than to others.
Some brief highlights of the programming model:
- Work is offloaded to the GPU by applying or "mapping" a function (called a kernel)
in parallel over a coordinate space (e.g. every point in a matrix).
- The kernel is expressed as the work of a single thread at each point in
the coordinate space. A thread is launched at every point in the
coordinate space.
- Because the GPU is a coprocessor on a separate PCI-Express card, data must first be explicitly copied from the system memory to the
memory on the GPU board.
- The GPU is organized as multiple SIMD groups. Within one SIMD group or
"warp" (32 threads with NVIDIA CUDA), all the processing elements execute the
same instruction in lockstep. Branching is allowed, but if threads within a
single warp follow divergent execution paths, there may be some performance
loss.
- Memory interfaces are wide and achieve highest bandwidth when that
access width is fully utilized. For applications that are memory
bound, this means that all threads in a warp should access adjacent data
elements when possible. For example, neighboring threads in a warp
should access neighboring elements in an array. This may require some
rearrangement of data layout or data access patterns.
- The GPU offers multiple memory spaces that can be used to exploit common
data-access patterns: in addition to the global memory, there are constant
memory (read-only, cached), texture memory (read-only, cached, optimized for
neighboring regions of an array), and per-block shared memory (a fast memory
space within each warp processor, managed explicitly by the programmer).
- Two main platforms exist for GPU programming: OpenCL, a new
industry-wide standard whose programming style resembles OpenGL; and CUDA
for C/C++/Fortran, which is specific to NVIDIA GPUs.
- The OpenCL/CUDA compilers do not magically transform sequential C code into
parallel CUDA code. The most important job for the programmer is to select
an algorithm and data structures that are amenable to parallel processing.
For example, mergesort or radix sort will be more effective than heapsort or
quicksort. Some programming effort is also required to write the necessary
CUDA kernel(s), as well as to add code that transfers data to the GPU, launches
the kernel(s), and then reads the results back from the GPU (a minimal sketch
of this structure follows this list).
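To make this structure concrete, here is a minimal CUDA sketch (a hypothetical
example, not drawn from any particular application): a scale-and-add kernel
mapped over a one-dimensional coordinate space, showing the explicit copy of
inputs to the GPU, the launch of one thread per element, and the copy of the
result back to the CPU. Note that neighboring threads access neighboring array
elements, which is the access pattern the wide memory interface rewards.

    // saxpy.cu -- hypothetical example; compile with: nvcc -o saxpy saxpy.cu
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // The kernel describes the work of ONE thread at one point in the
    // coordinate space: y[i] = a*x[i] + y[i].
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's coordinate
        if (i < n)                                      // guard the array bound
            y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host (CPU) data.
        float *h_x = (float *)malloc(bytes);
        float *h_y = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        // Data must be copied explicitly to the memory on the GPU board.
        float *d_x, *d_y;
        cudaMalloc((void **)&d_x, bytes);
        cudaMalloc((void **)&d_y, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

        // Launch one thread per element, grouped into blocks of 256 threads.
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        saxpy<<<blocks, threadsPerBlock>>>(n, 3.0f, d_x, d_y);

        // Read the results back from the GPU.
        cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", h_y[0]);   // expect 5.0

        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }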
What works well:
- Applications with kernels that have large numbers of parallel threads
(think thousands or more!).
- Applications in which data exchange among threads can be localized to
threads that are nearby in the kernel's coordinate space, allowing use of
the per-block shared memory.
- Applications with "data parallelism," in which many threads do similar
work across many data points. Loops are a common source of data
parallelism.
- Applications using hardware-supported transcendentals, such as
reciprocals and reciprocal square root. Note that you must turn on the
"fast-math" option, which allows use of these hardware-supported functions
(they are not precisely compliant with the IEEE floating-point standard); see
the example after this list.
- Applications which require a large amount of computation per data element
and/or make full use of the wide memory interface.
- Applications with infrequent synchronization.
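As an illustration of the fast-math point above (the file and kernel names
here are hypothetical), nvcc's --use_fast_math option substitutes fast hardware
approximations for single-precision operations such as sqrtf, sinf, expf, and
division:

    // rnorm.cu -- hypothetical kernel; compile with: nvcc --use_fast_math -c rnorm.cu
    // Computes the reciprocal length of 2D vectors.  With --use_fast_math, the
    // sqrtf call and the division below map to fast hardware approximations
    // (rsqrtf could also be called directly).
    __global__ void rnorm(int n, const float *x, const float *y, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 1.0f / sqrtf(x[i] * x[i] + y[i] * y[i]);
    }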
What doesn't work well:
- Applications with limited concurrency. Usually hundreds to
thousands of threads are required to utilize enough of the GPU's capacity to
see speedups.
- Irregular task parallelism--even with many threads, if all the threads
are doing different work, the GPU will not be efficiently utilized.
However, depending on the nature of the work and how often the threads must
coordinate, speedups are still possible.
- Frequent global synchronization. This requires a global barrier,
which is expensive.
- Applications with high degrees of point-to-point synchronization among
random threads. GPUs do not support this, generally requiring a global
barrier at each potential synchronization point. Restructuring of your
algorithm can often avoid this, however.
- Applications requiring frequent communication between the GPU and CPU.
This incurs significant overhead, more so if data transfer is involved.
Restructuring of your algorithm can often avoid this, however.
- Applications which require a small amount of computation (relative to
the amount of data transferred) before returning to the CPU, because any
improvement to the performance of the computation itself will be offset by the
overhead of transferring the data to the GPU's memory. For example, computing
the sum of two vectors, despite being an embarrassingly parallel operation,
will (except for extremely large vectors) be significantly slower when
offloaded to the GPU than when executed on the CPU, because of the memory
transfer overhead.
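As a rough, back-of-the-envelope illustration (the figures here are assumed,
order-of-magnitude numbers, not measurements): summing two vectors of 10
million single-precision floats moves roughly 120 MB across the PCI-Express
bus (two inputs in, one result out). At a sustained transfer rate of a few
GB/s, those transfers alone take on the order of 30 ms, whereas the additions
themselves are limited only by memory bandwidth and take roughly a millisecond
on the GPU and perhaps ten on the CPU, so the time spent moving the data swamps
any speedup in the computation itself.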
Other points to note:
- The GPU memory consistency model is relaxed (i.e. not sequentially
consistent). A fast hardware barrier within a warp processor allows threads on
a single warp processor to share data efficiently, especially if using the
per-block shared memory (a small sketch appears after this list). However, a
global fence is required before one warp processor can safely use results from
another warp processor.
- Peak computational throughput is achieved when all threads have roughly
equal execution time. A thread block cannot complete and release the
hardware for the next chunk of work until all threads in the thread block
are done.
- Peak memory bandwidth is achieved when threads in a warp access
neighboring regions of data. This is important because memory bandwidth can
be a bottleneck for applications that do very small amounts of work per data
element.
- For some examples of CUDA programs, see the
Rodinia or
CUDA Zone
websites. An analysis of potential speedups and performance
optimization considerations for a diverse set of application types can be
found
in this JPDC paper.
- For other programming tips, see the NVIDIA
CUDA and
OpenCL websites.
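As a small sketch of the per-block synchronization point above (the kernel
name is hypothetical, and a block size of 256 threads is assumed), threads
within one thread block can share data through per-block shared memory and
synchronize with the fast __syncthreads() barrier; combining the per-block
results across blocks would require a global fence or a second kernel launch:

    // Hypothetical per-block partial sum, illustrating shared memory and the
    // fast intra-block barrier.  Launch with 256 threads per block.
    __global__ void block_sums(const float *in, float *block_out, int n)
    {
        __shared__ float partial[256];              // per-block shared memory
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                            // all loads visible to the block

        // Tree reduction in shared memory; each step needs a barrier so that
        // every write is visible before the next round of reads.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            block_out[blockIdx.x] = partial[0];     // one partial sum per block
    }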
When is a GPU preferable to other ways of parallelizing (e.g. a cluster)?
- If you do not yet have a cluster: the cost of buying a GPU (or even, if
necessary, a new PC to host a high-end GPU) is an order of magnitude
lower than the cost of building a cluster. Setup, maintenance, and
upgrade costs are also drastically lower.
- A GPU system can be deployed at each user's desktop or benchtop.
- A cluster can be constructed with a GPU at each node.
For more information, please contact
Kevin Skadron.
Last updated 18 Oct. 2009