As a compute device, a GPU is somewhat similar to a multi-core CPU, albeit with two fundamental differences. The first is that GPUs emphasize massively threaded throughput and SIMD performance, rather than the latency of a single instruction stream. The second difference is how CPUs and GPUs schedule threads. Conventional multi-core CPUs are scheduled pre-emptively by the operating system, which tracks threads and assigns them to available hardware contexts. GPUs generally use dedicated hardware to schedule threads co-operatively. The GPU receives commands from the driver and system, and splits thread and instruction scheduling between a control processor and the front-ends of each individual core.
There are also numerous pieces of fixed function hardware in the GPU that are tailored to the 3D pipeline. Tessellation hardware can be used to amplify the geometric detail of a model or scene. The triangle setup engine takes geometric primitives (usually some combination of lines and triangles), adds additional information and transforms them from 3D space into 2D screen space. The rasterizer then converts these triangles into pixels. The texturing units reside in each core and include special interpolation hardware for filtering (e.g. bi-linear, tri-linear and anisotropic). The Raster Operation Pipelines (ROPs) live with the memory controllers and contain hardware for blending and multi-sample anti-aliasing. Last, there is the blitter, which combines the render targets into the final screen image. While most of this fixed function hardware is not relevant to computational applications of a GPU, some of it can be used for very specific applications. For example, the special purpose texture filtering hardware can be useful for certain image analysis applications. In situations where an application can use fixed function hardware, the performance benefits of a GPU are even greater.
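To make concrete what the texture units' interpolation hardware computes, here is a minimal software sketch of bi-linear filtering: a weighted average of the four texels nearest a fractional sample coordinate. This is an illustrative model of the operation, not the hardware's actual datapath.

```python
def bilinear_sample(tex, u, v):
    """Sample a 2D texture (list of rows of floats) at fractional
    coordinates (u, v) using bi-linear interpolation."""
    x0, y0 = int(u), int(v)            # integer texel coordinates
    fx, fy = u - x0, v - y0            # fractional weights
    x1 = min(x0 + 1, len(tex[0]) - 1)  # clamp neighbours at the edge
    y1 = min(y0 + 1, len(tex) - 1)
    # Blend horizontally along the top and bottom texel pairs,
    # then blend the two results vertically.
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bot = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bot * fy

tex = [[0.0, 1.0],
       [1.0, 2.0]]
print(bilinear_sample(tex, 0.5, 0.5))  # 1.0, the average of all four texels
```

Performing this blend in dedicated hardware, for free alongside each texture fetch, is what makes the filtering units attractive for image resampling and analysis workloads.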
Cayman includes two dispatch processors, which are responsible for managing the currently executing kernels and scheduling wavefronts onto the general purpose cores and fixed function hardware. Each dispatch processor is responsible for half of the SIMD array and also contains a dedicated tessellator, triangle setup engine and rasterizer. For graphics workloads, each dispatch processor is assigned to one of the two graphics engines. Compute kernels are partitioned round-robin between the two dispatch processors. Pixel kernels are tiled according to screen space, while other kernel types (e.g. vertex, geometry, hull and domain) are also distributed round-robin. Each dispatch processor can have 248 wavefronts in-flight, for a total of 496 wavefronts and roughly 31K work-items across the GPU at any given time. However, other constraints within the cores may reduce the actual numbers. The dispatch processors each contain an 8KB instruction cache and 24KB of constant cache.
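The work-item total follows directly from the wavefront counts, assuming AMD's 64-wide wavefronts:

```python
WAVEFRONT_SIZE = 64     # work-items per wavefront on AMD GPUs
PER_DISPATCHER = 248    # in-flight wavefronts per dispatch processor
DISPATCHERS = 2         # Cayman has two dispatch processors

wavefronts = PER_DISPATCHER * DISPATCHERS   # 496 wavefronts GPU-wide
work_items = wavefronts * WAVEFRONT_SIZE    # 31,744, i.e. ~31K work-items
print(wavefronts, work_items)               # 496 31744
```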
A dispatch processor has an available pool of up to 248 wavefronts, and will select two ready wavefronts to dispatch to each core for execution. Wavefronts are scheduled based on programmable, dynamic prioritization between different kernel types (e.g. vertex vs. pixel vs. geometry vs. compute) and age-based priority within a kernel type. In a full Cayman, that means about a tenth of the wavefronts can be executing simultaneously. This theoretical ratio is related both to the pipeline length for execution and to the latency of a trip to memory. The goal of having so much work in-flight is to hide the rather high GDDR5 memory access latency and maximize available bandwidth.
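The "about a tenth" figure can be checked with a little arithmetic, assuming a full Cayman's 24 SIMD cores (as in the HD 6970):

```python
CORES = 24              # SIMD cores in a full Cayman (assumed: HD 6970 config)
EXECUTING_PER_CORE = 2  # ready wavefronts dispatched to each core
POOL = 2 * 248          # total in-flight wavefront pool across both dispatchers

executing = CORES * EXECUTING_PER_CORE  # 48 wavefronts actually executing
ratio = executing / POOL                # 48 / 496 ~= 0.097, roughly a tenth
print(executing, round(ratio, 3))
```

The remaining ~90% of wavefronts sit ready in the pool, waiting on memory or the pipeline, which is precisely the slack the scheduler uses to hide GDDR5 latency.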
The Cayman dispatch processors were enhanced to take advantage of task level parallelism, in addition to the abundant data level parallelism. Cayman is designed so that multiple applications can simultaneously send commands to the GPU. Each application has a separate command queue (containing the application’s kernels), and the address spaces are also protected from each other. While previous GPUs could have multiple applications executing on the GPU, there was only a single command stream and applications were serialized within it. The result is that applications can share the GPU more effectively and avoid resource contention caused by applications with a large command stream payload. Multiple command queues are also essential for exploiting task level parallelism in OpenCL applications.