SIMD Execution Units
The execution units in each SIMD lane of AMD’s architecture are arranged as a VLIW and used to execute the various clause types. The overall design dates back to the R600, which debuted in 2007. There have been incremental refinements along the way, supporting new instructions and increasing IPC. For example, the RV770 included double precision floating point, and Cypress added a number of DX11 bit-level instructions (count, insert, extract), a sum of absolute differences instruction for video encoding, and a fused multiply-add.
With Cayman, the VLIW has been fundamentally re-architected to more closely match general purpose workloads and take a step away from the singular focus on graphics. The original design was a VLIW5, with four symmetric XYZW pipelines, a fifth unit that also handles transcendental and data type conversion operations (the T-unit) and a branch unit. Cayman’s VLIW4 eliminates the T-unit and enhances the XYZW pipelines to handle all operations.
Figure 3 – Cayman SIMD Execution Pipelines and Comparisons
AMD’s VLIW pipelines are a multi-precision, staggered design that can bypass results between the pipelines. The operations in a VLIW bundle can be independent (just like a 4-wide SIMD), but they do not have to be. Starting with the Cypress microarchitecture, two pairs of serially dependent instructions can be packed into a single VLIW bundle (e.g. 2 pairs of 2 dependent additions), although the compiler does not yet take advantage of this often. The forwarding capability is also used for reduction operations, such as a dot product or taking the maximum within a VLIW bundle, and to microcode operations that are not natively supported by the hardware execution resources, a technique that is essential for Cayman.
Every cycle, a quarter of a wavefront will execute across all 16 VLIW lanes, using the previously gathered virtual registers for input. The XYZW pipelines (and the T pipeline in Cypress) contain execution units for integer and floating point operations. The integer execution units work with both normal 32-bit data types and the shorter 24-bit integers commonly found in graphics. All of the pipelines in Cayman and Cypress (XYZW and, for Cypress, T) can execute 24-bit multiply-add and 32-bit add operations. However, the entire VLIW can only execute a single 32-bit integer multiply, a single 32-bit integer multiply-add, or a single 64-bit integer operation at a time.
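The issue cadence described above can be sketched with a little arithmetic. This is only an illustration: the 16 lanes and the quarter-wavefront-per-cycle rate come from the text, while the function names are hypothetical and the wavefront size is derived from those two figures rather than stated.

```python
# Figures taken from the text.
LANES_PER_SIMD = 16        # VLIW lanes in one SIMD
QUARTERS_PER_WAVEFRONT = 4 # a quarter of a wavefront issues each cycle


def wavefront_size(lanes=LANES_PER_SIMD, quarters=QUARTERS_PER_WAVEFRONT):
    """Work-items per wavefront implied by the lane count and issue rate."""
    return lanes * quarters


def issue_schedule(size, lanes=LANES_PER_SIMD):
    """Return, for each cycle, the work-item indices that occupy the lanes."""
    return [list(range(start, min(start + lanes, size)))
            for start in range(0, size, lanes)]


size = wavefront_size()
schedule = issue_schedule(size)
for cycle, items in enumerate(schedule):
    print(f"cycle {cycle}: work-items {items[0]}..{items[-1]}")
```

Under these assumptions a single VLIW bundle takes four cycles to pass over a whole wavefront, one group of 16 work-items per cycle.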
Over time, AMD GPUs, starting with the RV670, have moved towards full IEEE compliance. Cypress was the first to enjoy complete support for double and single precision IEEE floating point. Each pipeline (XYZW and T) can execute a single precision multiply-add, applying standard rounding to the result of the multiplication before passing it to the addition. In Cypress, only the 4 main pipelines can execute a fused multiply-add (FMA), which rounds only the final result. So while a Cayman SIMD lost a little performance for separate add, multiply or multiply-add operations, its FMA performance is the same as Cypress. Cayman and Cypress both use multiple operations for double precision floating point instructions, and the newer VLIW4 has identical performance to the older VLIW5 despite losing the T-unit: both can execute 2 double precision adds, 1 multiply or 1 FMA.
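The difference between a multiply-add that rounds twice and a fused multiply-add that rounds once can be seen in a small numerical sketch. It uses Python's double precision floats rather than the single precision values discussed above, and it emulates the fused operation with exact rational arithmetic; the function names `mad` and `fma` are chosen for illustration and are not hardware instructions.

```python
from fractions import Fraction


def mad(a, b, c):
    """Multiply-add with two roundings: after the multiply and after the add."""
    product = a * b          # rounded to the nearest double here
    return product + c       # rounded again here


def fma(a, b, c):
    """Emulated fused multiply-add: compute exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)      # single rounding to the nearest double


# Inputs chosen so the exact product, 1 - 2**-60, is not representable
# as a double and rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(mad(a, b, c))  # 0.0: the low-order bits were lost in the first rounding
print(fma(a, b, c))  # equals -(2.0 ** -60): the low-order bits survive
```

The intermediate rounding in the unfused version discards exactly the information that the final subtraction would have exposed, which is why the two results differ.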
However, the biggest difference is the execution of transcendental instructions, which previously used the T-unit in the VLIW5. The T-unit included a dedicated lookup table for computing common transcendental functions used in graphics, such as logarithms, square roots, reciprocals and trigonometric functions. The Cayman VLIW4 replaces the T-unit by distributing several smaller lookup tables to each of the XYZW pipelines. Transcendental instructions are microcoded across three of the four pipelines, using a 3 term Lagrange polynomial interpolation. Thus Cayman VLIWs have comparable transcendental performance (one instruction per clock), accepting higher ALU utilization in order to save area. In theory, this could decrease performance for a workload that packed 1 transcendental and 4 ALU ops into every VLIW bundle, since Cayman can only pack 1 additional ALU op alongside a transcendental. However, the reality is that most workloads do not come close to fully utilizing every slot in the VLIW bundles, so in some sense this approach is ‘free’.
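To give a feel for how a small lookup table combined with a 3 term Lagrange interpolation can approximate a transcendental function, here is a minimal software sketch. It is not AMD's microcode: the choice of log2 on the interval [1, 2), the 32-interval table, the equally spaced nodes and all names are assumptions made purely for illustration.

```python
import math

# Hypothetical table geometry, chosen only for this example.
TABLE_BITS = 5
INTERVALS = 1 << TABLE_BITS          # 32 intervals across [1, 2)
STEP = 1.0 / INTERVALS
# Two extra entries so every interval has three nodes to interpolate through.
TABLE = [math.log2(1.0 + i * STEP) for i in range(INTERVALS + 2)]


def lagrange3(t, y0, y1, y2):
    """Quadratic Lagrange polynomial through nodes at t = 0, 1 and 2."""
    l0 = (t - 1.0) * (t - 2.0) / 2.0
    l1 = -t * (t - 2.0)
    l2 = t * (t - 1.0) / 2.0
    return y0 * l0 + y1 * l1 + y2 * l2


def approx_log2(x):
    """Approximate log2(x) for 1 <= x < 2 from the table."""
    i = min(int((x - 1.0) / STEP), INTERVALS - 1)   # table index
    t = (x - (1.0 + i * STEP)) / STEP               # position within interval
    return lagrange3(t, TABLE[i], TABLE[i + 1], TABLE[i + 2])


worst = max(abs(approx_log2(1.0 + k / 1000.0) - math.log2(1.0 + k / 1000.0))
            for k in range(1000))
print(f"worst absolute error over 1000 samples: {worst:.2e}")
```

The three basis terms are independent of one another until the final sum, which loosely mirrors how the work could be spread over three pipelines, though the actual division of labor in the hardware is not described in the text.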
Removing the T-unit has a number of advantages. From a software perspective, the symmetric VLIW4 is dramatically simpler for the compiler to schedule and manage than the asymmetrical VLIW5. The mapping between the register file ports and the VLIW bundles is also much more straightforward. When looking at the hardware, removing the T-unit simplifies the control and data path routing in the VLIW. Additionally, it saves area by using separate smaller lookup tables for approximation, rather than a single large table. This in turn increases performance, since AMD was able to add more SIMDs. The aggregate performance for Cayman is quite impressive: each of the 24 SIMDs is capable of 128 single precision or 32 double precision FLOP per cycle, for a total of 2.7 TFLOP/s (675 GFLOP/s double precision) on the high-end model. Note that the double precision performance is one quarter of single precision, yet another example of how AMD is emphasizing graphics. Nvidia’s Fermi has additional hardware so that double precision performance is half of single precision.
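The peak throughput figures can be reproduced with a short calculation. The SIMD count and per-cycle FLOP rates come from the text; the 880 MHz core clock is an assumption that does not appear in the text (it is the clock of the high-end Radeon HD 6970), and the helper name is hypothetical.

```python
SIMDS = 24                  # from the text
LANES_PER_SIMD = 16         # from the text
PIPELINES_PER_LANE = 4      # XYZW
FLOP_PER_MULTIPLY_ADD = 2   # one multiply plus one add
CORE_CLOCK_HZ = 880e6       # ASSUMPTION: high-end model clock, not in the text

# Single precision: every pipeline issues one multiply-add per cycle.
sp_flop_per_simd = LANES_PER_SIMD * PIPELINES_PER_LANE * FLOP_PER_MULTIPLY_ADD
# Double precision: one FMA per VLIW lane per cycle.
dp_flop_per_simd = LANES_PER_SIMD * 1 * FLOP_PER_MULTIPLY_ADD


def peak_flops(simds, flop_per_simd_per_cycle, clock_hz):
    """Peak FLOP per second for the whole chip."""
    return simds * flop_per_simd_per_cycle * clock_hz


sp_peak = peak_flops(SIMDS, sp_flop_per_simd, CORE_CLOCK_HZ)
dp_peak = peak_flops(SIMDS, dp_flop_per_simd, CORE_CLOCK_HZ)
print(f"single precision: {sp_peak / 1e12:.2f} TFLOP/s")
print(f"double precision: {dp_peak / 1e9:.2f} GFLOP/s")
print(f"ratio: {sp_peak / dp_peak:.0f}:1")
```

With the assumed clock, the results line up with the quoted 2.7 TFLOP/s and 675 GFLOP/s, and the 4:1 ratio falls directly out of four multiply-adds versus one FMA per lane per cycle.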
Once a VLIW bundle has executed, the results from each operation are written into PV, the virtual vector register for the most recently executed bundle. In Cypress, there was also a separate PS register for the output of the T-unit, but that has disappeared along with the T-unit. The architecturally visible output register (PV) behaves like a software controlled write buffer, a technique that is reasonably common within the world of VLIW architectures. The PV registers are used to defer the writeback into the register file while still allowing the next bundle to access the data.
The last unit worth mentioning is the oft-overlooked branch unit, which is used to execute control flow instructions. Note that control flow and predication are fundamentally used for two different purposes. Predication is used to manage control divergence within a wavefront by suppressing lanes in a SIMD. Control flow is actually used to change the program counter. The branch unit is fairly simple and mostly used for comparisons in conditional statements. Cayman also added new instructions for case-statements, to avoid using a series of nested if-statements.
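The distinction between predication and control flow can be illustrated with a toy model of an if/else executed by a group of lanes. This is a conceptual sketch only, not AMD's ISA or hardware behavior: the function name, the mask representation and the trace list are all hypothetical.

```python
def run_if_else(values, cond, then_op, else_op):
    """Run an if/else over a group of lanes.

    Predication: a per-lane mask suppresses lanes that did not take a path.
    Control flow: a path is skipped entirely when no lane needs it.
    Returns the per-lane results and the list of paths actually executed.
    """
    mask = [cond(v) for v in values]
    out = list(values)
    executed = []

    if any(mask):                      # control flow: enter the 'then' path?
        executed.append("then")
        # Predication: only lanes with the mask set take the result.
        out = [then_op(v) if m else o for v, m, o in zip(values, mask, out)]

    if not all(mask):                  # control flow: enter the 'else' path?
        executed.append("else")
        # Predication: only lanes with the mask clear take the result.
        out = [else_op(v) if not m else o for v, m, o in zip(values, mask, out)]

    return out, executed


# Divergent lanes: both paths run, each with some lanes suppressed.
print(run_if_else([1, -2, 3, -4], lambda v: v > 0, lambda v: v * 2, lambda v: -v))
# Uniform lanes: the 'else' path is skipped altogether.
print(run_if_else([1, 2], lambda v: v > 0, lambda v: v * 2, lambda v: -v))
```

When the lanes diverge, both paths cost execution time even though each lane keeps only one result; when they agree, the change in program flow avoids the unused path entirely.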