SIMD Front-end and Registers
For computation, the heart of Cayman is the array of programmable SIMDs (or cores), which actually execute the various kernels. The cores in Cayman changed significantly relative to previous generations. Each core is a 16-wide SIMD processor. In Cayman, each of the 16 lanes is a symmetric VLIW4, and every lane executes the same VLIW bundle. Previous generations used a VLIW5 lane, with a specialized fifth pipeline for transcendental operations and some arithmetic.
Since VLIW bundles are statically scheduled, no instruction scheduling hardware is needed within a SIMD. This is a direct contrast to Nvidia’s SM design, which requires scheduling logic for scoreboarding and for resolving dependencies between the two issue ports and the available hardware resources. For example, Fermi’s schedulers handle contention for the shared load-store unit and special function unit, and also coordinate executing 64-bit instructions across both execution pipelines. While AMD’s approach is vastly more power and area efficient, it places a far greater burden on the compiler, which must find substantial instruction level parallelism (ILP) within each work-item. Static scheduling is less flexible and makes it more difficult to achieve peak performance on AMD GPUs, particularly for general purpose workloads. It is less of an issue for graphics, where the inherent parallelism means that finding ILP is relatively easy.
Each SIMD can have up to 8 work-groups in-flight, each from a different kernel. Since each work-group is 1-4 wavefronts, this translates into as many as 32 wavefronts. However, the actual number of wavefronts will depend on the resources needed. There are fixed resources per SIMD (e.g. registers, local data share), so resource intensive wavefronts will reduce occupancy. When the dispatch processor schedules two wavefronts for execution on a SIMD, they will run to completion – stalling if necessary. When a SIMD switches to the next clause in the kernel, it takes roughly 40 cycles; this latency is hidden by having multiple wavefronts on each SIMD.
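As a back-of-the-envelope illustration of the work-group limit described above, a minimal Python sketch (the constants come from the text; the function name and structure are ours, not AMD's):

```python
# Upper bound on in-flight wavefronts per Cayman SIMD from the
# work-group limit alone, ignoring register and LDS pressure.

MAX_WORKGROUPS = 8        # in-flight work-groups per SIMD
MAX_WAVES_PER_GROUP = 4   # a work-group is 1-4 wavefronts

def max_wavefronts(waves_per_group: int) -> int:
    """Wavefront ceiling implied by the work-group limit."""
    assert 1 <= waves_per_group <= MAX_WAVES_PER_GROUP
    return MAX_WORKGROUPS * waves_per_group

print(max_wavefronts(4))  # 8 work-groups x 4 wavefronts = 32
print(max_wavefronts(1))  # 8
```

In practice the achievable count is the minimum of this bound and whatever the register file and local data share can sustain.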
ALU wavefronts take 8 cycles to execute. The first 4 cycles are for reading the register file, one quarter-wavefront at a time. The second 4 cycles are for actually executing the operations, again a quarter-wavefront at a time. To hide the 8 cycle back-to-back latency (most CPUs have only a single cycle), two separate ALU wavefronts (even and odd) execute in an interleaved fashion. First one wavefront accesses the register file while the other executes; then they switch. This alternation continues until one finishes and is replaced by another wavefront. It is conceptually similar to fine-grained multi-threading, where two threads switch every 4 cycles, but never simultaneously execute.
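The even/odd interleave can be sketched as a toy steady-state schedule in Python (this assumes both wavefronts are already resident and warmed up; the names are illustrative, not hardware terminology):

```python
# Toy model of the even/odd wavefront interleave: each wavefront
# alternates 4 cycles of register reads with 4 cycles of execution,
# and the two phases are offset so the ALUs stay busy every cycle.
# (Steady state only; at startup the odd wavefront would first need
# its own read phase before it could execute.)

def schedule(cycles: int):
    """Return the (even, odd) activity per cycle; phases swap every 4 cycles."""
    timeline = []
    for c in range(cycles):
        if (c // 4) % 2 == 0:
            timeline.append(("even:read", "odd:execute"))
        else:
            timeline.append(("even:execute", "odd:read"))
    return timeline

for c, (even, odd) in enumerate(schedule(8)):
    print(c, even, odd)
```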
Cayman and Cypress both have split instruction caches. Each clause type is relatively homogeneous, and one of the benefits of partitioning the instruction stream into clauses is that each clause type can have a separate instruction cache. These different types of instruction caches are shared between the SIMDs in various hierarchical arrangements, since there is substantial re-use for kernels that span multiple SIMDs. For example, instruction caches for ALU bundles are 48KB and shared between 4 SIMDs, while ALU constant caches are 4KB for every 2 SIMDs. The vertex instruction cache is 6KB for each set of 12 SIMDs, and 12KB for the entire chip.
Figure 2 – Cayman SIMD Front-end, Registers and Comparisons
An instruction sequencer in the SIMD fetches the next VLIW bundle or instruction in the clause and then decodes it into actual operations. The operations in an ALU bundle can have 2 or 3 inputs and a single non-destructive output. Two-input operations are explicitly predicated, while three-input operations must use an out-of-bounds register to suppress the result. Predication and lane masking are used for divergent control flow within a wavefront. The inputs can come from a combination of general purpose registers (GPRs), constants and the output of the previous VLIW bundle. An ALU clause can allocate 0-127 GPRs and up to 256 constants. The GPRs are actually 128 bits wide and hold four 32-bit values, described as the X, Y, Z and W elements (mapping to the 4 execution pipelines). Every GPR in a clause requires 64 copies across a wavefront. Some GPRs persist across multiple clauses, while others are only used as temporaries and are reclaimed at the end of a clause. For clauses accessing memory, the VLIW bundle will specify a GPR containing the read/write address, a destination for reads or a source for writes. The bundle will also indicate any additional manipulations, such as swizzles, broadcasts, predication, or texture/vertex specific operations (e.g. sampling and filtering).
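The 128-bit GPR layout and the per-wavefront register cost described above can be modeled with a small Python sketch (the class and function names are hypothetical, not AMD terminology):

```python
# Illustrative model of a 128-bit GPR holding four 32-bit elements
# X/Y/Z/W (one per VLIW pipeline), and of the 64 per-work-item copies
# a wavefront needs for each allocated GPR.

import struct

class GPR:
    """One 128-bit register: four 32-bit lanes named X, Y, Z, W."""
    def __init__(self, x=0.0, y=0.0, z=0.0, w=0.0):
        self.x, self.y, self.z, self.w = x, y, z, w

    def packed_bytes(self) -> bytes:
        # Four single-precision floats -> 16 bytes (128 bits).
        return struct.pack("<4f", self.x, self.y, self.z, self.w)

WAVEFRONT_WIDTH = 64

def entries_for_clause(num_gprs: int) -> int:
    """Register-file entries a wavefront consumes: 64 copies per GPR."""
    assert 0 <= num_gprs <= 127  # an ALU clause can allocate 0-127 GPRs
    return num_gprs * WAVEFRONT_WIDTH

print(len(GPR(1.0, 2.0, 3.0, 4.0).packed_bytes()))  # 16 bytes
print(entries_for_clause(10))                       # 640 entries
```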
The number of registers used by the executing wavefronts is one of the factors that can limit the number of wavefronts assigned to a SIMD. Each SIMD has 16K registers to share between 1-32 wavefronts. The registers are sized to hold four 32-bit values; if less data is stored, the register file will end up under-utilized. It is therefore critically important to pack data together into 128-bit chunks. With 32 wavefronts, that works out to an average of 8×128-bit registers per work-item (recall that a wavefront is actually 64-wide). However, the global limit on wavefronts is ~20.6 per SIMD, which translates into a slightly more generous 11-12 registers per work-item. Some wavefronts will be very short and use few registers, e.g. memory loads or exports, but as mentioned above, ALU wavefronts can use a large number of GPRs. The register file is optimized for 32-bit values, so a 64-bit value consumes twice as much space. If the wavefronts on each SIMD use many registers, then occupancy will decrease. For example, only 2 wavefronts can occupy a SIMD if they use 83 or more registers.
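Plugging the figures above into a simple occupancy formula (16K 128-bit entries per SIMD, 64 copies of each GPR per wavefront; the function itself is our own illustration, not AMD's allocation algorithm):

```python
# Back-of-the-envelope register-limited occupancy for one SIMD.

REGISTERS_PER_SIMD = 16 * 1024   # 128-bit entries per SIMD
WAVEFRONT_WIDTH = 64             # copies of each GPR per wavefront

def occupancy(gprs_per_workitem: int, cap: int = 32) -> int:
    """Wavefronts that fit, limited by registers and the per-SIMD cap."""
    per_wavefront = gprs_per_workitem * WAVEFRONT_WIDTH
    return min(cap, REGISTERS_PER_SIMD // per_wavefront)

print(occupancy(8))   # 16384 / (8*64) = 32 wavefronts (the average case)
print(occupancy(12))  # 21 wavefronts, near the ~20.6 global limit
```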
The SIMD register file logically contains 16K entries that are 128-bits long. The 256KB array is designed for the massive bandwidth necessary to feed 64 single precision multiply-accumulates every 4 cycles (requiring 192 inputs and 64 outputs). Since each register contains 4 operands, this means that a total of 48 reads (768B) and 16 writes (256B) are needed. One of the trade-offs in AMD’s VLIW architecture is that the compiler must explicitly map between the VLIW bundle inputs and the ports on the register files in the underlying hardware. This allows AMD to use simpler hardware for the register files, but complicates the compiler and can reduce utilization.
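The bandwidth arithmetic can be checked directly; the constants below follow the figures in the text:

```python
# Arithmetic check of the register file bandwidth figures: 64
# single-precision multiply-accumulates need 3 inputs and 1 output
# each, and each 128-bit register read or write moves 4 operands
# (16 bytes).

FMAS = 64
INPUTS_PER_FMA = 3
OPERANDS_PER_REGISTER = 4   # four 32-bit values per 128-bit entry
BYTES_PER_REGISTER = 16

inputs = FMAS * INPUTS_PER_FMA            # 192 input operands
outputs = FMAS                            # 64 output operands
reads = inputs // OPERANDS_PER_REGISTER   # 48 register reads
writes = outputs // OPERANDS_PER_REGISTER # 16 register writes

print(reads, reads * BYTES_PER_REGISTER)    # 48 reads, 768 bytes
print(writes, writes * BYTES_PER_REGISTER)  # 16 writes, 256 bytes
```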
Physically, the register file is partitioned for bandwidth into 16 identical arrays, one for each of the VLIW lanes. The register file for each VLIW lane is 1K entries and further sub-divided into 4 separate arrays – one for each of the 32-bit X, Y, Z and W elements. The individual element register files are 1R+1W ported, so a total of 4 reads are performed each cycle. The operations in an ALU wavefront can have only 3 source operands, so the element register files are read over 3 cycles; the fourth cycle is used to read operands for another wavefront (e.g. texture or memory). Every cycle, up to 4 element registers are read out into special virtual vector registers that act as temporary storage. In addition to the 12 element registers, as many as 4 constants can also be read into the virtual registers over the 4 cycle read period. Once a register is put into the virtual registers, it can be referenced multiple times over the 4 cycle period, thus amplifying the bandwidth of the register files and constants. A register can be re-used between different scalar operations within the VLIW bundle (e.g. if several operations read GPR1.X as an input). Registers can also be re-used as both the first and second input for a single scalar operation (e.g. computing GPR1.X * GPR1.X). The results of the previous VLIW bundle are also held in a special virtual vector register (called PV) and can be a source for operands. The VLIW format itself encodes the mapping from the hardware (i.e. virtual vector registers) to the source operands of the currently executing VLIW bundle.
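The 4-cycle operand-read window can be modeled loosely in Python. This is a conceptual sketch with our own names, not a description of the actual hardware interface; it captures only the limits stated above and the fact that a value, once in the virtual-register pool, can be referenced repeatedly:

```python
# Sketch of the operand-read window: over 3 cycles, each of the four
# 1R element arrays (X/Y/Z/W) can deliver one 32-bit value into a pool
# of "virtual" registers; up to 4 constants also enter the pool over
# the 4-cycle period. The VLIW bundle then references pool entries as
# many times as it likes, amplifying register-file bandwidth.

def collect_operands(element_reads, constant_reads):
    """element_reads: up to 3 cycles, each with up to 4 (element, gpr)
    reads; constant_reads: up to 4 constants. Returns the pool."""
    pool = set()
    assert len(element_reads) <= 3     # 3 cycles of source-operand reads
    for cycle in element_reads:
        assert len(cycle) <= 4         # one read per element array
        pool.update(cycle)
    assert len(constant_reads) <= 4
    pool.update(constant_reads)
    return pool

# GPR1.X is read once but can feed several slots of the bundle,
# e.g. computing GPR1.X * GPR1.X:
pool = collect_operands([[("X", 1), ("Y", 1)]], ["C0"])
print(("X", 1) in pool, len(pool))  # True 3
```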