Thread scheduling within a core is primarily managed by hardware. The highest priority thread with a ready instruction is sent down the pipeline and can execute for several cycles. A thread stalls and is switched out if its next instruction is still waiting for operands. There is some degree of software control, though: an instruction can force a thread switch upon completion, and atomic instructions always have the highest priority.
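As a rough illustration of the selection rule described above, the following Python sketch picks the highest priority thread whose next instruction is ready. The `HwThread` type, the numeric priority encoding (larger means higher) and the tie-break on thread id are assumptions made for the example; the actual hardware arbitration is not documented here.

```python
from dataclasses import dataclass


@dataclass
class HwThread:
    tid: int
    priority: int   # assumption: a larger value means a higher priority
    ready: bool     # True when the next instruction has all of its operands


def pick_next_thread(threads):
    """Return the highest-priority thread with a ready instruction, or None."""
    ready = [t for t in threads if t.ready]
    if not ready:
        return None
    # Assumption: ties go to the lowest thread id; the real rule is not described.
    return max(ready, key=lambda t: (t.priority, -t.tid))


threads = [
    HwThread(tid=0, priority=3, ready=False),  # stalled, waiting for operands
    HwThread(tid=1, priority=2, ready=True),
    HwThread(tid=2, priority=1, ready=True),
]
chosen = pick_next_thread(threads)
print(chosen.tid)
```

Thread 0 has the highest priority but is stalled, so the sketch skips it and selects thread 1.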
One of the major enhancements in Gen 6 is better control flow. The Sandy Bridge graphics cores have new instructions and native support for while loops, calls, returns, indexed jumps and case statements. Additionally, there is an instruction pointer for each channel and hardware support for unlimited nesting, which enables recursion. In Ironlake, unlimited nesting required software assistance, which dramatically reduced performance.
Figure 4 – Shader Front-end Comparison
The 4KB L1 instruction cache is shared by a row of 3 cores to improve re-use and exploit locality. The L1 caches are backed by a single shared 24KB L2 instruction cache. The instruction caches use 64B lines, each containing 4 fixed-length instructions, and do not support self-modifying code. Presumably there is an instruction fetch buffer in each core that can hold 4 instructions per thread.
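The cache figures above imply a few derived numbers, which this short Python sketch works out. It uses only the sizes quoted in the text; the variable names are illustrative.

```python
LINE_BYTES = 64
INSTRUCTIONS_PER_LINE = 4
L1_BYTES = 4 * 1024
L2_BYTES = 24 * 1024

# Four fixed-length instructions share one 64B line.
instruction_bytes = LINE_BYTES // INSTRUCTIONS_PER_LINE

l1_lines = L1_BYTES // LINE_BYTES
l1_instructions = l1_lines * INSTRUCTIONS_PER_LINE

l2_lines = L2_BYTES // LINE_BYTES
l2_instructions = l2_lines * INSTRUCTIONS_PER_LINE

print(instruction_bytes, l1_lines, l1_instructions, l2_lines, l2_instructions)
```

In other words, each instruction occupies 16B (128 bits), an L1 holds 256 instructions and the shared L2 holds 1536.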
The Gen 5 & 6 instruction set is fairly complex and quite powerful. The overall vector length of a single instruction is the product of the vector width and the number of channels, and is limited by the register size. There are also compressed instructions, which operate on twice as many data elements and are decoded into two native instructions. Conditional execution is achieved using predication, destination masks and execution masks.
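To make the conditional execution idea concrete, here is a minimal Python sketch of a per-channel add gated by an execution mask and a predicate. It assumes a channel is written only when both bits are set; how the hardware actually combines the masks is not specified in this article, and `predicated_add` is a made-up name.

```python
def predicated_add(dst, src0, src1, exec_mask, predicate):
    """Per-channel add; a channel is updated only if its execution-mask bit
    and its predicate flag are both set (assumed combination rule)."""
    return [
        a + b if (enabled and flag) else old
        for old, a, b, enabled, flag in zip(dst, src0, src1, exec_mask, predicate)
    ]


dst = [0] * 8
src0 = [1, 2, 3, 4, 5, 6, 7, 8]
src1 = [10] * 8
exec_mask = [1, 1, 1, 1, 0, 0, 1, 1]   # channels 4 and 5 are inactive
predicate = [1, 0, 1, 0, 1, 0, 1, 0]   # e.g. the result of an earlier compare

result = predicated_add(dst, src0, src1, exec_mask, predicate)
print(result)
```

Channels where either bit is clear keep their old destination value, so no branch is needed to skip them.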
To improve efficiency and flexibility, the operands for instructions do not need to be aligned within the register file. Instructions can use region register addressing for operands, which is essentially a 2-dimensional strided gather within the register file that can span multiple physical registers. This is particularly handy for avoiding packing and unpacking of data structures or working with media data that has irregular alignment. Additionally, indirect register operands are available, where a separate address register and an offset indicate the location of the actual input operand.
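The region addressing described above can be pictured as a small 2-dimensional strided gather over a flat register file. The Python sketch below models two 32B registers as one byte string; the parameter names (`vert_stride`, `width`, `horz_stride`) and the function `gather_region` are illustrative assumptions, not the hardware encoding.

```python
def gather_region(grf, base, rows, width, vert_stride, horz_stride, elem_size=1):
    """Gather rows x width elements from a flat register file.

    Strides are counted in elements. The region may cross register
    boundaries because the file is treated as one contiguous byte array.
    """
    out = []
    for r in range(rows):
        for c in range(width):
            off = base + (r * vert_stride + c * horz_stride) * elem_size
            if off + elem_size > len(grf):
                raise IndexError("region runs past the end of the register file")
            out.append(int.from_bytes(grf[off:off + elem_size], "little"))
    return out


# Two 32B registers whose byte values equal their own offsets.
grf = bytes(range(64))

# Unaligned start at byte 4, every second byte, rows 16 bytes apart.
region = gather_region(grf, base=4, rows=3, width=4, vert_stride=16, horz_stride=2)
print(region)

# Indirect operand: an address register plus an offset gives the base.
address_register, offset = 32, 4
indirect = gather_region(grf, address_register + offset, rows=1, width=4,
                         vert_stride=0, horz_stride=1)
print(indirect)
```

The third row of the first gather lands in the second register (offsets 32 and above), which shows how one operand can span multiple physical registers.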
Instructions access registers in either a 16B or 1B aligned mode. The 16B aligned mode is intended for 4-component data (RGBA) packed into 16B blocks. This mode has full source swizzling and destination masking for operands, but limited register regions, since any access must be 16B aligned within the register file. The 1B aligned mode is targeted at SOA execution. Operands must be aligned to their natural data type within the register file (down to 1B) and can use the full register region addressing capabilities; however, source swizzling and destination masking are disabled. Together, these two modes mean that the GPU is mostly indifferent to data formatting and can easily transpose a data structure from SOA to AOS.
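The following Python sketch illustrates what source swizzling and destination masking do to a 4-component value, and what an AOS to SOA transpose looks like. The string notation for swizzles and masks (`"wzyx"`, `"xz"`) and the helper names are conventions chosen for this example only.

```python
CHANNELS = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(vec4, pattern):
    """Reorder or replicate source components, e.g. 'wzyx' or 'xxxx'."""
    return [vec4[CHANNELS[c]] for c in pattern]


def masked_write(dst, src, write_mask):
    """Write only the components named in write_mask; keep the rest of dst."""
    return [s if name in write_mask else d for name, d, s in zip("xyzw", dst, src)]


def aos_to_soa(pixels):
    """Transpose [[r, g, b, a], ...] into [[r, ...], [g, ...], [b, ...], [a, ...]]."""
    return [list(component) for component in zip(*pixels)]


rgba = [1.0, 2.0, 3.0, 4.0]
reversed_rgba = swizzle(rgba, "wzyx")
partial = masked_write([0.0] * 4, reversed_rgba, "xz")
soa = aos_to_soa([[1, 2, 3, 4], [5, 6, 7, 8]])
print(reversed_rgba, partial, soa)
```

The swizzle and mask apply per 4-component block, which is why they fit the 16B aligned mode, while the transposed layout matches the SOA style served by the 1B aligned mode.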
The majority of instructions are 3-operand (two sources and one destination) with source swizzling and destination masking, although some have an implied fourth operand. Both source operands can use region addressing, but only the first can use indirect addressing. A few instructions, such as the multiply-add, have 3 source operands, but they are required to use 16B alignment.
Registers in the Gen 5 & 6 architecture are all 256 bits (32B) wide, and most operations expect similarly sized chunks of data. The register size is related to the different execution modes discussed previously. When using single precision floating point data, each input operand for SIMD1x8 instructions corresponds exactly to a single register. The longer SIMD1x16 instructions are compressed and will essentially decode into two separate instructions with a register for each input operand. A Gen 6 core contains a general purpose register file (GRF), a message register file (MRF) and an architecture register file (ARF, not shown).
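The relationship between SIMD width and register size is simple arithmetic, sketched here in Python. The helper names and the representation of a decoded instruction as a channel range are assumptions for illustration.

```python
REGISTER_BYTES = 32   # 256-bit registers
FLOAT_BYTES = 4       # single precision


def registers_per_operand(simd_width, elem_bytes=FLOAT_BYTES):
    """Number of 32B registers needed to hold one operand (rounded up)."""
    total = simd_width * elem_bytes
    return -(-total // REGISTER_BYTES)


def decode(simd_width, native_width=8):
    """Split an instruction into native-width pieces, as (first, last + 1) channel ranges."""
    return [(start, min(start + native_width, simd_width))
            for start in range(0, simd_width, native_width)]


print(registers_per_operand(8), decode(8))
print(registers_per_operand(16), decode(16))
```

Eight 4B floats fill one 32B register exactly, while a 16-wide operand needs two registers and the instruction splits into two 8-channel halves.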
The Gen 6 GRF has 640 entries for a total of 20KB of data and is used for computation by each thread. Each of the 5 threads is allocated 128 entries when it is dispatched to a core; unlike on AMD and Nvidia GPUs, threads do not have a variable number of registers. A thread can freely read and write the GRF and spill to a 32KB region of memory that is held in Sandy Bridge’s L3 cache. Each register holds multiple values, and the Gen 6 architecture natively supports sub-register accesses as small as 1 byte and as large as 4B (32 bits); there is currently no double precision support. The register file is physically split into odd and even banks that can be accessed in parallel for high bandwidth; every cycle a bank can read one register and write one register.
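The GRF figures can be cross-checked with a few lines of Python. The capacity numbers come from the text; mapping a register to a bank by the parity of its index is an assumption, since the article only states that the banks are split into odd and even.

```python
REGISTER_BYTES = 32
GRF_ENTRIES = 640
THREADS_PER_CORE = 5
ENTRIES_PER_THREAD = 128

grf_bytes = GRF_ENTRIES * REGISTER_BYTES
per_thread_bytes = ENTRIES_PER_THREAD * REGISTER_BYTES
allocated_entries = THREADS_PER_CORE * ENTRIES_PER_THREAD


def bank_of(register_index):
    """Assumed mapping: even-numbered registers in one bank, odd in the other."""
    return "odd" if register_index % 2 else "even"


def can_read_in_parallel(reg_a, reg_b):
    """Two reads can proceed in the same cycle only if they hit different banks."""
    return bank_of(reg_a) != bank_of(reg_b)


print(grf_bytes // 1024, per_thread_bytes // 1024, allocated_entries)
print(can_read_in_parallel(4, 5), can_read_in_parallel(4, 6))
```

The 5 fixed allocations of 128 entries account for all 640 entries, which gives each thread 4KB of register space.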
The other two register files are special purpose. The MRF is used by the messaging framework to communicate with other cores and fixed function blocks in the GPU. It contains 24 registers (0.75KB) per thread and is a write-only structure; each thread writes messages into the MRF that will be sent to other parts of the GPU. When a message is sent or returned to a core, the contents will actually be written into the GRF so that the data can be subsequently read by the receiving thread. This technique is used to pass data between threads on the cores, and also to initialize data values in the GRF when a thread is first dispatched. A thread may have multiple messages queued up or in-flight at any time.
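A toy Python model may help picture the message flow described above: a thread writes a payload into its write-only MRF, and delivery copies that payload into the receiver's GRF, where it can be read. The class `WriteOnlyMRF`, its methods and the list-based GRF are hypothetical constructs for illustration and do not reflect the real message format.

```python
REGISTER_BYTES = 32
MRF_REGISTERS = 24

mrf_bytes = MRF_REGISTERS * REGISTER_BYTES   # 768B, i.e. 0.75KB per thread


class WriteOnlyMRF:
    """Toy model: the owning thread can write registers but has no read method."""

    def __init__(self):
        self._regs = [bytes(REGISTER_BYTES)] * MRF_REGISTERS

    def write(self, index, payload):
        if len(payload) != REGISTER_BYTES:
            raise ValueError("payload must fill exactly one 32B register")
        self._regs[index] = bytes(payload)

    def send(self, first, count, dest_grf, dest_start):
        """Deliver `count` message registers into the receiver's GRF."""
        for i in range(count):
            dest_grf[dest_start + i] = self._regs[first + i]


sender_mrf = WriteOnlyMRF()
sender_mrf.write(0, b"\x01" * REGISTER_BYTES)
sender_mrf.write(1, b"\x02" * REGISTER_BYTES)

receiver_grf = [bytes(REGISTER_BYTES)] * 128   # one thread's 128-entry GRF slice
sender_mrf.send(first=0, count=2, dest_grf=receiver_grf, dest_start=10)
print(mrf_bytes, receiver_grf[10][:2], receiver_grf[11][:2])
```

The receiving thread only ever sees the data in its own GRF, which matches the way initial values are delivered when a thread is first dispatched.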
The ARF contains a variety of registers that are used for managing and controlling the threads in a core. This includes registers that hold the instruction pointer (IP), thread priority and dependency information, notifications from the messaging framework and flags for control flow, per-channel IPs and exceptions. The ARF also includes 2 address registers which are used for indirect register addressing and 2 accumulator registers for higher precision operations.