The Gen 6 architecture is integrally tied to the Last Level Cache (LLC) and ring interconnect in Sandy Bridge, as seen in Figure 2. The ring is responsible for maintaining coherency and consistency, and both the ring and the LLC operate at the CPU core frequency. The LLC for high-end models totals 8MB, implemented as four slices that are each 2MB and 16-way associative.
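As a quick sanity check on those cache figures, the arithmetic can be sketched as below. The slice count, slice size and associativity come from the text. The 64-byte line size is an assumption made for illustration, and the function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def llc_geometry(slices=4, slice_bytes=2 * MIB, ways=16, line_bytes=64):
    """Return (total LLC bytes, sets per slice).

    slices, slice_bytes and ways follow the article's description.
    line_bytes=64 is an assumption; the article does not state it.
    """
    total_bytes = slices * slice_bytes
    # Each set holds `ways` lines, so sets = capacity / (ways * line size).
    sets_per_slice = slice_bytes // (ways * line_bytes)
    return total_bytes, sets_per_slice


total, sets = llc_geometry()
print(total // MIB, "MB total,", sets, "sets per slice")
```

Under that line-size assumption, each 2MB slice works out to 2,048 sets of 16 ways.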
The GPU driver can allocate regions of the LLC for the media or graphics pipeline. The media regions are non-coherent, while the graphics regions are coherent but weakly consistent, and the CPU regions of the LLC have traditional x86 consistency. Each slice of the LLC can send or receive (but not both) 32B/cycle onto the ring interconnect, and the GPU has a 32B port as well.
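To get a feel for what a 32B/cycle port means in practice, a rough estimate of peak bandwidth per port is sketched below. The 32B/cycle figure is from the text. The 3.4GHz clock is purely an illustrative assumption (the ring runs at the CPU core frequency, which varies by model and power state), and the function name is hypothetical.

```python
def port_bandwidth_gb_per_s(bytes_per_cycle=32, clock_ghz=3.4):
    """Peak one-direction bandwidth of a single ring port, in GB/s.

    bytes_per_cycle comes from the article. clock_ghz is an assumed
    example value, not a figure from the article. A slice can send or
    receive in a given cycle, but not both, so this is a ceiling on the
    combined traffic rather than a per-direction guarantee.
    """
    return bytes_per_cycle * clock_ghz


for ghz in (1.6, 3.4):
    print(f"{ghz} GHz -> {port_bandwidth_gb_per_s(clock_ghz=ghz):.1f} GB/s per port")
```

Because the ring clock tracks the core clock, the available bandwidth scales with CPU frequency rather than being a fixed number.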
The memory controller resides in the system agent, and also has a port onto the ring interconnect. GPU memory buffers can be directed to the LLC, or to main memory, but in either case, they rely on the ring for access to data. The LLC is predominantly used for cache friendly data structures, such as commands and vertex data, but can also be used to hold texture data.
Figure 2 – Sandy Bridge and Llano System Architecture
Unlike AMD’s Fusion architecture or Ironlake, Sandy Bridge is intended to pass data between the CPU and GPU using the LLC. Since the GPU is weakly consistent, while x86 is strongly consistent, synchronization is needed to ensure safety when data is passed to the CPU. In contrast, the default coherency behavior of the ring and LLC is sufficient to guarantee correctness when the GPU receives data from the strongly consistent CPUs. Compared to accessing memory, on-die communication is vastly higher performance and more power efficient.
One of the unfortunate aspects of the unification of computer architecture and programmable graphics is the confusing terminology. Computer architects have a relatively standard vocabulary and syntax for describing microprocessors. While engineers from AMD, IBM, Intel and Oracle (or even ARM and MIPS) have slightly different terminology, it is more or less uniform and readily understandable to all. In part, this is because microprocessor designers are fairly transparent about their architectures, presenting them openly at conferences and writing detailed tuning and optimization guides.
The world of graphics is bizarrely opaque and has no real standard terminology. AMD, Intel and Nvidia all use very different vocabularies to refer to the same underlying objects and principles. These terms are often determined by marketing, and the desire for bigger and better numbers, as opposed to a desire to clearly explain. When compared to computer architecture terminology, it is downright bewildering. For the purposes of this article, we use the term ‘shader core’ to refer to an Intel EU, an AMD SIMD or an Nvidia SM – which roughly corresponds to a single processor core within the GPU. Critically, each of these shader cores is complete and can independently fetch, decode, issue and execute various instructions. We refer to each of the vector lanes within these shader cores as an execution unit or EU – equivalent to an AMD streaming processor, an Nvidia CUDA core or one lane of an AVX/SSE instruction.
The graphics and media pipelines for Sandy Bridge begin with the command streamer, which reads in high-level commands (e.g. a draw call) and manages the 3D and media pipelines and fixed functions. The graphics and media pipelines cannot simultaneously be active, so the command streamer must select the mode of operation.
The command streamer is responsible for maintaining the 3D pipeline and orchestrating the different stages (e.g. vertex fetch, vertex shading, geometry shading, clip/setup, pixel shading and raster output) across the programmable GPU cores and fixed function hardware. Each stage of the pipeline passes data through to the next for processing, using a messaging framework. Messages can pass data by directly writing to registers, but also by using the Unified Return Buffer (URB). The URB is a globally shared and explicitly addressed data structure managed by the command streamer. The programmable shader cores can write to the URB, and the fixed function blocks have both read and write access. The URB is 64KB, organized as 2K entries that are 256 bits wide for the fully featured GT2 graphics (and Ironlake). The lower-end GT1 has only 6 shader cores and 1K entries in the URB. In some sense, the URB is conceptually similar to AMD’s Global Data Share structure; perhaps future versions will be fully accessible to the shader array.
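The URB capacity follows directly from the entry count and entry width, which a short calculation can confirm. The GT2 figures are from the text. The GT1 capacity is derived here on the assumption that its entries are also 256 bits wide, which the text does not state, and the function name is hypothetical.

```python
KIB = 1024


def urb_bytes(entries, entry_bits=256):
    """Capacity of a URB with `entries` entries of `entry_bits` bits each."""
    return entries * entry_bits // 8


gt2 = urb_bytes(2 * 1024)  # 2K entries, as stated for GT2 (and Ironlake)
gt1 = urb_bytes(1 * 1024)  # 1K entries; 256-bit entry width is assumed for GT1

print("GT2 URB:", gt2 // KIB, "KB")
print("GT1 URB:", gt1 // KIB, "KB")
```

Each 256-bit entry is 32 bytes, so 2K entries give the 64KB quoted above, and 1K entries would give half that under the stated assumption.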
Ironlake’s media hardware was not actually pipelined – a media command could not be issued until the prior one had fully finished and flushed all state. The Sandy Bridge media hardware and scheduler are pipelined so that multiple commands (and streams) can be in-flight – similar to the 3D pipeline. The media pipeline uses both fixed function hardware and the programmable GPU cores.
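The benefit of pipelining media commands can be illustrated with a deliberately idealized model. Everything here is an assumption for illustration: the stage count, the one-cycle-per-stage timing and the function names are hypothetical and do not describe Intel's actual hardware.

```python
def serialized_cycles(num_commands, stages):
    """Each command must fully finish before the next one is issued."""
    return num_commands * stages


def pipelined_cycles(num_commands, stages):
    """Idealized pipeline: a new command enters every cycle once the
    previous one has moved on, with no stalls or flushes."""
    if num_commands == 0:
        return 0
    return stages + (num_commands - 1)


n, s = 8, 4  # hypothetical: 8 commands through a 4-stage pipeline
print("serialized:", serialized_cycles(n, s), "cycles")
print("pipelined: ", pipelined_cycles(n, s), "cycles")
```

In this toy model the serialized case grows with the product of commands and stages, while the pipelined case grows with their sum. Real hardware sees stalls and dependencies, so the actual gain would be smaller.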