Texture Sampling Pipeline
One of the big differences between Intel’s graphics and the high-powered architectures from AMD and Nvidia is the memory pipeline. To span the entire graphics market, the memory hierarchy must scale with the number of shader cores, roughly a factor of 12. However, Intel’s target market is much narrower: Sandy Bridge graphics comes in just two models with roughly a 2X difference in performance.
Intel’s architects designed a shared memory pipeline that sits entirely outside the cores, in contrast to the approach taken by AMD and Nvidia. The memory hierarchy for Gen 5 and Gen 6 graphics is accessed by the shader cores entirely through the messaging framework. The main components are a single texture sampling engine and a data port that controls all other caches and the render output pipeline. As Figure 6 shows, AMD’s texturing pipelines and L1 texture caches are distributed within each shader core, rather than centralized, and the L2 texture caches are replicated for each memory channel.
Figure 6 – Texture Sampling Pipeline Comparison
Both Ironlake and Sandy Bridge use a virtual memory model based on 4KB pages, for compatibility with x86, with a two-level translation structure. Both linear and 4KB-tiled address spaces are supported; the latter is essential for the rectangular buffers that are common in graphics. Up to 2GB of memory can be mapped, and all pages must be locked so that they cannot be swapped to disk. The graphics page tables indicate whether a given page is snooped/coherent (system memory) or un-snooped (main memory), and a special global page table is used for memory that can also be accessed by the CPU.
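The benefit of a tiled address space can be illustrated with a small sketch. The tile layout below (a 4KB tile arranged as 512-byte-wide rows, 8 rows tall) is only an illustrative assumption in the spirit of Intel’s tiling, not the documented Sandy Bridge format; the point is that a small rectangular access touches far fewer pages than it would in a linear layout:

```python
TILE_BYTES = 4096                          # one tile = one 4KB page
TILE_WIDTH_B = 512                         # illustrative row width in bytes
TILE_HEIGHT = TILE_BYTES // TILE_WIDTH_B   # 8 rows per tile

def linear_offset(x_b, y, pitch_b):
    """Classic linear (row-major) layout."""
    return y * pitch_b + x_b

def tiled_offset(x_b, y, pitch_b):
    """4KB-tiled layout: the surface is carved into 512B x 8 tiles."""
    tiles_per_row = pitch_b // TILE_WIDTH_B
    tile_x, in_x = divmod(x_b, TILE_WIDTH_B)
    tile_y, in_y = divmod(y, TILE_HEIGHT)
    tile_index = tile_y * tiles_per_row + tile_x
    return tile_index * TILE_BYTES + in_y * TILE_WIDTH_B + in_x

# Distinct 4KB pages touched by a 4x4-pixel block (32-bit texels, 16B wide)
# in a surface with an 8KB pitch:
pages = lambda f: {f(x, y, 8192) // 4096 for x in range(0, 16, 4) for y in range(4)}
# linear: each row of the block lands 8KB apart, so 4 distinct pages;
# tiled:  the whole block sits inside a single 4KB tile, so just 1 page.
```

With 4KB pages and per-page translation, fewer pages touched per access means fewer TLB entries and less translation overhead for the rectangular access patterns that dominate graphics.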
Texture sampling is a read-only memory pipeline used for both graphics and media applications. The sampling engine receives commands and coordinates from the cores through the messaging framework. The texture addressing unit takes the base coordinates of four pixels and generates up to 32 texel addresses, depending on the filtering mode and the level of anisotropic filtering. The sampler handles 4-component packed data with 8-bit, 16-bit, or 32-bit components (corresponding to 32-bit, 64-bit, and 128-bit wide texels).
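The “up to 32 texel addresses” figure follows from simple footprint accounting. The sketch below is our own illustrative bookkeeping, not Intel’s addressing logic; it assumes the standard per-pixel footprints of 1, 4, and 8 texels for nearest, bi-linear, and tri-linear filtering:

```python
def texel_addresses(mode, aniso_ratio=1):
    """Texel fetches generated for one request of four pixels (a quad).
    Illustrative accounting only, not Intel's exact addressing logic."""
    per_pixel = {"nearest": 1, "bilinear": 4, "trilinear": 8}[mode]
    # Anisotropic filtering multiplies the footprint by the aniso ratio,
    # presumably issued over multiple cycles rather than in one request.
    return 4 * per_pixel * aniso_ratio

texel_addresses("bilinear")   # 16 addresses for one quad
texel_addresses("trilinear")  # 32 addresses: the "up to 32" peak
```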
Texture accesses probe the set-associative texture caches; the 4KB L1 texture cache is backed by a larger 16KB L2 cache. These caches are read-only and explicitly managed by the driver. The texel data is optionally filtered to yield color values. The pipeline can gamma-correct textures and can also selectively change sampled texels to black or transparent, based on their color values. This technique is referred to as chroma keying and is used for compositing. Table 2 shows the performance of Sandy Bridge’s texture sampling pipeline.
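Chroma keying amounts to a per-channel range compare against a key color. The sketch below shows the general technique; the range-compare formulation and the replace-with-transparent-black behavior are assumptions for illustration, not Sandy Bridge’s documented semantics:

```python
def chroma_key(texel, key_lo, key_hi):
    """Replace texels whose color falls inside the key range with
    transparent black (RGBA 0,0,0,0). Per-channel range compare, as a
    sampler-style chroma key would do; exact hardware behavior may differ."""
    r, g, b, a = texel
    in_key = all(lo <= c <= hi
                 for c, lo, hi in zip((r, g, b), key_lo, key_hi))
    return (0, 0, 0, 0) if in_key else texel

# Key out pure green (the classic green-screen compositing case):
chroma_key((0, 255, 0, 255), (0, 250, 0), (5, 255, 5))    # keyed out
chroma_key((200, 30, 40, 255), (0, 250, 0), (5, 255, 5))  # passes through
```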
Table 2 – Sandy Bridge Texture Sampling Performance
As the table shows, texturing throughput depends heavily on the data format and filtering complexity. Although Intel did not disclose the precise bandwidth from the L1 texture cache, the overall sampling performance implies 128B/cycle. Interestingly, the throughput for 128-bit texture data with bi-linear and tri-linear filtering is 2X lower than expected: if texture cache bandwidth were the only limitation, 128-bit textures should run at half the speed of the 64-bit format, suggesting that the 32-bit filtering units (rather than texture lookup) are the bottleneck. Anisotropic filtering (AF) performance is relatively low, but AF is primarily used for pixels on surfaces at an angle to the viewer; most pixels use bi-linear or tri-linear filtering.
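The 128B/cycle inference can be reproduced with back-of-the-envelope arithmetic, under the assumption that the sampler sustains one bi-linearly filtered quad (four pixels) per clock with 64-bit texels:

```python
# Worst case, bi-linear filtering reads 4 texels per pixel, and a quad is
# 4 pixels, so one quad per clock with 64-bit (8B) texels needs:
texels_per_quad = 4 * 4
bytes_per_texel_64b = 8
l1_bandwidth = texels_per_quad * bytes_per_texel_64b  # bytes per cycle

# At that same bandwidth, 128-bit (16B) texels could sustain half the quad
# rate. The measured rate is half of *that* again, which is what points at
# the 32-bit filtering units rather than cache bandwidth as the bottleneck.
```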
Unfortunately, reviews have shown that Intel has taken several shortcuts with texture filtering that degrade image quality. Specifically, the algorithm appears to be highly angle-dependent and produces undesirable results in synthetic tests. In practice, the impact on games seems noticeable but not prominent. These kinds of shortcuts were common in earlier products from Nvidia and ATI (e.g. the GeForce FX), but should not be a problem in 2011 and will hopefully be rectified in future GPUs.
Beyond graphics, the texture sampler is also used extensively for encoding and decoding media. The sampling engine contains fixed-function hardware that can apply de-noise filtering to clean up a video stream. There is also a block that detects and corrects video interlacing, avoiding the lower-quality interpolation of interlaced frames. Lastly, video scaling and image enhancement use the texturing pipeline. The adaptive video scaler applies an 8×8 sharpening filter and a bi-linear smoothing filter, then blends the two outputs to produce the final result. The same techniques can also be applied to static images.
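The adaptive scaler’s final step is a weighted blend of the two filter outputs. The sketch below shows that blend per pixel; the blend weight and the idea of deriving it from local edge content are our assumptions, since Intel did not describe the selection logic:

```python
def adaptive_scale_blend(sharp, smooth, alpha):
    """Blend the 8x8 sharpening filter output with the bi-linear smoothing
    output. alpha (0..1) would be chosen adaptively per pixel, e.g. from
    local edge content (an assumption, not Intel's disclosed heuristic)."""
    return alpha * sharp + (1.0 - alpha) * smooth

# An edge-heavy pixel leans toward the sharpened result:
adaptive_scale_blend(200.0, 120.0, 0.75)
```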