In part one of our investigation into Knights Landing, we described several possible options for the CPU core and came to the conclusion that Intel probably designed a custom core for Knights Landing. Over the past few weeks, a number of leaked presentations have appeared online. Unsurprisingly, these leaks are equally illuminating and confusing. Most of the leaked slides are old and do not fully and accurately represent Intel’s choices for Knights Landing (KNL). That being said, the slides definitely contain helpful information – the trick is picking out the signal from the noise and extrapolating to other aspects of Knights Landing.
As a starting point, it is good to recap the information from the leaked slides:
- Products are targeting >3 TFLOP/s double precision
- Knights Landing uses 72 “Silvermont” cores
- Each “Silvermont” core includes 2 vector units
- Two “Silvermont” cores plus a shared 1MB L2 cache form a tile
- A mesh fabric routes between the tiles
- DDR4 memory controller
- Up to 16GB of on-package eDRAM
While some of these rumors are difficult to verify, quite a few check out. To start with, the rumored goal of >3 TFLOP/s double precision is quite sensible from a competitive and technology standpoint. KNL is built on a 14nm process and should deliver better than a 2× increase in performance over Knights Corner (KNC) – and 3 TFLOP/s is also consistent with the expected performance from Nvidia’s GPGPU-focused offerings in a similar time frame.
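As a sanity check, the rumored configuration can be run through some back-of-the-envelope arithmetic. The sketch below assumes 72 cores and two 512-bit FMA-capable vector units per core, as the rumors suggest – neither figure is confirmed:

```python
# Back-of-the-envelope peak DP throughput for the rumored KNL configuration.
# 2 vector units x 8 DP lanes (512-bit / 64-bit) x 2 FLOPs per FMA.
cores = 72
flops_per_clock = 2 * (512 // 64) * 2
assert flops_per_clock == 32

# Sweep a plausible frequency range; >3 TFLOP/s requires roughly 1.3GHz+.
for freq_ghz in (1.1, 1.3, 1.5):
    peak_tflops = cores * flops_per_clock * freq_ghz / 1000
    print(f"{freq_ghz} GHz -> {peak_tflops:.2f} TFLOP/s")
```

At 1.3GHz the rumored configuration lands almost exactly on the 3 TFLOP/s target, which lends some credibility to the leaked numbers.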
Knights Landing is not fully compatible with the previous generation, since AVX-512 is encoded differently than LRBni – but it is likely that the two designs are highly ‘performance compatible’. More specifically, the architects for Knights Landing probably ensured that any software tuned for Knights Corner will run well on Knights Landing without additional effort. Understanding this guiding principle helps to validate many of these rumors and think about the overall architecture of Knights Landing.
Knights Landing Core
The custom core in Knights Landing is derived from Silvermont, but with substantial modifications. KNL probably uses 4-way multi-threading for latency tolerance and also performance compatibility with the previous generation. The rumors also state that the KNL core will replace each of the floating point pipelines in Silvermont with a full blown AVX-512 vector unit, doubling the FLOPs/clock to 32.
This approach makes perfect sense given the dual-issue nature of the core and would substantially simplify the fabric and interconnect design. To deliver nearly 2-3× higher performance in KNL at similar frequency, a naïve approach requires at least twice the number of cores and a significantly more complex interconnect. Doubling the per-core FLOP/s decreases the pressure on the interconnect, which is critical. For throughput architectures the biggest challenge is not computation, but the data movement and attendant behavior such as coherency, so increasing the complexity of the core to simplify the interconnect is a logical step.
Turning to the memory hierarchy in the KNL core, there is relatively little information available, but it is quite possible to sketch out the most likely configuration. AVX-512 is a load-op instruction set that can source one operand from memory for each operation (typically assumed to be a fused multiply-add or FMA). This means that the KNL core must have a 512-bit (64B) wide load pipeline for each vector unit. In contrast, the Silvermont memory cluster contains separate and dedicated load and store pipelines that are only 16B wide. To ensure performance compatibility, the KNL L1 data cache will be at least 32KB and 8-way associative with an aggregate read bandwidth of 128B/cycle. Given the 1-1.5GHz target frequency, this translates to 128-192GB/s, which is slightly lower than the L1D bandwidth in Haswell.
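The bandwidth figures above fall out directly from the pipe widths (a sketch, assuming two 64B load pipes per core and the 1-1.5GHz frequency range discussed here):

```python
# L1D read bandwidth estimate: one 64B load pipe per AVX-512 vector unit.
load_pipes = 2
bytes_per_pipe = 512 // 8          # 64B per AVX-512 load
bytes_per_cycle = load_pipes * bytes_per_pipe
assert bytes_per_cycle == 128

# 1 B/cycle at 1 GHz is 1 GB/s, so the conversion is direct.
for freq_ghz in (1.0, 1.5):
    print(f"{freq_ghz} GHz -> {bytes_per_cycle * freq_ghz:.0f} GB/s")
```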
To summarize, most of the rumors concerning the Knights Landing core are correct, although we can add a fair bit of information regarding the actual memory pipeline and L1D cache design. Each core will offer 32 FLOPs/clock and 128B/cycle of bandwidth from the L1 data cache. To hit the stated performance goals, it is likely that Knights Landing will target 1.3-1.4GHz, but a more cautious estimate is 1.1-1.5GHz.
The next set of rumors concern the hierarchy of cores, fabric and external interfaces for KNL. These are the most critical aspects of the entire design from a power and performance standpoint. While the cores in a GPU or throughput computing device determine the theoretical peak performance, the achieved performance is far more important and heavily influenced by system architecture.
Knights Landing and Skylake
Knights Landing is Intel’s first clean sheet design for throughput computing, and will continue to scale across several process nodes. The existing ring-based fabric in Knights Corner is realistically at the end of its useful life. While the ring scaled to 64 cores, that design point was certainly outside of the sweet spot and cannot be extended further. Since Intel will scale Knights Landing derivatives down to 10nm and 7nm, a new fabric is a necessity.
Before discussing the technical attributes of such a fabric, it is important to recognize the underlying economic realities and constraints. Intel’s newfound emphasis on a SoC design philosophy suggests that Knights Landing and Skylake will share many basic building blocks. First, Knights Landing is a relatively low volume product – it would be a wild success if Intel could sell 1M units a year. In contrast, Intel is already selling ~15M mainstream server processors today. From an economic standpoint, Intel needs to amortize as much of the Knights Landing development costs over high volume servers as possible, since this will lower the costs needed to compete in the HPC market.
Technically, the design targets for KNL and Skylake-EX are very similar, but staggered by several years. The core counts for both product families are high enough that bandwidth is critical to sustained performance and a central design goal. Intel’s mainstream server processors (e.g., Ivytown) already have 15 cores and Skylake-EX will probably have 20-30 cores. Assuming that each new process node increases core count by 50%, this means that Skylake-EX is only 4-5 years behind KNL. As a result, sharing infrastructure has the benefit of using Knights Landing as the proverbial canary in the coal mine. In theory, any scalability problems will get ironed out before they can trickle down to the vastly more lucrative mainstream server products.
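The scaling argument can be made concrete (a sketch; the 50% growth per node and roughly two years per node are the assumptions used in the argument above, not confirmed figures):

```python
import math

# How many process nodes behind KNL's 72 cores is a 20-30 core Skylake-EX,
# assuming ~50% more cores per node and ~2 years per node?
knl_cores = 72
growth_per_node = 1.5
years_per_node = 2

for skylake_cores in (30, 20):
    nodes = math.log(knl_cores / skylake_cores, growth_per_node)
    years = nodes * years_per_node
    print(f"{skylake_cores} cores: ~{years:.1f} years behind KNL")
```

With the higher core-count estimate for Skylake-EX, the gap works out to roughly four to five years, consistent with the staggered design targets described above.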
Given that Knights Landing features a new custom core, and the similarities between Knights Landing and Skylake-EX, it stands to reason that the rest of the chip is shared with Skylake. Specifically, the on-die fabric and system infrastructure (e.g., last level cache, memory controllers, I/Os) are likely to be identical or at least closely related.
Knights Landing Tiles
The first level of hierarchy in KNL is the tile, which the rumors describe as a pair of cores sharing a 1MB L2 cache. The ‘tile’ concept itself is quite plausible; while there may not be substantial data sharing between cores, the instructions are highly likely to be shared. Moreover, splitting the design into 36 tiles rather than 72 independent cores simplifies the fabric design. While each agent (i.e., KNL tile) has higher bandwidth than the agents in KNC (i.e., a single core and L2 cache), the number of agents is much lower, which reduces latency and simplifies the fabric topology.
Unfortunately, the rumored details of the KNL tiles are fairly illogical. Each KNL core has 128B/cycle of bandwidth to the L1D cache. Based on Intel’s design choices in Haswell and Knights Corner, this strongly suggests that a shared L2 cache must provide at least 64B/cycle per core, for a total of 128B/cycle per L2 tile. This means the L2 cache will be heavily banked and deliver two cache lines per clock. This focus on bandwidth is important, but also has negative implications for density. In an ideal world, the L2 cache would be 512KB/core (i.e., 1MB total per tile) – providing the same cache capacity as in Knights Corner. As a practical matter though, this is not feasible with 72 cores. If the total L2 capacity is 36MB, that would make it impossible to use an inclusive L3 cache (which is likely for reasons we will discuss later). As a general rule of thumb, an inclusive cache (e.g., the L3) must be at least 8× larger than the included caches (e.g., the L1s+L2s). A 288MB L3 cache would consume nearly 500mm² on 14nm and is not a feasible option.
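The arithmetic behind that rule of thumb is straightforward (a sketch, using the rumored 1MB-per-tile figure that the argument above rules out):

```python
# Inclusive-cache sizing rule of thumb: an inclusive L3 should be at
# least ~8x the total capacity of the caches it includes.
tiles = 36
l2_per_tile_mb = 1                 # the rumored 512KB/core configuration

total_l2_mb = tiles * l2_per_tile_mb
min_l3_mb = 8 * total_l2_mb
print(min_l3_mb)  # 288 (MB) -- far too large for a 14nm die
```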
The most likely configuration for the L2 cache is 512KB per tile; assuming that the instructions are shared between both KNL cores, this works out to an effective size of slightly better than 256KB/core. Intel has used 256KB L2 caches for nearly every large core from Nehalem through Haswell, so there is a historical precedent as well. Moreover, the only logical alternative is a smaller 256KB L2 cache per tile (128KB/core), but that seems too small to be effective.