Knights Landing Options
In the last few years, Intel has moved towards a system-on-chip (SoC) design philosophy that mirrors the industry at large. While critical IP blocks such as CPU cores are custom designed and placed as hard macros, there is a growing emphasis on re-use across product lines. Perhaps the most obvious example is that the Silvermont core is used in smartphones, tablets, and microservers (respectively codenamed Merrifield, Bay Trail, and Avoton). Consequently, the first step to figuring out what core is used in Knights Landing (KNL) is to understand the options available to Intel’s architects.
There are at least four publicly known cores that are available to Intel’s architects as building blocks for the 14nm Knights Landing:
- Big cores, such as Haswell and Skylake
- Small cores, such as Silvermont and Goldmont
- The Knights Corner core, a derivative of the P54C
- The Quark core, a derivative of the 486
While the earlier generation Atom core (known as Bonnell) is available, it has not been productized on 22nm or 14nm. Similarly, Intel has a number of older cores (e.g., Nehalem), but those have not been taped out on a modern process, and their microarchitectures would be quite challenging to extend to wide vectors. Specifically, all of Intel’s CPU cores prior to Sandy Bridge use a re-order buffer that holds all data – integer, 80-bit FP, and 128-bit SSE. This means that for 32-bit integer data, the majority of each entry is wasted, since it must be sized to hold a full SSE value. Extending this microarchitecture to 512-bit AVX3 registers would be highly inefficient compared to the data-less re-order buffer and physical register file used in Sandy Bridge and later cores.
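The storage cost of a data-holding re-order buffer is easy to quantify. As a rough sketch (the function name is illustrative, and it assumes every entry is sized for the widest supported value), the fraction of an entry left unused by a 32-bit integer is:

```python
def wasted_fraction(value_bits: int, entry_bits: int) -> float:
    """Fraction of a data-holding ROB entry left unused by a narrower value."""
    if not 0 < value_bits <= entry_bits:
        raise ValueError("value must fit inside the entry")
    return 1 - value_bits / entry_bits


# Entries sized for 128-bit SSE versus a hypothetical 512-bit vector.
for entry_bits in (128, 512):
    waste = wasted_fraction(32, entry_bits)
    print(f"{entry_bits}-bit entry, 32-bit integer: {waste:.2%} unused")
```

Three quarters of a 128-bit entry is idle for integer data, and the idle share grows to 15/16 if the entry has to hold a 512-bit value.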
There are probably a number of other CPU cores under development at Intel that are not publicly known, making it difficult to enumerate a full set of options. However, these four cores are potential starting points and would all have to be augmented in various ways to really serve the goals of the Knights Landing program.
The big cores such as Haswell and Skylake are somewhat attractive, but come with significant baggage. On the plus side, they are clearly bootable and provide full compatibility with modern x86 programs. The single threaded performance is best-in-class and a hypothetical Skylake-based KNL would have amazing performance on workloads with mixed parallelism.
In many respects though, the latency optimizations in a big core that improve single threaded performance are also serious drawbacks. Haswell, Skylake and other big cores are optimized for high frequency (e.g., 3.9GHz), at the cost of density and power efficiency. In contrast, a throughput architecture like Knights Landing emphasizes power efficiency over frequency. It is more efficient to operate at 1-2GHz and a lower voltage, and simply increase the core count to scale up performance – Knights Corner runs at roughly 1GHz.
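The trade-off follows from the standard first-order model of CMOS dynamic power, which scales with capacitance times voltage squared times frequency. The sketch below compares one fast core against several slow cores with the same aggregate clock rate. The voltages and core counts are hypothetical placeholders, not Intel figures, and the model ignores leakage and assumes the workload scales perfectly across cores.

```python
def dynamic_power(cores: int, volts: float, ghz: float, cap: float = 1.0) -> float:
    """First-order dynamic power in arbitrary units: cores * C * V^2 * f."""
    return cores * cap * volts ** 2 * ghz


def aggregate_ghz(cores: int, ghz: float) -> float:
    """Crude throughput proxy: total core-GHz, assuming perfect scaling."""
    return cores * ghz


# Hypothetical operating points (illustrative voltages only).
fast = {"cores": 1, "volts": 1.0, "ghz": 3.9}
slow = {"cores": 3, "volts": 0.7, "ghz": 1.3}

same_throughput = abs(
    aggregate_ghz(fast["cores"], fast["ghz"]) - aggregate_ghz(slow["cores"], slow["ghz"])
) < 1e-9
power_ratio = dynamic_power(**slow) / dynamic_power(**fast)

print(f"equal aggregate GHz: {same_throughput}")
print(f"slow/fast dynamic power ratio: {power_ratio:.2f}")
```

Under these assumed numbers, three slow cores deliver the same aggregate clock rate for roughly half the dynamic power, because the voltage term is squared.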
Haswell and Skylake have lavish branch predictors and prefetchers which are critical for handling spaghetti integer code, but do very little to contribute to throughput workloads. The excellent out-of-order execution (OOOE) burns significant power in structures such as the 60 entry unified scheduler, which must be searched every cycle for operations that can execute in parallel. OOOE is extremely useful for integer instructions and memory accesses, but is less useful for re-ordering FP instructions. Most throughput architectures save this power by switching between threads to tolerate delays (e.g., due to cache misses). Haswell employs two simultaneous threads to hide memory latency. Ideally, a KNL core would have 4-8 threads and far fewer OOO resources. Similarly, features like transactional memory are not likely to be useful for Knights Landing and represent an added implementation cost.
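A textbook model of multithreading shows why more threads can substitute for out-of-order resources. In the sketch below, each thread alternates a burst of useful work with a stall, and the core switches to another ready thread during stalls. The cycle counts are hypothetical and chosen only for illustration, and the model ignores switching overhead and contention.

```python
def core_utilization(threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles the pipeline does useful work when `threads`
    identical threads each alternate compute and stall phases."""
    if threads < 1 or compute_cycles < 1 or stall_cycles < 0:
        raise ValueError("invalid parameters")
    return min(1.0, threads * compute_cycles / (compute_cycles + stall_cycles))


# Hypothetical workload: 20 cycles of work, then a 100 cycle memory stall.
for threads in (1, 2, 4, 8):
    busy = core_utilization(threads, compute_cycles=20, stall_cycles=100)
    print(f"{threads} thread(s): {busy:.0%} busy")
```

With these assumed numbers, two threads leave the pipeline idle about two thirds of the time, while eight threads are enough to cover the stall entirely.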
The cache hierarchy and fabric for the big cores (private L1, L2 and shared L3) are similar to, but subtly different from, those of a throughput architecture. The good news is that the Haswell and Skylake cache hierarchies are quite high bandwidth; Haswell can read 64B and write 32B of data per cycle to the L1D, and the L2 cache can transfer a full 64B line in a single cycle.
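To put the per-cycle figures in perspective, they can be converted to peak per-core bandwidth. The sketch below uses the 3.9GHz example frequency quoted earlier as an assumed clock. These are theoretical peaks, not sustained numbers, and the function name is illustrative.

```python
def peak_bandwidth_gb_s(bytes_per_cycle: int, ghz: float) -> float:
    """Peak bandwidth in GB/s: bytes moved per cycle times cycles per ns."""
    return bytes_per_cycle * ghz


ASSUMED_GHZ = 3.9  # example big-core frequency, used as an assumption

l1d = peak_bandwidth_gb_s(64 + 32, ASSUMED_GHZ)  # 64B read + 32B write per cycle
l2 = peak_bandwidth_gb_s(64, ASSUMED_GHZ)        # one 64B line per cycle

print(f"L1D peak: {l1d:.1f} GB/s per core")
print(f"L2 peak:  {l2:.1f} GB/s per core")
```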
However, throughput architectures generally use algorithms that explicitly partition data and minimize data sharing, while taking advantage of shared instructions. As a result, a flatter and more bandwidth optimized hierarchy is ideal and the shared LLC in Haswell is not particularly useful. KNC uses private L1 caches with inclusive L2 caches; the cores are connected via a ring-based fabric. In contrast, the LLC in Haswell is inclusive, while the L1 and L2 caches are non-inclusive/non-exclusive. KNC’s ring-based fabric is similar to the one used in Haswell, but it is scaled to vastly higher core counts, and each core has directories for the L2 cache lines to reduce snoop traffic. Each KNC L2 cache is also substantially smaller than an LLC slice in Haswell (512KB/core vs. 2MB/core).
Lastly, the single threaded performance in a big core is overkill for the market. There are workload benefits to single threaded performance, since many applications combine serial and parallel kernels. Increasing single threaded performance means that more kernels can stay on Knights Landing, avoiding expensive data copies. However, the real goal for Knights Landing should be ‘sufficient single threaded performance for most kernels’, whereas single threaded performance is the primary goal for a core like Haswell or Skylake that targets client systems.