Knights Landing Details


Knights Landing Fabric and System Infrastructure

Moving one step outwards, the next major question is the nature of the fabric and system infrastructure in Knights Landing (and Skylake). The rumors suggest a 2D mesh-based interconnect, 8 memory controllers for on-package eDRAM, 6 channels of DDR4 memory and 36 lanes of PCI-E 3.0. Each of these elements seems quite reasonable, but again, there are crucial details missing.

Like its predecessors, Skylake-EX will use the QuickPath Interconnect for coherent communication between multiple processors, presumably with support for 2-8 sockets. Since Skylake-EX and Knights Landing will presumably share system infrastructure, this means that Knights Landing could take advantage of QPI – and there are a number of reasons to believe that this will happen.

First, Intel plans to offer variants of Knights Landing that are tightly integrated with the low latency Aries interconnect fabric (acquired from Cray in 2012). While the rumors suggest that the Aries ASIC (code named Storm Lake) will connect to Knights Landing via PCI-E, it would be far more intelligent to use QPI. The latency difference between the two is quite significant; PCI-E round-trip latency is roughly 1µs, compared to 40ns or less for QPI. Ultimately, lower latency translates into better system scalability and higher performance.

Second, while Knights Landing can act as a bootable CPU, many applications will demand greater single threaded performance due to Amdahl’s Law. For these workloads, the optimal configuration is a Knights Landing (which provides high throughput) coupled to a mainstream Xeon server (which provides single threaded performance). In this scenario, latency is critical for communicating results between the Xeon and Knights Landing. This is also a huge competitive advantage over Nvidia GPUs, which do not have QPI and therefore must rely on PCI-E for connecting to the host Xeon processor.
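The Amdahl's Law argument can be made concrete with a minimal sketch. The function name `amdahl_speedup` and the parallel fractions below are illustrative assumptions, and the 72-unit count is simply the core count implied by this article's own cache estimates (144MB of LLC at 2MB/core), not a confirmed specification.

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Speedup predicted by Amdahl's Law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


if __name__ == "__main__":
    # Hypothetical workloads; 72 is the core count implied by the LLC estimate.
    for p in (0.90, 0.99, 0.999):
        print(f"parallel fraction {p}: {amdahl_speedup(p, 72):.1f}x on 72 cores")
```

Even a small serial fraction caps the achievable speedup, which is why pairing a throughput part with a core that runs the serial portion quickly can pay off.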

Third, QPI is cache coherent, meaning that Knights Landing could easily share data with a host Xeon processor; in contrast, GPUs must explicitly send and receive data over PCI-E, which is slow and forces an unnatural programming model upon developers. Migrating code to a cache coherent accelerator is vastly simpler than using a GPU or FPGA. In essence, this lets Intel sidestep the entire problem that Nvidia is facing with CUDA: porting code to a new architecture is a massive investment of time and energy.

Last, it is possible that some systems might directly connect multiple Knights Landing processors together via QPI. While this seems unlikely, using a 4S Knights Landing building block could reduce the complexity of the system interconnect.

Assuming that Knights Landing is configured with QPI, then it is only logical to assume that it will also feature a large last level cache (LLC). As with nearly all Intel server designs, the LLC will be inclusive and shared by all cores – thus it acts as a snoop filter, reducing coherency traffic and improving system scalability. The total size of the LLC should be around 144MB, or 2MB/core, which corresponds to roughly 250mm² on a 14nm process. While this is fairly large, Knights Landing is probably a 700mm² device, and spending about a third of the die area on cache is a very reasonable design choice as discussed in an earlier article on microservers. Assuming 256KB L2 cache per core, the inclusive LLC is unlikely to be any smaller than 2MB/core to maintain an 8:1 capacity ratio. The LLC organization is unknown, but one possibility is 36 distributed slices that are accessed in parallel and can supply 32B/clock for each KNL core. Another alternative is that the LLC is partitioned along the lines of the eDRAM or memory controllers.
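The sizing argument above is simple arithmetic, sketched here so the numbers can be checked or varied. Every input is one of the article's estimates rather than a confirmed specification, and the variable names are illustrative.

```python
KB = 1024
MB = 1024 * KB

# Inputs taken from the estimates in the text (none are confirmed figures).
L2_PER_CORE = 256 * KB      # assumed L2 capacity per core
LLC_TO_L2_RATIO = 8         # inclusive LLC sized at 8:1 over the L2
LLC_TOTAL = 144 * MB        # estimated total LLC capacity
LLC_SLICES = 36             # one possible slice count
LLC_AREA_MM2 = 250          # estimated LLC area on 14nm
DIE_AREA_MM2 = 700          # estimated total die area
BYTES_PER_CLOCK_PER_CORE = 32

# Derived quantities.
llc_per_core = L2_PER_CORE * LLC_TO_L2_RATIO          # bytes of LLC per core
implied_cores = LLC_TOTAL // llc_per_core             # cores implied by the totals
slice_size = LLC_TOTAL // LLC_SLICES                  # bytes per LLC slice
llc_area_fraction = LLC_AREA_MM2 / DIE_AREA_MM2       # share of the die spent on LLC
aggregate_bytes_per_clock = implied_cores * BYTES_PER_CLOCK_PER_CORE

if __name__ == "__main__":
    print(f"LLC per core:       {llc_per_core // MB} MB")
    print(f"Implied core count: {implied_cores}")
    print(f"LLC slice size:     {slice_size // MB} MB")
    print(f"LLC share of die:   {llc_area_fraction:.0%}")
    print(f"Aggregate LLC read: {aggregate_bytes_per_clock} B/clock")
```

The estimates are mutually consistent: 2MB/core and 144MB imply 72 cores, so 36 slices would each serve two cores.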

The Knights Landing fabric is rumored to be a 2D mesh, which is consistent with Intel’s earlier research directions on the Terascale project. No further details are available, although a mesh suggests a non-uniform cache design, since the distance between a core and an LLC slice depends on where each sits in the mesh. The biggest outstanding questions relate to the interactions between the fabric, tiles, and the LLC, for instance:

  1. Are there separate data paths (or virtual channels) for requests, acknowledgements, snoops and data?
  2. What kind of flow control is used?
  3. What kind of communication flows are packetized versus circuit switched?
  4. How are L2 misses handled?
  5. How much does LLC latency vary?
  6. Are there directories for the LLC slices?
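To give a feel for why a mesh implies non-uniform LLC latency (question 5), here is a rough sketch that counts hops between tiles. It assumes a 6×6 arrangement of 36 tiles, one LLC slice per tile, and dimension-ordered (XY) routing; none of these are stated in the rumors, and the function names are hypothetical.

```python
from itertools import product


def mesh_hops(src: tuple, dst: tuple) -> int:
    """Hop count between two (x, y) tiles under XY routing (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hop_stats(width: int, height: int) -> tuple:
    """Return (min, mean, max) hop count over all source/destination tile pairs."""
    tiles = list(product(range(width), range(height)))
    hops = [mesh_hops(a, b) for a in tiles for b in tiles]
    return min(hops), sum(hops) / len(hops), max(hops)


if __name__ == "__main__":
    # Assumed 6x6 layout; the real tile arrangement is unknown.
    lo, mean, hi = hop_stats(6, 6)
    print(f"hops: min={lo}, mean={mean:.2f}, max={hi}")
```

Under these assumptions a request may travel anywhere from zero hops (local slice) to ten hops (opposite corners), so LLC latency would depend heavily on how addresses are hashed across slices.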

When it comes to the eDRAM and DDR4 memory, the rumors seem both straightforward and accurate. According to the leaked slides, Knights Landing has 16GB of eDRAM that delivers >500GB/s of memory bandwidth. While this might seem excessive (in terms of capacity), it has the advantage of providing excellent performance compatibility with the previous generation. Knights Corner uses 16GB of GDDR5, with 352GB/s of memory bandwidth; the eDRAM for Knights Landing guarantees that any workload with a smaller (<16GB) data set will see a significant performance gain. Workloads with >16GB of data can take advantage of the DDR4 memory in Knights Landing, which is rumored to have a maximum capacity of 384GB. The only real question about the eDRAM is the organization. Since the eDRAM can reportedly be used as a cache, it must have tag arrays, and it is unclear exactly where the KNL designers managed to put tags for 16GB of cache.
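The tag question can be illustrated with a back-of-the-envelope estimate. The sketch below assumes a direct-mapped cache, a 46-bit physical address, and 2 state bits per line; the line sizes are also assumptions, since the actual eDRAM cache organization is unknown. The helper `tag_storage_bytes` is hypothetical and expects power-of-two sizes.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def tag_storage_bytes(cache_bytes: int, line_bytes: int, phys_addr_bits: int,
                      ways: int = 1, state_bits: int = 2) -> int:
    """Rough tag-array size for a cache with power-of-two geometry.

    Each line stores (tag bits + state bits); ways=1 means direct-mapped.
    """
    lines = cache_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    tag_bits = phys_addr_bits - index_bits - offset_bits
    return lines * (tag_bits + state_bits) // 8


if __name__ == "__main__":
    # Assumed parameters: 46-bit physical addresses, direct-mapped.
    for line in (64, 1024):
        size = tag_storage_bytes(16 * GB, line, phys_addr_bits=46)
        print(f"{line:>5}B lines: {size // MB} MB of tags")
```

With 64B lines these assumptions give 448MB of tags, several times the 144MB LLC estimated above, while 1KB lines bring it down to 28MB. That gap is why the placement and granularity of the tags is an open question.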

Summary

The rumors and leaks about Knights Landing are an excellent starting point for discussion. The leaked information is neither complete nor fully accurate, and probably comes from outdated presentations. But many of the rumors hang together correctly and help to draw an overall picture that careful analysis can complete.

Table 1. Estimated Knights Landing performance comparison

Table 1 shows estimates of the critical characteristics of the 14nm Knights Landing, compared to known details of the 22nm Knights Corner, Haswell, and Ivy Bridge-EP. The estimates for Knights Landing differ from the rumored specifications primarily in the capacity of the shared L2 cache, which is estimated to be 512KB, rather than 1MB. It is possible, although extremely unlikely, that the shared L2 cache is 256KB. The analysis also incorporates several other critical factors which were not mentioned in any rumors, specifically cache read bandwidth and the large shared L3 cache. The L3 cache is estimated as eight times the size of the L2 caches, or 144MB (in the unlikely scenario that the L2 cache is 256KB, the L3 cache is likely to be proportionately smaller).

If the analysis is correct and Intel can execute on these plans, then Knights Landing will be a technical tour de force when it is released, probably in 2015. We estimate that Knights Landing will improve raw FLOP/s by >2.5×, but the most significant changes are in the memory hierarchy where cache bandwidth has jumped even more and the on-die capacity has increased by nearly a factor of 5. The single threaded performance will also be substantially higher with the move to the out-of-order core derived from Silvermont, enabling many more workloads to stay resident on Knights Landing. In fact, it is quite possible that the x86 core may be able to turbo to significantly higher frequencies when the vector units are inactive (e.g., on scalar, latency-sensitive code), further boosting single threaded performance.

The cost of all this performance is silicon die area. To date, the largest chip that Intel has manufactured in volume is Tukwila, which was a staggering 700mm². Knights Landing is probably about the same size and possibly even bigger (if Intel has figured out how to increase the reticle limit). On top of that, the data arrays alone for 16GB of eDRAM are 1000mm²; once the control logic and I/Os are accounted for, the total is likely to be 1200mm² (albeit spread across multiple chips).

There are two clear takeaways. First, the level of investment is quite spectacular and demonstrates that Intel considers the HPC market to be absolutely vital and will not let Nvidia’s advances go unchecked. Second, the significant gains in performance are much larger than the 22nm to 14nm transition can explain; this implies that Knights Corner suffered from a number of challenges. The most reasonable theory is that the Knights Corner team was limited by the available resources and time to market. Knights Landing leverages Intel’s investments (e.g., eDRAM, Skylake fabric and uncore) much more intelligently and the overall product is much more closely tailored to the needs of the market.

Our analysis of Knights Landing is by nature speculative, as it is based upon rumors and leaks. However, it is also highly informed and rational and adds significant clarity to the publicly available information. The reality is that Knights Landing will not be publicly described until Hot Chips or SuperComputing later in 2014. Until then, key customers will know the truth and the rest of the world will keep on guessing. What is apparent though is that Intel takes HPC quite seriously and is willing to make tremendous investments to succeed in this market – while Intel’s competitors may not welcome a competently executed product, HPC customers will certainly reap the benefits and we will all look forward to benchmarks later this year.

