Intel’s small cores, such as the Silvermont core, represent another option that seems reasonable at first glance. However, a deeper analysis shows a fundamental mismatch.
Looking at the positives, Silvermont, Goldmont, and their ilk are fully bootable and offer good x86 compatibility. The Silvermont core is much closer to the sweet spot for a throughput architecture in terms of balancing power, area, and latency. Its single threaded performance is a significant step forward compared to Knights Corner, though not as extreme as the Haswell family’s. The core is carefully tailored for the density and power consumption expected in a mobile device, operating at a peak frequency of 2.4GHz on a 22nm process, compared to 3.9GHz for Haswell. Silvermont’s out-of-order execution also seems to be a natural fit with the goals for Knights Landing.
Silvermont’s integer and memory clusters are out-of-order, whereas the FP/vector cluster is separate and in-order. For throughput workloads, that is absolutely the right balance. The out-of-order memory accesses enable a stall-on-use design that can overlap multiple memory accesses within a single thread, while avoiding the power wasted reordering FP and vector operations, which rarely stall. In theory, the Silvermont FPU could be replaced with a full-blown 512-bit FMA unit with little disruption to the core pipeline. However, that view is deceptively simplistic and does not hold up to reality, as we will see shortly.
The first challenge with Silvermont is that the core is not multithreaded at all. Adding 4-8 threads is a significant endeavor that would require redesigning substantial portions of the pipeline to add registers and control logic, not to mention the inherent challenges in validation. More importantly, the entire memory hierarchy for Silvermont is utterly inappropriate for a throughput architecture like Knights Landing. The 24KB L1D cache is optimized for integer latency, and is limited to 16B/cycle read + 16B/cycle write bandwidth. The KNL core needs at least 64B/cycle read + 64B/cycle write to match the previous generation and saturate a single vector FMA pipeline. Moreover, the L1D in Knights Corner was 32KB and 8-way associative to work with 4 threads; the 6-way associativity in Silvermont is simply insufficient and would cause excessive conflict misses. The Silvermont load/store pipeline is also tightly coupled to the integer execution units to reduce latency, whereas in Knights Landing the wire delay to the vector unit is probably more important from a power and performance standpoint.
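The L1D bandwidth gap can be illustrated with some back-of-the-envelope arithmetic. This is our own illustration rather than anything from Intel, and it assumes a streaming kernel that loads one 64B vector operand and stores one 64B result per FMA issued:

```python
# Back-of-the-envelope check of the L1D bandwidth mismatch described above.
# All figures are in bytes per cycle. The "required" numbers assume one 64B
# vector load and one 64B vector store per FMA, a common streaming pattern.

VECTOR_BITS = 512
vector_bytes = VECTOR_BITS // 8             # 64B per 512-bit register

silvermont_l1d = {"read": 16, "write": 16}  # Silvermont L1D port widths
knl_required = {"read": vector_bytes, "write": vector_bytes}

for port in ("read", "write"):
    shortfall = knl_required[port] // silvermont_l1d[port]
    print(f"{port}: need {knl_required[port]}B/cycle, "
          f"have {silvermont_l1d[port]}B/cycle -> {shortfall}x short")
```

In other words, both L1D ports would need to be widened by 4× before a 512-bit FMA pipe could be fed, which is exactly the kind of wholesale redesign the text describes.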
Moreover, the L2 cache in Silvermont is a very poor fit for Knights Landing. First, it is shared between two adjacent cores, which does not match well with the data partitioning expected in a throughput architecture. Second, the L2 cache bandwidth is a mere 32B/cycle, or 16B/cycle per core – a quarter of what is necessary. Just as with the L1D cache, increasing the bandwidth by 4× is quite complicated. Third, the L2 cache in Knights Corner is inclusive to prevent coherency traffic from disturbing the L1 caches. In contrast, Silvermont has a non-inclusive/non-exclusive policy with respect to the L1 caches, since there is relatively little coherency traffic in most Silvermont-based designs. Lastly, the fabric for Silvermont is poorly matched to the bandwidth and scalability needs of Knights Landing. It is a switched fabric with 16B read + 16B write per cycle that scales to 4 clusters, with each cluster comprising two cores and a shared L2 cache. In contrast, Knights Landing will have somewhere between 64 and 128 cores, and the existing ring-based interconnect has substantially more bandwidth.
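The L2 and fabric shortfalls can be quantified the same way. Again, this is our own arithmetic based on the figures above, with the 64B/cycle per-core target taken from the earlier L1D discussion:

```python
# Per-core share of Silvermont's shared L2 bandwidth, versus the 64B/cycle
# a 512-bit vector pipeline would like to see (figures from the text above).

l2_bandwidth = 32                  # B/cycle for the shared L2
cores_per_l2 = 2                   # two cores share each L2
per_core = l2_bandwidth // cores_per_l2
target = 64                        # B/cycle target per core

print(f"per-core L2 bandwidth: {per_core}B/cycle "
      f"({per_core * 100 // target}% of the {target}B/cycle target)")

# The switched fabric tops out at 4 clusters of 2 cores each,
# far short of the 64-128 cores Knights Landing is expected to have.
max_fabric_cores = 4 * cores_per_l2
print(f"fabric scalability: {max_fabric_cores} cores max vs 64-128 needed")
```

The per-core L2 share lands at one quarter of the target, matching the text, and the fabric falls roughly an order of magnitude short on core count alone.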
Turning to several systemic issues, the reliability necessary for Knights Landing is much higher than for Silvermont, since it will be deployed in systems containing 10K or more nodes. This has an impact on pervasive, core-wide functionality such as the Machine Check Architecture, as well as the soft error protection for latches and arrays spread throughout the core. Another systemic factor is that the small cores are generally optimized for low idle power, whereas that simply is not an important design point for a throughput design that will target 200-300W products. Active power is a far more important characteristic for Knights Landing.
The real problem with the small cores is that the cumulative weight of the changes is too great. As discussed above, the entire memory hierarchy would have to be redesigned, as would the FP and vector unit. The out-of-order control and scheduling logic would need to be heavily adapted to support multithreading. At that point, the only stone left unturned would be the integer execution units, which are a small portion of the overall design.
The third readily available option is the KNC core, which represents the path of least resistance. The KNC core is largely an enhanced version of the P54C augmented with a 512-bit vector unit. It offers higher single threaded performance than the scalar cores in Nvidia or AMD GPUs, but still leaves much to be desired for many workloads. It was selected largely for availability and time to market, rather than any sort of optimal design for a throughput architecture. The KNC core is not bootable, and is a simple stall-on-miss design that only sustains one load miss per thread. While it could be enhanced to be bootable, it would be a better idea to invest the effort in a new design that is more consistent with the overall project goals for Knights Landing. More outstanding load misses, higher single threaded performance, and greater efficiency are all quite feasible in a new design.
The last major option is the Quark core, which is basically a modernized 486 microarchitecture with a slightly more modern (but still 32-bit only) instruction set. There is no scenario in which this is a viable option, given the 32-bit limitation and minimal single threaded performance.
A Custom Core for Knights Landing
Looking at the options available to the architects for Knights Landing, it becomes clear that none of the four existing cores is a great fit with the project goals. The existing Knights Corner core and Quark are simply non-starters. Big cores like Haswell and Skylake are a reasonable fit in terms of the memory hierarchy and scalability, but are too heavily optimized for single threaded performance, which costs power and area efficiency and ultimately throughput. On the other hand, small cores like Silvermont and Goldmont are too difficult to adapt to high throughput and large core counts.
In our estimation, the only option to achieve the right balance of single threaded performance, throughput, power efficiency, and reliability is for Intel’s architects to design a custom core for Knights Landing. In the next part of this series, we will explore what the Knights Landing core might look like in detail.