Haswell System Architecture
There will be a number of implementations built around the Haswell core, ranging from low power SoCs to servers. The majority of the system architecture is product specific and was not described. The interconnect is still a wide ring bus, with 32B/cycle for every stop, and there are a few other general details available.
First, Haswell’s Last Level Cache (LLC) has been enhanced for performance. Each slice of the LLC contains two tag arrays, one for data accesses and one for prefetching and coherency requests. The data bandwidth is still 32B/cycle, since the ring bus has not widened. The memory controller also has a larger re-ordering window, yielding better throughput for write data.
Second, the ring and LLC are in a separate frequency domain from the CPU cores. This enables the ring and LLC to run at high performance for the GPU, while keeping the CPUs in a low power state. This was not possible with Sandy Bridge, where the cores, ring and LLC shared a PLL, which wasted power on some graphics-heavy workloads.
Virtualization has improved for Haswell, with a particular emphasis on eliminating VM exits. The extended page tables incorporate accessed and dirty bits, which reduce VM transitions, and the new VMFUNC instruction enables VMs to use hypervisor functions without an exit. The round-trip latency for VM transitions is now below 500 cycles, and the I/O virtualization page tables are a full 4-level structure.
While Intel has demonstrated substantial improvements in idle and active power for Haswell, there was not sufficient detail for a comprehensive discussion. It is likely that this information will only be available when products come to market, since power management is implementation specific.
Conclusions and Analysis
Looking back over several generations of Intel microprocessors, the changes are remarkable. Merom marked Intel’s ‘right hand turn’, acknowledging that the Pentium 4 was not a viable long term solution because power efficiency is crucial to success. Starting from Merom, Intel’s design teams embarked upon a relentless journey of continuous improvement with each major architectural change.
Intel’s Sandy Bridge core served as an impressive starting point, with unmatched performance in the x86 ecosystem. Haswell builds on this foundation, with powerful ISA extensions and a substantially more aggressive execution core and cache hierarchy. Moreover, Haswell is the first Intel core that will take full advantage of the 22nm FinFET process technology. While the Ivy Bridge graphics architecture was new, its CPU core was mostly unchanged from Sandy Bridge. More importantly, the Ivy Bridge circuit design was focused on a low-risk and faster migration to a new process, rather than achieving peak performance, efficiency or density.
AVX2 doubles integer SIMD to 256-bits, while FMA doubles the number of operations for FP by chaining a multiply and add. Crucially, AVX2 also encompasses instructions for gathering non-contiguous data from memory, which aids compilers and programmers using the x86 SIMD extensions.
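To make the two operations concrete, here is a minimal scalar sketch of what a gather and a fused multiply-add compute, lane by lane. It models semantics only and says nothing about how the hardware executes them; the function names, the lookup table and the index values are all hypothetical and chosen purely for illustration.

```python
# Scalar reference model of a 256-bit gather and FMA (semantics only).
# All names and values here are illustrative assumptions, not Intel APIs.

LANES_32BIT = 256 // 32  # eight 32-bit elements fit in one 256-bit vector


def gather(base, indices):
    """Lane i loads base[indices[i]]; the indices need not be contiguous."""
    return [base[i] for i in indices]


def fma(a, b, c):
    """Lane i computes a[i] * b[i] + c[i], counted as two FP operations.

    Real FMA hardware rounds once; this model rounds twice, which makes
    no difference for the small exact values used below.
    """
    return [x * y + z for x, y, z in zip(a, b, c)]


table = [float(v) for v in range(100)]   # hypothetical lookup table
idx = [3, 97, 10, 42, 0, 55, 8, 21]      # scattered positions, one per lane
assert len(idx) == LANES_32BIT

loaded = gather(table, idx)
result = fma(loaded, [2.0] * LANES_32BIT, [1.0] * LANES_32BIT)
print(loaded)
print(result)
```

Without a gather instruction, a compiler has to build the `loaded` vector from eight separate scalar loads and inserts, which is one reason loops with indirect indexing have traditionally been hard to vectorize.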
Intel’s Transactional Synchronization Extensions (TSX) have a less obvious impact on performance, but should prove more powerful and pervasive in the long run. Hardware Lock Elision separates functional correctness from performance, so that programmers can focus on correctness while the hardware optimizes for performance. Restricted Transactional Memory, on the other hand, offers developers a new programming paradigm for concurrency that is far easier and more intuitive, while improving performance for multi-threaded software.
Turning to the microarchitecture, the Haswell core has a modestly larger out-of-order window, with a substantial increase in dispatch ports and execution resources. Together with the ISA extensions, the theoretical FLOPs and integer operations per core have doubled. More significantly, the bandwidth for the cache hierarchy, including the L1D and L2, has doubled, while reducing utilization bottlenecks. Compared to Nehalem, the Haswell core offers 4× the peak FLOPs, 3× the cache bandwidth, and nearly 2× the re-ordering window.
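The FLOPs ratios can be reproduced with simple arithmetic. The sketch below assumes single-precision operands and per-core pipe counts that are not stated in this section: one multiply and one add pipe at 128 bits for Nehalem, the same pair widened to 256 bits for Sandy Bridge, and two 256-bit FMA pipes for Haswell. Treat those inputs as assumptions; the helper name is hypothetical.

```python
# Back-of-the-envelope peak single-precision FLOPs per cycle per core.
# Pipe counts and widths are assumptions (see lead-in), not vendor data.


def peak_sp_flops_per_cycle(vector_bits, fp_pipes, ops_per_pipe):
    """Lanes per vector x number of FP pipes x FP operations per lane."""
    lanes = vector_bits // 32  # 32-bit single-precision elements
    return lanes * fp_pipes * ops_per_pipe


# Nehalem: 128-bit SSE, one multiply pipe + one add pipe, 1 op each.
nehalem = peak_sp_flops_per_cycle(128, fp_pipes=2, ops_per_pipe=1)       # 8
# Sandy Bridge: 256-bit AVX, one multiply pipe + one add pipe.
sandy_bridge = peak_sp_flops_per_cycle(256, fp_pipes=2, ops_per_pipe=1)  # 16
# Haswell: 256-bit, two FMA pipes, each FMA counts as 2 ops.
haswell = peak_sp_flops_per_cycle(256, fp_pipes=2, ops_per_pipe=2)       # 32

print("Haswell vs Sandy Bridge:", haswell / sandy_bridge)  # 2.0
print("Haswell vs Nehalem:", haswell / nehalem)            # 4.0
```

Under these assumptions, the doubling over Sandy Bridge comes entirely from FMA, while the 4× figure over Nehalem combines the wider vectors with FMA.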
Overall, we estimate that a Haswell core will offer around 10% greater performance for existing software, compared to Sandy Bridge. For workloads using the new extensions, the gains could be significantly higher. In theory, AVX2 and FMA can boost performance by 2×, but the impact on most vectorizable workloads will be much lower. Research from AMD has shown that lock elision gains 30% for the right workloads, although the benefits depend strongly on the actual concurrency.
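One way to see why a theoretical 2× rarely shows up in practice is Amdahl's law: only the fraction of run time spent in code that actually uses the new instructions is accelerated. The sketch below applies the formula to a few hypothetical fractions; these are illustrative inputs, not measurements of any workload.

```python
# Amdahl's law applied to a 2x speedup on the vectorized part of a program.
# The fractions below are hypothetical, chosen only to show the trend.


def overall_speedup(accelerated_fraction, local_speedup):
    """Whole-program speedup when only part of the run time gets faster."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / local_speedup)


for fraction in (0.2, 0.5, 0.8, 1.0):
    gain = overall_speedup(fraction, local_speedup=2.0)
    print(f"{fraction:.0%} of time in AVX2/FMA code -> {gain:.2f}x overall")
```

Even with half of the run time in code that benefits fully, the overall gain is about 1.33×, and memory bandwidth limits, which this model ignores, tend to reduce it further.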
Competitively speaking, Intel is already far ahead of AMD in terms of CPU performance. In 2013, Haswell will be matched up against the Steamroller core, which is a heavily redesigned derivative of Bulldozer. Steamroller still shares the instruction fetching, FP/vector cluster, and L2 cache between cores, but has dedicated decoders that are wider. Realistically, the performance gap should narrow given the scope of opportunities for AMD to improve, but Haswell will continue to have significant advantages.
Haswell will be the first big x86 core to compete against ARM-based cores in tablets. While the performance will be dramatically higher, the power budgets are very different. Haswell SoCs will reach 10W, while competing solutions are often closer to 4W. The real question is the relative efficiency of Haswell SoCs, and the advantage of the massive x86 software ecosystem. Fortunately, Windows 8 provides an opportunity to accurately measure performance and efficiency. The results will inject some hard data into discussions that have been otherwise vacuous and largely driven by marketing.
In summary, Haswell is a superb new architecture that will carry Intel into new markets and a new era of competition, not only from AMD, but also the ARM ecosystem. Ultimately, products will reveal the performance and efficiency advantages of the Haswell family, but the architecture looks quite promising, a testament to Intel’s design team.