IBM z196 Mainframe Architecture


Performance

The z196 is the fifth generation of IBM’s 64-bit mainframes and continues to focus on large scale system performance and reliability. It is the first fully out-of-order zArchitecture microprocessor, as shown in Figure 6, and the first mainframe to take advantage of IBM’s eDRAM on a logic process. From a system standpoint, the latter is particularly significant: the eDRAM is reliable enough that the on-chip L3 cache can hold dirty cache lines, which eliminates off-chip write traffic, saving considerable power and improving performance.

Interestingly, IBM’s mainframes have always pursued extremely high frequencies, and the z196 uses extensive hardware tuning of the clock tree to achieve 5.2GHz. One of the key motivations is the mainframe software ecosystem, which is extremely conservative. Software must be recompiled to take full advantage of new instructions and microarchitectures, but higher frequency benefits all applications. The other factor is that IBM’s MCMs can be cooled well beyond the level of commodity systems. Even when aggressively cooled, the ~250W z196 dissipates over 70W of leakage, which is more than the total power budget for many commodity server processors.


Figure 6. z196 and z10 Microarchitectures

IBM does not publish any standard benchmarks for zArchitecture systems, which makes it nearly impossible to compare against other server systems. However, the company has a performance rating scale for mainframe systems that is used both internally and by customers for system sizing. The Large System Performance Reference (LSPR) is a mix of workloads selected to be representative of customer applications, and it takes into account the different operating environments (e.g. OSes, VMs, etc.) available.

According to IBM, the z196 is about 40% faster than the z10 on existing code. The gains are actually larger for memory-bound workloads, a testament to the improvements in system architecture. The actual LSPR measurements for a 10-way z196 show that it is 32% and 44% faster than a similar z10 system for compute and memory-intensive workloads, respectively. The benefits for compute-intensive workloads that are recompiled should be even higher.

Additionally, systems based on the z196 can scale up to 80 cores and 3TB of memory, whereas the previous generation was limited to 64 cores and 1.5TB. So in theory, combining the roughly 40% gain per core with 25% more cores (1.4 × 80/64 = 1.75), the system performance has increased by 75% in a single generation, while keeping energy constant. This is significantly faster than semiconductor scaling would predict; the largest factor is clearly the integrated eDRAM L3 cache, in tandem with the new microarchitecture.
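
The 75% figure is simple to reproduce. The sketch below is only a back-of-the-envelope illustration, not an IBM sizing method: it assumes throughput scales perfectly linearly with core count, and it applies the 10-way LSPR ratios to a full 80-core system, which the LSPR data itself does not claim. The function name `system_scaling` and the dictionary labels are hypothetical.

```python
def system_scaling(per_core_speedup: float, new_cores: int, old_cores: int) -> float:
    """Idealized system throughput ratio.

    Assumes perfectly linear scaling with core count, which real
    systems will not achieve; this is a theoretical upper bound.
    """
    return per_core_speedup * new_cores / old_cores


# Maximum core counts quoted in the text.
Z196_CORES = 80
Z10_CORES = 64

# Per-core speedups quoted in the text. Extrapolating the 10-way
# LSPR results to 80 cores is an assumption made for illustration.
estimates = {
    "IBM headline (~40%)": system_scaling(1.40, Z196_CORES, Z10_CORES),
    "LSPR compute (32%)": system_scaling(1.32, Z196_CORES, Z10_CORES),
    "LSPR memory (44%)": system_scaling(1.44, Z196_CORES, Z10_CORES),
}

for label, ratio in estimates.items():
    print(f"{label}: {ratio:.2f}x, i.e. {100 * (ratio - 1):.0f}% faster")
```

Under these assumptions the headline number gives the 75% quoted above, while the two LSPR ratios bracket it at roughly 65% and 80%.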

The reliability of IBM’s mainframes is nearly unmatched; the only other systems that come close are the Itanium-based NonStop systems. Rather than calling out the reliability features separately, this article has described them in the appropriate section of the processor pipeline. This reflects IBM’s approach, which builds RAS into the microarchitecture from the start using techniques like the recovery and checkpoint unit.

Conclusions

The z196 is a tremendous step forward for IBM’s mainframes in terms of performance. Just as importantly, it is clear that there is considerable design re-use between zArchitecture and the POWER line. This effort was first touted for the POWER6 and z10, which were fairly similar designs, and it continues in the current generation. However, the two architectures have diverged more over time, while still using common building blocks.

Conceptually, the z196 focuses almost exclusively on integer performance without any multi-threading. In contrast, the biggest changes in the POWER7 were doubling the thread count and improving floating point performance. Both have new out-of-order cores and on-die L3 caches, but the implementations are different. For example, the z196 L3 cache is shared with uniform latency, whereas the POWER7 L3 is highly non-uniform, with latency that varies by a factor of 5.

In terms of common building blocks, the two continue to share the scalar floating point units. The eDRAM L3 cache and macros are also shared across at least four different projects, including the POWER7, z196, A2 and BlueGene/Q. The z196 macros are smaller (512Kbit versus 1Mbit) and more numerous to achieve higher bandwidth for the store traffic. The external interfaces for coherency and I/O are probably shared as well.

Design re-use is critical to ensure the long-term economic viability of the relatively low-volume mainframes, by sharing development across product lines with distinct target markets. Looking forward, it will be fascinating to see how IBM’s scientific computing R&D gets migrated into mainframes. Techniques such as 3D integration and silicon photonics have obvious applications for big systems. The bottom line is that the z196 is an excellent addition to the zArchitecture family that leverages many of IBM’s unique capabilities, and mainframes will continue to scale up in performance and reliability for the foreseeable future.

