The POWER of Convergence
For a long time now, the server market has been consolidating under pressure from economic forces. This inexorable drive has left many casualties along the way; processors and systems that lacked the necessary volumes yielded to cheaper and ultimately faster products. Most platforms that did survive, such as OpenVMS, now share hardware with UNIX operating systems to better amortize development costs. IBM took the same approach with the iSeries: OS/400 runs on basically unmodified POWER5-based pSeries systems. In the case of OS/400 this was relatively easy, because the underlying instruction set resembles intermediate code, so only select parts of the OS had to be explicitly ported to PowerPC. The common hardware between the iSeries and pSeries is extremely important from a financial perspective. Developing a new processor costs at least $40M. A new system (excluding the MPU) requires multiple ASICs, chassis and interconnects, and can be just as expensive; IBM's X3 servers, for example, required $100M to develop. Now all of these development costs are shared between the iSeries and pSeries, and the volume for the underlying hardware increases, which is essential in the semiconductor business. In essence, the only costs unique to the iSeries are in software, which is a very nice situation for IBM.
Even today, some hardware is shared across the major IBM product lines. For example, the IOMMU (or Translation Control Entries in IBM terminology) used in the various servers is identical, although operating system support for xSeries IOMMUs lags behind that of the pSeries or zSeries. Current zSeries systems, however, are significantly different from the other product lines. To give a small flavor of the differences:
- Support for decimal numbers, in packed and zoned formats
- Support for IBM’s hexadecimal floating point
- System Assist Processors (SAPs), identical to the CPU, but used exclusively for managing I/O
- Coprocessors for everything from cryptography to Java
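To make the first two items concrete, here is a small sketch (in Python; the function names are mine, purely illustrative) of how a packed decimal value and a short (32-bit) hexadecimal floating point number are decoded. Packed decimal stores two BCD digits per byte with the sign in the final nibble; IBM's hexadecimal floating point uses an excess-64 exponent with a base of 16 rather than 2.

```python
def decode_packed(data: bytes) -> int:
    """Decode packed decimal: two BCD digits per byte, with the
    final nibble holding the sign code (0xB or 0xD = negative)."""
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0xF)
    sign = nibbles.pop()  # the last nibble is the sign, not a digit
    magnitude = int("".join(str(d) for d in nibbles))
    return -magnitude if sign in (0xB, 0xD) else magnitude

def decode_hfp_short(word: int) -> float:
    """Decode a short (32-bit) hexadecimal float: 1 sign bit, a 7-bit
    excess-64 exponent, and a 24-bit fraction with the radix point on
    the left; value = (-1)^s * 0.fraction * 16^(exponent - 64)."""
    sign = -1.0 if word >> 31 else 1.0
    exponent = ((word >> 24) & 0x7F) - 64
    fraction = (word & 0xFFFFFF) / float(1 << 24)
    return sign * fraction * 16.0 ** exponent

print(decode_packed(bytes([0x12, 0x3C])))  # -> 123
print(decode_hfp_short(0x41100000))        # -> 1.0
```

Note the effect of the base-16 exponent: 1.0 is encoded with fraction 0x100000 (i.e. 1/16) and exponent 65, since each exponent step scales by 16 instead of 2, which is also why hexadecimal floating point can lose up to three leading bits of precision compared to a binary format.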
The most basic zSeries system contains an MCM with 16 chips: one is used as a system timer, one as a storage controller, two for memory control, four as L2 caches, and the remaining eight contain 12 CPUs (two of which are configured as SAPs).
One rumor that rings true is that, according to IBM roadmaps, elements of the eCLipz project will be powering all three proprietary systems. However, to suggest that all of the zSeries functionality could be consolidated onto a single chip is rather unlikely, even in a 65nm process. IBM mainframes, like HP's NonStop systems, are clearly different enough that rebadged UNIX boxes will not suffice. Moreover, the far higher volume pSeries systems do not require the same level of infrastructure, so it seems unlikely that IBM would burden their higher volume designs with excessive hardware (think of the financial corollary to Amdahl's law: make the common case cheap (iSeries and pSeries) and the uncommon case functional (zSeries)). Based on IBM roadmaps, the most likely scenario is that the Z6 processing units (IBM terminology for a mainframe CPU) in the next generation will be identical, or nearly so, to the POWER6. In that situation, the mainframe itself would use several other support chips, which may be part of the eCLipz project as well.
The designers of the POWER6 were faced with a rather unusual set of goals and requirements. IBM has little intention of relinquishing its leadership in this market, so UNIX and Linux system performance figured prominently on the list. RAS features are also essential, even more so than in past generations, as the POWER6 would need to be on par with existing mainframes. The most difficult requirement is achieving high performance for zSeries applications under binary translation (BT). Binary translation is the dynamic emulation of one instruction set (z/Architecture in this case) by another (PowerPC). Transmeta used this technique to run x86 applications and operating systems, and most interpreted or JITed languages, such as Java or .NET, can be viewed as binary translation from byte code to the host instruction set. Unfortunately, binary translation usually reduces the amount of parallelism in the instruction stream, because the translator only looks at a small snippet of code at a time, whereas a compiler can examine the entire application. Consequently, a very wide issue CPU like the POWER4/5 will have little parallelism to work with and relatively low performance. Narrower, faster CPUs are a much better fit for a workload that involves binary translation; both out-of-order execution and simultaneous multithreading help as well.
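The block-at-a-time nature of the process can be sketched as follows. This toy translator (in Python, with a three-instruction guest ISA invented for illustration; a real system like Transmeta's would emit native host code, not closures) translates one guest basic block at a time, caches the result, and re-executes cached blocks on later visits. Each translation unit sees only the handful of instructions up to the next branch, which is exactly why so little instruction-level parallelism is exposed to the host.

```python
# A minimal dynamic binary translation sketch. The "guest" ISA here is
# invented for illustration and is not z/Architecture.

def translate_block(code, start):
    """Translate one guest basic block (up to a branch) into a host
    function. Only this small snippet is visible to the translator."""
    ops, pc = [], start
    while pc < len(code):
        op, a, b = code[pc]
        if op == "addi":       # regs[a] += b
            ops.append(lambda r, a=a, b=b: r.__setitem__(a, r[a] + b))
        elif op == "mul":      # regs[a] *= regs[b]
            ops.append(lambda r, a=a, b=b: r.__setitem__(a, r[a] * r[b]))
        elif op == "jnz":      # conditional branch ends the basic block
            def block(r, ops=ops, a=a, taken=b, ft=pc + 1):
                for f in ops:
                    f(r)
                return taken if r[a] != 0 else ft
            return block
        pc += 1
    def block(r, ops=ops, end=pc):  # fell off the end of the program
        for f in ops:
            f(r)
        return end
    return block

def run(code, regs):
    """Execute guest code, translating each block once and caching it."""
    cache, pc = {}, 0
    while pc < len(code):
        if pc not in cache:
            cache[pc] = translate_block(code, pc)
        pc = cache[pc](regs)  # returns the next guest pc
    return regs

# Guest loop: add 2 to r1 and decrement r0 until r0 reaches zero.
prog = [("addi", 1, 2), ("addi", 0, -1), ("jnz", 0, 0)]
print(run(prog, {0: 3, 1: 0}))  # -> {0: 0, 1: 6}
```

The translation cache is what makes BT viable at all: the loop body above is translated once and then re-executed from the cache. But within each cached block the host CPU still sees only two or three dependent guest operations at a time, so a wide machine has little to issue in parallel, while a narrower, faster, out-of-order core keeps busy.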