IBM RISCs It All
The aforementioned binary translation requirement helps to explain why IBM chose to pursue a high frequency design. Earlier roadmaps showed the POWER6 at 4.8GHz in the third quarter of 2006, with a P6L (light) variant clocking at 5.5GHz in the second quarter. According to the same roadmaps, mainframes based on the Z6/eCLipz running at 4.4GHz are expected in the third quarter of 2007. More recent documents indicate that the target clock speed for the POWER6 has slipped to 4-4.5GHz and that the release date has been pushed back to 2007.
IBM presentations have indicated that the POWER6 will include fixed point decimal support. This was probably another feature added to facilitate mainframe applications, many of which use decimal representations. Binary representations, while faster and more storage efficient, cannot exactly express some decimal fractions, such as one tenth. Financial institutions simply cannot afford the luxury of approximation and have come to rely on mainframes and other systems that support decimal arithmetic. While this problem might have been solved through binary translation, doing so would likely have carried a significant performance penalty. The POWER6 will also support the VMX extensions, although it is unclear how many Linux and AIX applications actually use VMX. Open source applications usually rely on the compiler for SIMD support, and only a few very low-end AIX systems run on VMX capable hardware. One possibility is that IBM intended to sell the POWER6L to Apple for desktop machines before Apple adopted Intel MPUs. An older presentation also indicated that IBM will improve the POWER6’s virtualization capabilities, possibly making the ISA fully virtualizable. Again, this makes sense, as virtualization is one area where mainframes are far ahead of UNIX machines. Furthermore, IBM is working on ViVA-2, the successor to the existing Virtual Vector Architecture, for the POWER6. ViVA lets HPC software treat multiple POWER5/6 MPUs as a single vector processor, which makes it easier to achieve high performance for certain parallelizable workloads; this is distinct from VMX, which is just a SIMD extension to the instruction set.
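The point about binary representations being unable to express a tenth exactly is easy to see in software. The following is a minimal sketch using Python's standard `decimal` and `fractions` modules; it illustrates only the representation issue, and says nothing about how the POWER6 implements decimal arithmetic in hardware. The variable names are ours, chosen for illustration.

```python
from decimal import Decimal
from fractions import Fraction

# Binary floating point: 0.1 is stored as the nearest representable double,
# so adding it ten times does not land exactly on 1.0.
binary_sum = sum([0.1] * 10)

# Decimal arithmetic: 0.1 is represented exactly, so the sum is exactly 1.
decimal_sum = sum([Decimal("0.1")] * 10, Decimal("0"))

# The exact rational value that the double nearest to 0.1 actually holds.
stored_tenth = Fraction(0.1)

print(binary_sum == 1.0)                 # False
print(decimal_sum == Decimal(1))         # True
print(stored_tenth == Fraction(1, 10))   # False
```

Python's `Decimal` is implemented in software, which is the kind of overhead that native decimal support is meant to avoid for financial workloads.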
Our sources have indicated that the POWER6 will be a deeply pipelined 4-issue CPU, with OOO capabilities closer to those of the 604e than to those of the POWER5 or Pentium Pro. Most likely, the POWER6 will be a dual core device, although there is a very slight chance it may be a 4-way CMP. The POWER6 is not as wide as the POWER5, which could issue 8 instructions but was in essence a 4+1 wide CPU (four instructions and a branch); this sacrifice of IPC was needed to reach high frequencies. Similarly, the re-ordering capabilities were probably scaled back for the same reason. The POWER6 will also use simultaneous multi-threading (SMT), most likely with 2 threads. Using 4 threads would require 128 GPRs and 128 FPRs, plus rename registers. The EV8 had a total of 512 physical registers and had to use 3 pipeline stages for register file access, and that was for a 2GHz frequency target. Certainly, things would only get worse at 4GHz. While SMT is not indicated on any roadmaps, there is quite a bit of evidence to support the notion. First, SMT would certainly help to make up for the POWER6’s limited re-ordering capabilities. Second, IBM has already invested a lot of time and money tuning operating systems, tools and software for multi-threading, so the incremental costs should be small. Last, almost all of IBM’s recent designs have included multithreading: the POWER5, the Xbox360, and the CELL PPE. Altogether, the evidence suggests that IBM still strongly believes in SMT.
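The register count behind the 2-thread guess is simple arithmetic. The sketch below assumes the PowerPC architecture's 32 GPRs and 32 FPRs per thread; the function name is hypothetical, and rename registers are deliberately left out because their number for the POWER6 is not known.

```python
# Architected registers per hardware thread in the PowerPC ISA.
ARCH_GPRS_PER_THREAD = 32
ARCH_FPRS_PER_THREAD = 32


def architected_registers(threads):
    """Return (GPRs, FPRs) needed just to hold architected state.

    Rename registers are excluded: their count is unknown here.
    """
    return threads * ARCH_GPRS_PER_THREAD, threads * ARCH_FPRS_PER_THREAD


for threads in (1, 2, 4):
    gprs, fprs = architected_registers(threads)
    print(f"{threads} thread(s): {gprs} GPRs, {fprs} FPRs before renaming")
```

Each added thread grows the register file before any renaming is counted, which is why a 4-thread design looks unattractive at these clock targets.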
Another interesting aspect that our source mentioned is that IBM was placing a much stronger emphasis on circuit design than it had in the past; no doubt reaching 4.4GHz in a 65nm process will require a bit of circuit wizardry. An IBM VP has publicly stated that the POWER6 will use around 750M transistors, and presentations have indicated that the bandwidth for the POWER6 has doubled (probably to 32GB/s). According to a paper from ISSCC, the L1D cache for the POWER6 will be 64KB and 8-way set associative, capable of two reads at up to 5.6GHz. Given that each POWER6 core probably uses 25-60M transistors, and assuming a dual core design, that leaves anywhere from 630M to 700M transistors for caching and system infrastructure (routing, directories, memory controller etc.). Realistically, this means that the POWER6 sports 6 to 12MB of cache, depending on how many transistors are used for system functionality. Various presentations have indicated that the POWER6 will have private L2 caches (like Montecito and the K8), which suggests that there will be a high speed link between the two caches. As with prior generations, the L3 cache will be off-die, but it seems most likely that the L3 controller and tags will reside on-die.
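The transistor budget above can be reproduced with a back-of-the-envelope calculation. This sketch assumes two cores and a classic 6-transistor SRAM cell, and it ignores tags, ECC, redundancy and all system logic, so the result is an upper bound on cache capacity rather than an estimate. The function names and the 6T assumption are ours, not IBM's.

```python
TOTAL_TRANSISTORS = 750_000_000      # figure quoted by the IBM VP
CORES = 2                            # assumption: dual core device
CORE_TRANSISTORS = (25_000_000, 60_000_000)  # estimated range per core
TRANSISTORS_PER_BIT = 6              # assumption: 6T SRAM cell, data bits only


def uncore_budget(total, cores, per_core):
    """Transistors left for caches and system infrastructure."""
    return total - cores * per_core


def max_cache_mib(transistors, per_bit=TRANSISTORS_PER_BIT):
    """Upper bound on cache size if every transistor were an SRAM data bit."""
    bits = transistors / per_bit
    return bits / 8 / 2**20


low = uncore_budget(TOTAL_TRANSISTORS, CORES, CORE_TRANSISTORS[1])
high = uncore_budget(TOTAL_TRANSISTORS, CORES, CORE_TRANSISTORS[0])
print(f"uncore budget: {low / 1e6:.0f}M to {high / 1e6:.0f}M transistors")
print(f"cache upper bound: {max_cache_mib(low):.1f} to {max_cache_mib(high):.1f} MiB")
```

The upper bound comes out a little above 12MB, so once directories, the memory controller and the L3 tags take their share, a 6 to 12MB range is plausible.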
There are three papers from IBM at next year’s ISSCC, one of which details a 5GHz clock distribution network. Another paper describes a 4GHz+ single stage 64b fixed point ALU with a depth of 13 FO4. This implies that IBM has fixed the infamous 2 cycle latency for dependent fixed point operations that resulted from a floor planning mistake in the POWER4/5. The authors indicate that dependent operations can be performed back to back, without inserting pipeline bubbles; no doubt IBM’s compiler writers are thanking their MPU counterparts for this. The same paper also details a 7 stage double precision FPU with a total depth of 91 FO4 (13 FO4 per stage), with forwarding after the 6th cycle. The third paper concerns the L1D cache, which was discussed previously.
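The FO4 figures are internally consistent, as a quick calculation shows. The sketch below assumes that the 13 FO4 figure covers the whole clock cycle, latch overhead included, which the abstract does not confirm; the implied FO4 delay should therefore be read as a rough estimate. The function names are hypothetical.

```python
def cycle_time_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def fo4_delay_ps(freq_ghz, fo4_per_cycle):
    """Implied delay of one FO4 inverter, assuming the FO4 depth spans a full cycle."""
    return cycle_time_ps(freq_ghz) / fo4_per_cycle


def pipeline_depth_fo4(stages, fo4_per_stage):
    """Total logic depth of a pipeline with equal-depth stages."""
    return stages * fo4_per_stage


# 7 FPU stages at 13 FO4 each matches the 91 FO4 quoted for the FPU.
print(pipeline_depth_fo4(7, 13))

# Implied FO4 delay at the 4GHz and 5GHz operating points.
for freq in (4.0, 5.0):
    print(f"{freq}GHz: {cycle_time_ps(freq):.0f}ps cycle, "
          f"{fo4_delay_ps(freq, 13):.1f}ps per FO4")
```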