The PA6T Processor Core
The PA6T is a deeply pipelined, out-of-order, superscalar core that is fully PPC v2.04 compatible and bi-endian, with hypervisor support and the VMX SIMD instructions (AltiVec). While PowerPC is a RISC instruction set, it includes quite a few CISCy instructions. In the POWER4/5 and PPC970, these instructions were handled through a combination of microcode and software traps. However, software traps are somewhat more problematic in the embedded world, where real-time applications often require guaranteed latencies. The PA6T executes most instructions natively, but the more complicated ones are broken into multiple micro-ops, similar to the P6, P4 and K8 cores.
Each PA6T core is capable of fetching up to 4 instructions per cycle, issuing 3 and retiring 4. The scheduler can support up to 64 in-flight instructions, and each one is individually tracked. While other PowerPC implementations have used VLIW-like groups to track instructions, the group structure complicates exception handling and is unsuitable for the embedded world. Each cycle, up to 3 instructions are issued to the execution units: an integer unit, a branch unit that can also handle integer instructions, a double precision FPU (which supports multiply-accumulates), a simple VMX unit that handles simple integer and permute SIMD instructions, and a complex VMX unit that executes floating point and complex integer SIMD instructions. The basic integer pipeline is 14 cycles, and the branch misprediction penalty is 12 cycles, as shown below in Figure 3, a combined pipeline and block diagram.
Figure 3 – Pipeline and Block Diagram for the PA6T (modified from a P.A. Semi presentation)
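To get a feel for what the 12-cycle misprediction penalty means in practice, the standard textbook stall model can be sketched in a few lines of Python. Only the penalty and the 3-wide issue limit come from the text above; the branch fraction and misprediction rate are illustrative assumptions, not P.A. Semi figures, and the function name is hypothetical.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles=12):
    """Average cycles per instruction once misprediction stalls are added.

    penalty_cycles defaults to the 12-cycle figure quoted for the PA6T;
    every other input is supplied by the caller.
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 1 in 5 instructions is a branch, 5% of them mispredict.
# The base CPI of 1/3 is the ideal case where 3 instructions issue every cycle.
cpi = effective_cpi(base_cpi=1 / 3, branch_fraction=0.20, mispredict_rate=0.05)
print(f"CPI = {cpi:.3f}, IPC = {1 / cpi:.2f}")
```

Even under these fairly friendly assumptions, the misprediction term is a sizeable fraction of the ideal CPI, which is why the predictor matters so much on a 3-issue core.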
The PowerPC instruction set has a somewhat byzantine memory organization, so there are some terms in the above diagram that could use clarification. In particular, the ERAT (Effective to Real Address Translation) is somewhat like an L1 TLB, and as the diagram shows, there are separate ones for data and instruction space. The SLB is the segment lookaside buffer, which is also similar to a TLB and is used in the rather complicated effective-to-virtual-to-real address translation process. The block labeled HTW is the hardware page table walker, which handles TLB misses.
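The relationship between these structures is easier to see in code. The following is a minimal sketch of the lookup order, not a model of the PA6T's actual hardware: the class and field names are hypothetical, 4KB pages and 256MB segments are assumed, and a plain dictionary stands in for the hashed page table that the HTW would really walk.

```python
PAGE_BITS = 12      # assumption: 4KB pages
SEGMENT_BITS = 28   # assumption: 256MB segments
PAGES_PER_SEGMENT_BITS = SEGMENT_BITS - PAGE_BITS


class Translator:
    def __init__(self, slb, page_table):
        self.slb = slb                # effective segment ID -> virtual segment ID
        self.page_table = page_table  # virtual page number -> real page number
        self.tlb = {}                 # cache of page_table entries
        self.erat = {}                # effective page number -> real page number

    def translate(self, ea):
        offset = ea & ((1 << PAGE_BITS) - 1)
        epn = ea >> PAGE_BITS
        # Fast path: the ERAT maps effective pages straight to real pages.
        if epn in self.erat:
            return (self.erat[epn] << PAGE_BITS) | offset
        # Slow path, step 1: the SLB turns the effective segment into a
        # virtual segment. A missing entry raises KeyError in this sketch.
        vsid = self.slb[ea >> SEGMENT_BITS]
        page_in_segment = epn & ((1 << PAGES_PER_SEGMENT_BITS) - 1)
        vpn = (vsid << PAGES_PER_SEGMENT_BITS) | page_in_segment
        # Step 2: TLB lookup; on a miss, the dictionary read below stands in
        # for the hardware table walker refilling the TLB.
        if vpn not in self.tlb:
            self.tlb[vpn] = self.page_table[vpn]
        rpn = self.tlb[vpn]
        self.erat[epn] = rpn  # remember the end-to-end result
        return (rpn << PAGE_BITS) | offset


# Made-up mappings: segment 0x1 -> VSID 0xABC, virtual page 0xABC0005 -> real page 0x77.
t = Translator(slb={0x1: 0xABC}, page_table={0xABC0005: 0x77})
print(hex(t.translate(0x10005123)))  # first access walks SLB, TLB and page table
print(hex(t.translate(0x10005FFF)))  # same page again: served from the ERAT
```

The point of the ERAT is visible in the second call: once a page has been translated, the segment and page lookups are skipped entirely.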
Each PA6T core contains a 64KB instruction cache and a 64KB data cache, both of which are 2-way set associative and ECC protected. The data cache has a single port, providing up to 32GB/s of bandwidth with a 4-cycle load-to-use latency (remote L1 accesses have a 30-cycle load-to-use latency). The L2 cache is shared by all the cores, and is accessed through the on-die CONEXIUM interchange. Since the number of cores varies, it is entirely logical that the L2 cache size was similarly designed to scale from 512KB to 8MB. The first system to ship, the PA6T-1682M, will use a 2MB, 8-way associative cache that is ECC protected. The L2 cache has both a read and a write port, so it can sustain 32GB/s of bandwidth, and it has a 22-cycle load-to-use latency. In total, the memory hierarchy is capable of supporting up to 16 in-flight requests. Moreover, since many embedded applications are designed and written for a specific architecture, the PA6T’s L2 cache is fully programmable and allows explicit management to override the ordinary hardware management algorithms.
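The latency figures above can be combined into a rough average load latency. This sketch treats the 4-cycle and 22-cycle load-to-use numbers as the total cost of a hit at each level; the hit rates and the 200-cycle memory latency are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
def average_load_latency(l1_hit_rate, l2_hit_rate,
                         l1_cycles=4, l2_cycles=22, memory_cycles=200):
    """Expected load-to-use latency in cycles for a two-level cache hierarchy.

    l2_hit_rate is the local hit rate, i.e. the fraction of L1 misses
    that hit in the L2. memory_cycles is an assumed figure.
    """
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return (l1_hit_rate * l1_cycles
            + l1_miss * l2_hit_rate * l2_cycles
            + l1_miss * l2_miss * memory_cycles)


# Assumed workload: 95% of loads hit the L1, 80% of the rest hit the L2.
print(f"{average_load_latency(0.95, 0.80):.2f} cycles per load on average")
```

Plugging in different L2 hit rates gives a quick sense of why a configurable L2 size, and explicit control over what stays in it, is attractive for embedded workloads with known access patterns.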
For all CPU designs, branch prediction is a key element; for low power designs, however, it is even more essential. Mispredicted branches waste a substantial amount of power, because any work done past the branch is squashed and cannot be used. As a result, the PA6T is lavishly equipped with both a 16-entry next line predictor and a tournament predictor that uses a 32Kbit branch history table.
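For readers unfamiliar with the term, a tournament predictor runs two different predictors side by side and uses a third table to pick whichever has been more accurate for a given branch. The sketch below is a generic textbook version, assuming a bimodal predictor, a gshare predictor and 2-bit saturating counters; the source does not describe how the PA6T organizes its tables, so the components, sizes and names here are all hypothetical.

```python
class TournamentPredictor:
    def __init__(self, index_bits=10):
        size = 1 << index_bits
        self.mask = size - 1
        self.bimodal = [1] * size  # 2-bit counters, start at weakly not-taken
        self.gshare = [1] * size
        self.chooser = [1] * size  # below 2 trusts bimodal, 2 or above trusts gshare
        self.history = 0           # global history of recent branch outcomes

    @staticmethod
    def _bump(counter, up):
        """Move a 2-bit saturating counter one step up or down."""
        return min(3, counter + 1) if up else max(0, counter - 1)

    def _indices(self, pc):
        return pc & self.mask, (pc ^ self.history) & self.mask

    def predict(self, pc):
        b, g = self._indices(pc)
        if self.chooser[b] >= 2:
            return self.gshare[g] >= 2
        return self.bimodal[b] >= 2

    def update(self, pc, taken):
        b, g = self._indices(pc)
        bimodal_ok = (self.bimodal[b] >= 2) == taken
        gshare_ok = (self.gshare[g] >= 2) == taken
        # Train the chooser only when the two components disagree.
        if bimodal_ok != gshare_ok:
            self.chooser[b] = self._bump(self.chooser[b], gshare_ok)
        self.bimodal[b] = self._bump(self.bimodal[b], taken)
        self.gshare[g] = self._bump(self.gshare[g], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


# A branch that alternates taken / not-taken defeats the bimodal counter,
# but the history-based component learns the pattern.
predictor = TournamentPredictor()
misses = 0
for i in range(200):
    outcome = (i % 2 == 0)
    if i >= 150 and predictor.predict(0x40) != outcome:
        misses += 1
    predictor.update(0x40, outcome)
print("mispredictions in the last 50 branches:", misses)
```

The chooser is what makes the scheme a "tournament": simple, strongly biased branches stay with the cheap per-branch counter, while pattern-dependent branches migrate to the history-based component.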
The entire PA6T CPU is heavily clock gated, with over 5,000 clock domains, which helps to keep typical per-core power at 4W and worst-case power at 7W.