The cnMIPS64 Cores
The cores in the OCTEON implement the integer MIPS64 Release 2 instruction set with several Cavium-specific extensions; the result is known as cnMIPS64. The Cavium extensions include arbitrary-precision integer multiplication instructions and modular exponentiation, as well as instructions for CRC, hashing and other common networking tasks.
The fully custom cnMIPS64 cores are in-order and two-way superscalar, with a 5-stage pipeline. The simple pipe handles ALU, shift and multiplication operations, while the complex pipe can handle all MIPS operations. Figure 2 below shows the processor pipeline.
Figure 2 – cnMIPS64 CPU Pipeline
Each CPU features a 32KB, 4-way associative L1 instruction cache and an 8KB, fully associative L1 data cache that is write-through to maintain coherency. There is also a 2KB, fully associative write-back buffer for merging and storing results in transit between the L1D cache and the unified, shared L2 cache. Each CPU includes a memory management unit (MMU) and a 32-entry unified translation look-aside buffer (TLB). Because each MIPS TLB entry maps a pair of adjacent pages, the TLB holds information for 64 pages of memory. Supported page sizes range from 4KB to 256MB. Branches are handled by a bimodal predictor with a 512-entry table of 2-bit saturating counters.
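To make the branch predictor description concrete, here is a minimal software sketch of a bimodal predictor with a 512-entry table of 2-bit saturating counters. Only the table size and counter width come from the text above. The indexing scheme (low bits of the branch address after dropping the two byte-offset bits), the initial counter value, and the names `BimodalPredictor` and `simulate` are assumptions made for illustration, not details of the actual cnMIPS64 hardware.

```python
class BimodalPredictor:
    """512-entry table of 2-bit saturating counters, indexed by branch address."""

    def __init__(self, entries=512):
        self.entries = entries
        # Assumption: every counter starts at 1 (weakly not-taken).
        self.table = [1] * entries

    def _index(self, pc):
        # Assumption: MIPS instructions are 4 bytes, so drop the two low
        # address bits and use the remaining low bits as the table index.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Saturating update: count up on taken, down on not-taken,
        # clamped to the 2-bit range 0..3.
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def simulate(outcomes, pc=0x1000):
    """Return how many outcomes of a single branch were predicted correctly."""
    predictor = BimodalPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct


if __name__ == "__main__":
    # A loop branch: taken nine times, then not taken on loop exit, three times over.
    loop_branch = ([True] * 9 + [False]) * 3
    print(simulate(loop_branch), "of", len(loop_branch), "predicted correctly")
```

Under these assumptions the loop branch in the example is mispredicted once while the counter warms up and once at each loop exit, so 26 of the 30 outcomes are predicted correctly. The two-bit hysteresis is what keeps a single loop exit from flipping the prediction for the next pass through the loop.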
The CPU cores also contain a security accelerator for cryptographic functions. The accelerator handles 3DES, 256-bit AES, MD5, SHA-1, SHA-256 and SHA-512, and lastly modular arithmetic (GF2 for the mathematically inclined) for RSA and Diffie-Hellman operations.
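As an illustration of the kind of work that modular arithmetic support speeds up, the following is a plain software sketch of modular exponentiation, the core operation in both RSA and Diffie-Hellman. It uses the textbook square-and-multiply method; it says nothing about how the OCTEON accelerator is implemented internally, and the function name `mod_exp` and the toy parameters are purely illustrative.

```python
def mod_exp(base, exponent, modulus):
    """Compute (base ** exponent) % modulus by right-to-left square-and-multiply."""
    if modulus <= 0 or exponent < 0:
        raise ValueError("modulus must be positive and exponent non-negative")
    if modulus == 1:
        return 0
    result = 1
    base %= modulus
    while exponent > 0:
        # Multiply in the current power of the base when this exponent bit is set.
        if exponent & 1:
            result = (result * base) % modulus
        # Square the base for the next bit of the exponent.
        base = (base * base) % modulus
        exponent >>= 1
    return result


if __name__ == "__main__":
    # Toy Diffie-Hellman exchange with deliberately tiny, insecure numbers.
    p, g = 23, 5                   # public prime modulus and generator
    a, b = 6, 15                   # private keys of the two parties
    public_a = mod_exp(g, a, p)    # first party's public value
    public_b = mod_exp(g, b, p)    # second party's public value
    # Both sides derive the same shared secret.
    assert mod_exp(public_b, a, p) == mod_exp(public_a, b, p)
    print("shared secret:", mod_exp(public_b, a, p))
```

Real RSA and Diffie-Hellman operands are thousands of bits long, so every multiply and reduce step in the loop is itself a multi-word operation. That is why wide multiplication and modular arithmetic are attractive targets for dedicated hardware.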
Each core is heavily clock gated, and the peak power for an individual core is 450mW when running at 600MHz and 1.2V. Of this 450mW, 35% goes to the memory, 24% to the issue logic, 16% to the multiplier, 13% to the instruction cache and 12% to the execution logic.
While multithreading was initially considered for the cnMIPS64 cores, it was ultimately discarded. Since packet processing is a data-parallel task, there is less ‘dead time’ for multithreading to utilize or latency to hide. Given this type of workload and OCTEON’s memory hierarchy, the designers’ simulations showed that multithreading was not efficient in terms of performance/watt, which is one of the major design goals for OCTEON.
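The percentages above can be turned into absolute figures with a little arithmetic. The short sketch below does that, and also derives the peak energy per clock cycle from the quoted 450mW and 600MHz. The input numbers come from the text; the names `PEAK_POWER_MW`, `BREAKDOWN_PERCENT`, `power_breakdown_mw` and `energy_per_cycle_nj` are illustrative, and the per-cycle figure is a simple peak-power estimate rather than a measured value.

```python
PEAK_POWER_MW = 450      # peak power of one core, from the text
CLOCK_HZ = 600e6         # 600MHz core clock, from the text

# Share of peak power per block, in percent, from the text.
BREAKDOWN_PERCENT = {
    "memory": 35,
    "issue logic": 24,
    "multiplier": 16,
    "instruction cache": 13,
    "execution logic": 12,
}


def power_breakdown_mw(peak_mw=PEAK_POWER_MW, shares=BREAKDOWN_PERCENT):
    """Convert percentage shares of peak power into milliwatts per block."""
    return {name: peak_mw * pct / 100 for name, pct in shares.items()}


def energy_per_cycle_nj(peak_mw=PEAK_POWER_MW, clock_hz=CLOCK_HZ):
    """Peak energy per clock cycle in nanojoules (power divided by frequency)."""
    return (peak_mw / 1000) / clock_hz * 1e9


if __name__ == "__main__":
    for name, mw in power_breakdown_mw().items():
        print(f"{name:18s} {mw:6.1f} mW")
    print(f"energy per cycle   {energy_per_cycle_nj():.2f} nJ")
```

The shares sum to 100%, so the per-block figures add back up to 450mW: 157.5mW for the memory, 108mW for the issue logic, 72mW for the multiplier, 58.5mW for the instruction cache and 54mW for the execution logic. At peak power this works out to roughly 0.75nJ per clock cycle.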