EPIC, Eastern Style
The E2k architecture is a variable-length VLIW design, similar in some ways to Explicitly Parallel Instruction Computing (EPIC), the label HP and Intel created for their upcoming IA-64 processors. Like the Merced/Itanium, the E2k is essentially a six-issue-wide machine. It divides its six execution units into two groups of three and replicates the 256-entry by 64-bit unified register file in both groups, much as the Alpha 21264/EV6 core arranges its four integer units into two clusters. Unlike the EV6, with its dual-ported L1 data cache, the E2k splits its L1 data cache in half and allocates one half to each execution unit group. The small 8 Kbyte, direct-mapped L1 data caches are reminiscent of the Alpha 21164 and allow a load-use latency of only two clock cycles. The organization of the E2k processor is shown in Figure 1.
Figure 1. Organization of Elbrus E2k
The E2k’s use of a unified integer and floating point register set and combined integer and floating point ALUs is similar to Sun’s MAJC VLIW processor design. All six ALUs support integer instructions, while two of the three ALUs in each cluster also support floating point add and multiply, along with integer multiply and MMX-type SIMD instructions. Four of the ALUs support load and store operations (including two of the floating point ALUs), while only one ALU supports floating point and integer divide.
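
The capability mix described above can be tabulated in a short Python sketch. The counts come from the text; the specific assignment of load/store and divide to particular ALU slots is a hypothetical choice made only so the totals work out, and the names `Op`, `CLUSTERS`, and `count_units` are illustrative rather than anything Elbrus defines.

```python
from enum import Flag, auto


class Op(Flag):
    INT = auto()         # basic integer operations
    FP = auto()          # FP add/multiply, integer multiply, MMX-type SIMD
    LOAD_STORE = auto()  # memory access
    DIVIDE = auto()      # floating point and integer divide


# Two clusters of three ALUs. Which slot carries which extra capability
# is an ASSUMPTION; only the totals are taken from the article.
CLUSTERS = [
    [Op.INT | Op.LOAD_STORE, Op.INT | Op.FP | Op.LOAD_STORE, Op.INT | Op.FP],
    [Op.INT | Op.LOAD_STORE, Op.INT | Op.FP | Op.LOAD_STORE,
     Op.INT | Op.FP | Op.DIVIDE],
]


def count_units(capability):
    """Count the ALUs across both clusters that support `capability`."""
    return sum(1 for cluster in CLUSTERS for alu in cluster if capability in alu)


if __name__ == "__main__":
    for op in Op:
        print(op.name, count_units(op))
```

A compiler scheduling for such a machine has to respect these per-slot limits: for example, no more than four memory operations and one divide can be packed into a single wide instruction under this model.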
The E2k’s register file is windowed in a circular fashion similar to the scheme used in the AMD 29K architecture. This windowing adds an extra stage to the E2k’s execution pipeline, shown in Figure 2, which adds an offset value to the logical register numbers specified in instructions to generate the physical address used to access the register file. Like the autonomous register stack engine (RSE) in IA-64, the E2k adjusts its register window automatically on procedure entry and exit, along with any data movement to or from memory required to handle register window overflows or underflows.
Figure 2. Basic E2k Pipeline
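
The register renaming step described above amounts to a modular addition. The following Python sketch illustrates the idea under simplifying assumptions: the entire 256-entry file is treated as one circular window, the window base moves by a caller-supplied frame size, and spilling to memory on overflow or underflow is not modeled. The names `physical_register` and `WindowedRegisterFile` are hypothetical.

```python
REGISTER_FILE_SIZE = 256  # entries, per the article


def physical_register(window_base, logical_reg):
    """Map a logical register number to a physical one by adding the
    current window offset and wrapping around the circular file."""
    return (window_base + logical_reg) % REGISTER_FILE_SIZE


class WindowedRegisterFile:
    """Toy model: no overflow/underflow spill to memory (an assumption)."""

    def __init__(self):
        self.regs = [0] * REGISTER_FILE_SIZE
        self.base = 0

    def read(self, logical_reg):
        return self.regs[physical_register(self.base, logical_reg)]

    def write(self, logical_reg, value):
        self.regs[physical_register(self.base, logical_reg)] = value

    def call(self, frame_size):
        # Procedure entry: slide the window forward past the caller's frame.
        self.base = (self.base + frame_size) % REGISTER_FILE_SIZE

    def ret(self, frame_size):
        # Procedure exit: slide the window back to the caller's frame.
        self.base = (self.base - frame_size) % REGISTER_FILE_SIZE


if __name__ == "__main__":
    rf = WindowedRegisterFile()
    rf.write(3, 7)     # caller's logical register 3
    rf.call(8)
    rf.write(3, 9)     # callee's logical register 3 is a different entry
    rf.ret(8)
    print(rf.read(3))  # caller's value is intact
```

Because the same logical register number resolves to a different physical entry after each call, a callee can use its registers without the caller having to save its own first. The cost is the extra addition on every register access, which is the additional pipeline stage the article mentions.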
To reduce branch overhead, the E2k implements branch operations in two parts. A ‘prepare to branch’ instruction, which calculates the target address and initiates instruction prefetch, can be placed ahead of a conditional branch instruction in the code sequence. This helps reduce the pipeline bubble associated with a mispredicted conditional branch, or a branch to a calculated address, from 8 cycles to 4. The E2k supports up to three outstanding prepare to branch operations. This technique is similar to the split branch feature in the Hitachi SH-5 64-bit processor for embedded control applications.
The E2k supports predicated instruction execution in a fashion similar to IA-64, albeit in a more restricted form. Predication allows conditional branches to be eliminated through compiler code generation techniques such as if-conversion. The E2k supports 32 predicate bits, each of which may be used in either true or inverted form to control instruction execution. Up to four predicates can be specified within a single VLIW instruction.
There is also support for explicit speculative code execution. Each syllable within a VLIW instruction has a special flag bit that is explicitly set to invoke speculative execution of that operation.
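
If-conversion, mentioned above, may be easier to picture with a small example. The Python sketch below contrasts a branching routine with an if-converted equivalent in which one compare sets a predicate and both arms are issued as guarded operations, one on the true sense of the predicate and one on the inverted sense. It models the general technique only, not actual E2k instruction encoding; the helper `predicated` and both function names are hypothetical.

```python
def abs_diff_branchy(a, b):
    """Original form: control flow depends on a conditional branch."""
    if a > b:
        return a - b
    return b - a


def predicated(pred, inverted, op, old_value):
    """Model a guarded operation: run `op` only when the predicate,
    taken in true or inverted sense, holds; otherwise keep the old value."""
    guard = (not pred) if inverted else pred
    return op() if guard else old_value


def abs_diff_if_converted(a, b):
    """If-converted form: no branch, one predicate, two guarded operations."""
    p0 = a > b                                   # compare sets a predicate bit
    r = 0
    r = predicated(p0, False, lambda: a - b, r)  # executes when p0 is true
    r = predicated(p0, True, lambda: b - a, r)   # executes when p0 is false
    return r


if __name__ == "__main__":
    print(abs_diff_branchy(3, 10), abs_diff_if_converted(3, 10))
```

On a wide machine the two guarded operations can share one instruction word, so the branch and its potential misprediction penalty disappear, at the price of issuing an operation whose result is discarded.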