Decode and Group
IBM’s zArchitecture is a big-endian, byte-addressable, 64-bit descendant of the original S/360 ISA. While well thought out, it is definitely a CISC instruction set and relies heavily on millicode (the IBM term for microcode). Millicode is responsible for handling system functions such as communicating with I/O channels, managing the page tables, interrupts, resets and certain RAS features. It is written in assembly using branches and complex control flow and has full access to architectural state, in addition to special millicode-only resources.
The ISA itself shares several features with x86, including variable-length and register-memory instructions, so the decoding can be fairly complicated. Incidentally, the IBM mainframe terminology for memory is ‘storage’. So any instruction that explicitly touches memory (loads and stores) or implicitly accesses memory (register-memory) is often referred to as a storage instruction.
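To make the variable-length point concrete, here is a minimal sketch of how instruction boundaries can be found in a zArchitecture byte stream. It relies on a property of the architecture that the article does not spell out: the top two bits of the first opcode byte give the length (00 for 2B, 01 or 10 for 4B, 11 for 6B). The function names are hypothetical, and this is an illustration of the encoding rule, not a model of the z196 pre-decode hardware.

```python
from typing import List


def instruction_length(first_byte: int) -> int:
    """Return the instruction length in bytes, given the first opcode byte."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must fit in one byte")
    top_bits = first_byte >> 6
    # 00 -> 2 bytes, 01 and 10 -> 4 bytes, 11 -> 6 bytes
    return (2, 4, 4, 6)[top_bits]


def split_instructions(code: bytes) -> List[bytes]:
    """Walk a byte stream and slice it into whole instructions."""
    out = []
    pos = 0
    while pos < len(code):
        length = instruction_length(code[pos])
        if pos + length > len(code):
            raise ValueError("truncated instruction at offset %d" % pos)
        out.append(code[pos:pos + length])
        pos += length
    return out


if __name__ == "__main__":
    # Three instructions whose first bytes are 0x18, 0x58 and 0xE3.
    stream = bytes([0x18, 0x12,
                    0x58, 0x10, 0x20, 0x00,
                    0xE3, 0x10, 0x20, 0x00, 0x00, 0x04])
    for ins in split_instructions(stream):
        print(len(ins), ins.hex())
```

Unlike x86, the length is known after looking at a single byte, which keeps boundary detection far simpler even though the instructions themselves are complex.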
The overall design of the z196 is based around decoding complex instructions and cracking them into simpler uops which are scheduled and executed by the hardware. This is very similar to Intel’s approach with the Pentium Pro and the techniques used in the POWER4-7. The zArchitecture has 1079 instructions (x86 similarly has more than 1000 instructions) which form several different decoding classes, outlined below:
- 75 instructions, which are only used inside of millicode flows
- 219 instructions, which are executed by millicode
- 24 instructions, which are conditionally executed by millicode
- 16 instructions, which are memory-memory and handled by the load-store sequencer
- 211 instructions, which are decoded into 2 or more uops
- 269 instructions, which are register-memory and decode into 1 uop, but issue as 2 uops
- 340 instructions, which are RISC-like (i.e. register-register) and decode and issue as 1 uop
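As a quick sanity check on the breakdown above, the short sketch below tallies the seven classes. The counts are taken directly from the list; the dictionary keys and the function name are just labels chosen for this example. Note that the seven classes sum to 1154, which is 75 more than the 1079 quoted. One possible reading, which is an assumption and not something stated in the text, is that the 1079 figure excludes the 75 millicode-only instructions.

```python
# Counts copied from the list of decoding classes; the labels are shorthand.
DECODE_CLASSES = {
    "millicode-only (used inside millicode flows)": 75,
    "executed by millicode": 219,
    "conditionally executed by millicode": 24,
    "memory-memory (load-store sequencer)": 16,
    "decoded into 2 or more uops": 211,
    "register-memory (1 uop decoded, 2 issued)": 269,
    "register-register (1 uop decoded and issued)": 340,
}


def class_shares(classes):
    """Return each class's share of the total, in percent."""
    total = sum(classes.values())
    return {name: 100.0 * count / total for name, count in classes.items()}


if __name__ == "__main__":
    total = sum(DECODE_CLASSES.values())
    print("total across all classes:", total)
    # Assumption: subtracting the millicode-only class recovers the quoted figure.
    print("excluding millicode-only:", total - 75)
    for name, pct in class_shares(DECODE_CLASSES).items():
        print("%5.1f%%  %s" % (pct, name))
```

By this tally, the two single-uop decode classes (register-memory and register-register) together make up a little over half of all instructions.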
The z196 includes a variety of new instructions, taking advantage of the flexibility of a variable-length CISC architecture. There are extensions that address the high 4B word of a register independently, effectively treating the register as 2 separate word length registers for arithmetic, rotates, compares and storage instructions. There are also two types of new atomic instructions. The first variety loads, executes and stores the result back to memory, a classic read-modify-write. The execution operations available are addition, logical AND, XOR and OR. The second type of atomic is a load pair instruction, which loads two values from memory to two separate registers, and can detect interlocks with condition codes. IBM also introduced conditional loads, stores and register copies based on condition codes to eliminate unpredictable branches. Lastly, they added a number of non-destructive versions of existing instructions, a popcount and FP to integer conversions.
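The behavior of a few of these extensions can be sketched in software. The following is a toy model only: the function and class names are hypothetical rather than IBM mnemonics, the lock merely stands in for the hardware interlock, and returning the old value from the read-modify-write is a modeling assumption.

```python
import threading

MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

# The four operations the article lists for the atomic read-modify-write.
OPS = {
    "add": lambda a, b: a + b,
    "and": lambda a, b: a & b,
    "xor": lambda a, b: a ^ b,
    "or": lambda a, b: a | b,
}


def add_high_word(reg: int, addend: int) -> int:
    """Add to the high 4 bytes of a 64-bit register; the low word is untouched."""
    high = ((reg >> 32) + addend) & MASK32  # wraps at 32 bits
    return (high << 32) | (reg & MASK32)


class Storage:
    """Toy memory keyed by address; values are 64-bit."""

    def __init__(self):
        self._words = {}
        self._lock = threading.Lock()

    def load(self, addr: int) -> int:
        return self._words.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        self._words[addr] = value & MASK64

    def load_and_op(self, addr: int, operand: int, op: str) -> int:
        """Atomically load, apply op, store the result; return the old value."""
        with self._lock:
            old = self._words.get(addr, 0)
            self._words[addr] = OPS[op](old, operand) & MASK64
            return old


def load_on_condition(reg: int, storage: Storage, addr: int, condition: bool) -> int:
    """Conditional load: the register changes only if the condition holds."""
    return storage.load(addr) if condition else reg


if __name__ == "__main__":
    print(hex(add_high_word(0x00000001_FFFFFFFF, 1)))
    mem = Storage()
    mem.store(0x100, 10)
    print(mem.load_and_op(0x100, 5, "add"), mem.load(0x100))
    print(load_on_condition(7, mem, 0x100, False))
```

The conditional load shows why these instructions help with unpredictable branches: the choice between the old and new value becomes a data dependency, with no control flow to mispredict.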
Like other IBM microprocessor designs such as the POWER4-7, the decoding does not occur in isolation. Instructions are decoded into uops and then grouped together according to rules based on the microarchitecture of the machine. The z196 is wider and more flexible than the dual-issue in-order z10 processor, as shown in Figure 2.
Figure 2. z196 Instruction Decode and Comparison
The actual decoding starts by receiving 1-3 pre-decoded instructions (up to 6B in length) and sending them to the decoders. In the z196, there are three decoders which can each emit a single uop per cycle. The group formation rules are complicated, but for the z196, the critical ones are fairly simple. A group can span multiple instructions, but is limited to a maximum of 3 uops. A branch automatically ends a group, to simplify rolling back mispredictions. Additionally, any complex instruction which decodes into two or more uops forms a group by itself.
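The three critical grouping rules can be expressed as a short greedy pass over the instruction stream. This is a minimal sketch under stated assumptions, not IBM's actual logic: the `Instr` type and `form_groups` function are hypothetical, every instruction is assumed to decode into 1 to 3 uops, and the many other grouping rules the hardware applies are ignored.

```python
from dataclasses import dataclass
from typing import List

MAX_UOPS_PER_GROUP = 3


@dataclass
class Instr:
    name: str
    uops: int = 1            # number of uops the instruction decodes into
    is_branch: bool = False


def form_groups(instrs: List[Instr]) -> List[List[str]]:
    """Greedily pack instructions into groups using the three rules above."""
    groups: List[List[str]] = []
    current: List[str] = []  # names of instructions in the open group
    used = 0                 # uops already in the open group

    def close() -> None:
        nonlocal current, used
        if current:
            groups.append(current)
        current, used = [], 0

    for ins in instrs:
        if not 1 <= ins.uops <= MAX_UOPS_PER_GROUP:
            raise ValueError("sketch assumes 1 to 3 uops per instruction")
        if ins.uops >= 2:
            # A multi-uop instruction forms a group by itself.
            close()
            groups.append([ins.name])
            continue
        current.append(ins.name)
        used += 1
        # A branch ends the group, as does reaching the 3-uop limit.
        if ins.is_branch or used == MAX_UOPS_PER_GROUP:
            close()
    close()
    return groups


if __name__ == "__main__":
    program = [Instr("a"), Instr("b"), Instr("c"), Instr("d"),
               Instr("br", is_branch=True), Instr("e"),
               Instr("cx", uops=2), Instr("f")]
    for group in form_groups(program):
        print(group)
```

In this example the branch closes its group after only two uops, and the complex instruction `cx` forces the preceding group to close early, which illustrates how branches and cracked instructions reduce the effective dispatch width.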
The groups are formed to ensure hazard free execution and simplify later parts of the pipeline. Each cycle, a single group of uops can be sent to the scheduling queues in the out-of-order pipeline. IBM refers to this process as ‘dispatching’, but others use the term ‘issuing’.
When a millicode entry is detected, the control flow is redirected. The decoders will inject a sequence of 9 uops to enter millicode and begin execution. Once millicode has exited, the decoders restart from the next instruction in the original control flow.
The z10 decode is fairly similar, although only two instructions wide. However, the z10 actually separates decoding from grouping, with issue queues between the two stages. As a result, the z10 group formation is intimately related to, and synchronous with, the scheduling and will be described in the next section.
Incidentally, out-of-order execution substantially simplifies the decoding and group formation by automatically handling many scheduling hazards. The z196 pipeline includes 2 cycles for decoding and 2 cycles for grouping and dispatching. In contrast, the z10 decoding takes 3 cycles, and the grouping is also 3 cycles.