Article: AMD's Jaguar Microarchitecture
By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), April 2, 2014 8:11 am
Room: Moderated Discussions
SHK (nomail.delete@this.mail.com) on April 2, 2014 6:45 am wrote:
[snip]
> Yes, in all the AMD processors i'm aware cmovc has 1 cycle latency, which
> is the "natural" latency for a cmov/select type of instruction.
> What i find strange is in that all Intel's cpu (except P4, which was way worse) cmovcc
> is 2 cycles. Agner's explaination is that all the ins with more than 2 sources are cracked
> in two u-ops and cmov has 3 sources (something like dst:=select(src1,src2,eflags))
> which seems reasonable, but doesn't explain why AMD's cpu doesn't have this problem.
> My 2cent is that cmovcc could be implemented with only 2 reg sources (dst:=cmov(src1,eflags))
> and if the condition fails is turned to a no-op. But i've no proof of this.
With ordinary register renaming using a Register Alias Table (RAT), this would not be possible, since dependent operations read the name of the destination register from the RAT at rename time, before the condition is known. If the old name is kept in the RAT (to support the no-op case) and the move is performed, all dependent operations would have the wrong name; if a new name is inserted (as for other instructions with a register destination) and the move is not performed, all dependent operations would likewise have the wrong name.
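The RAT argument can be shown with a toy model. This is only a sketch under my own assumptions: the register names (p0, p1, p2), the two policy labels and the function names are all made up for illustration, and the model looks only at which physical name a dependent instruction picks up at rename.

```python
def consumer_reads(policy):
    """Physical name a later instruction gets from the RAT for rax."""
    rat = {"rax": "p0", "rbx": "p1"}  # architectural -> physical (toy state)
    if policy == "new_name":
        rat["rax"] = "p2"  # cmov allocated a fresh destination at rename
    elif policy != "keep_old":
        raise ValueError(policy)
    return rat["rax"]


def value_lives_in(policy, move_performed):
    """Physical name that actually holds rax after a two-source cmov."""
    if not move_performed:
        return "p0"  # treated as a no-op: the old register is still correct
    # Move performed: without a new name the value exists only as rbx's p1.
    return "p2" if policy == "new_name" else "p1"


def wrong_name_cases():
    """(policy, move_performed) pairs where the consumer reads the wrong name."""
    return [
        (policy, performed)
        for policy in ("keep_old", "new_name")
        for performed in (True, False)
        if consumer_reads(policy) != value_lives_in(policy, performed)
    ]


print(wrong_name_cases())  # [('keep_old', True), ('new_name', False)]
```

Each policy fails for exactly one outcome of the condition, which is the problem with choosing the name before the flags are known.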
Renaming using a priority CAM would avoid this problem. While this kind of design scales very poorly for general renaming, it might not be unthinkable for just conditional moves. However, I suspect that such an irregularity would not be worthwhile (in terms of complexity/area/power compared to benefit).
Special-casing the handling of flags might allow some simplification of issuing conditional operations.
Prediction of which of the three operands will not be the last available would also allow a two-source issue queue to handle such operations. I suspect that the common case for conditional move and add with carry is for the old value to be available no later than the later of the flags and the other operand, so mispredictions would not be common.
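A minimal sketch of that prediction idea, assuming the queue entry always predicts that the old destination value is not the last to arrive and so tracks wakeup only for the flags and the other source. The function name and the "just wait" handling of a misprediction are illustrative assumptions; a real design would need some replay mechanism that is not modeled here.

```python
def issue_cmov(old_ready, flags_ready, src_ready):
    """Return (issue_cycle, mispredicted) for a cmov in a two-source entry.

    Arguments are the cycles at which each operand becomes available.
    """
    # The entry only watches the two operands predicted to arrive late.
    predicted_issue = max(flags_ready, src_ready)
    # The prediction fails only if the old value arrives after both of them.
    mispredicted = old_ready > predicted_issue
    # Toy recovery: simply wait for the old value as well.
    issue_cycle = max(predicted_issue, old_ready)
    return issue_cycle, mispredicted


print(issue_cmov(old_ready=1, flags_ready=3, src_ready=2))  # (3, False)
print(issue_cmov(old_ready=5, flags_ready=3, src_ready=2))  # (5, True)
```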
Using virtual physical register renaming (where the name in the RAT can be a non-physical register name which is then translated a second time) would allow conditional operations to be treated as no-ops when the condition fails. However, making one condition a no-op removes the result forwarding benefit that would come from reading both old and alternative values for the conditional move instruction, requiring dependent operations to read the register file. Also, without ordinary result forwarding, instruction issue would become more complex, since only in this special case is the register to read not known until after the instruction providing the input operands has completed. (Virtual physical registers were conceived to reduce the need for physical registers, since renamed but uncompleted operations would not need physical registers. They can also be used to exploit banking in the register file, since bank selection for writes can be done at instruction completion and so generally avoid bank conflicts.)
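To make the two-level translation concrete, here is a rough sketch. The class, its method names and the tag format are all hypothetical, and it ignores freeing registers, recovery and forwarding. The RAT hands out virtual tags at rename, and a second table binds each tag to a physical register only at completion, so a conditional move whose condition fails can bind its tag to the old physical register.

```python
class VPRenamer:
    """Toy virtual-physical renamer (illustrative names only)."""

    def __init__(self):
        self.rat = {}        # architectural register -> virtual tag
        self.vp_map = {}     # virtual tag -> physical register, set at completion
        self.next_tag = 0
        self.next_phys = 0

    def rename_dest(self, arch):
        """Give arch a new virtual tag; return (new_tag, previous_tag)."""
        old_tag = self.rat.get(arch)
        tag = f"v{self.next_tag}"
        self.next_tag += 1
        self.rat[arch] = tag
        return tag, old_tag

    def complete_write(self, tag):
        """Result produced: allocate a physical register only now."""
        phys = f"p{self.next_phys}"
        self.next_phys += 1
        self.vp_map[tag] = phys
        return phys

    def complete_noop(self, tag, old_tag):
        """Condition failed: point the tag at the old physical register."""
        self.vp_map[tag] = self.vp_map[old_tag]
        return self.vp_map[tag]


r = VPRenamer()
t0, _ = r.rename_dest("rax")
r.complete_write(t0)               # rax value lives in p0
t1, old = r.rename_dest("rax")     # cmov renamed like any other writer
r.complete_noop(t1, old)           # condition fails: v1 -> p0, nothing allocated
print(r.rat["rax"], r.vp_map[r.rat["rax"]])  # v1 p0
```

A consumer renamed after the cmov holds tag v1 and learns which physical register to read only once the cmov completes, which is the added issue complexity described above.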
Agner Fog indicates that since AMD uses macro-operations, cmove is only one operation: "A macro-operation can have any number of input dependencies. This means that instructions with more than two input dependencies, such as MOV [EAX+EBX],ECX, ADC EAX,EBX and CMOVBE EAX,EBX, generate only one macro-operation, while they require two micro-operations on Intel processors." (p. 163, The microarchitecture of Intel, AMD and VIA CPUs, 2013-09-04 version)
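As an illustration of the one-operation versus two-operation difference, the sketch below splits a three-input select into two steps that each read at most two inputs. The particular split (a temporary carrying the candidate value plus a condition bit) and the one-cycle-per-operation latency are my assumptions for the example, not a description of how any Intel or AMD design actually does it.

```python
def cmov_one_op(old, src, cond):
    """Single operation with three inputs: old value, source, flags."""
    return src if cond else old


def cmov_two_ops(old, src, cond):
    """Hypothetical split into two operations with two inputs each."""
    # Operation 1 reads (src, flags): a temporary holding the candidate
    # value together with the evaluated condition bit.
    tmp = (cond, src)
    # Operation 2 reads (tmp, old): pick the candidate or keep the old value.
    taken, value = tmp
    return value if taken else old


# Assumed latencies: one cycle per operation, and the two operations are
# dependent, so the split form takes two cycles end to end.
LATENCY = {"one_op": 1, "two_ops": 1 + 1}

for cond in (True, False):
    assert cmov_one_op(10, 20, cond) == cmov_two_ops(10, 20, cond)
print(LATENCY)  # {'one_op': 1, 'two_ops': 2}
```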