Article: AMD's Jaguar Microarchitecture
By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), April 2, 2014 11:48 am
Room: Moderated Discussions
SHK (nomail.delete@this.mail.com) on April 2, 2014 6:45 am wrote:
>
> Yes, in all the AMD processors i'm aware cmovc has 1 cycle latency, which
> is the "natural" latency for a cmov/select type of instruction.
Are you sure it's the "cmov" that really has a 2-cycle latency?
Look at "adc" and "sbb" too - they are also documented as two-cycle latency instructions for Sandy Bridge, afaik.
So maybe it's not cmov/adc/sbb themselves that are two cycles. It might just be that the eflags generation is delayed by one cycle relative to the result of the previous instruction.
The most common use of flags is obviously for conditional branches, and for them it doesn't (much) matter if the eflags value is generated one cycle later, since the timing-critical part is the prediction, not the actual condition value. So at most it means that the verification of the prediction is delayed by a cycle - which might make the mispredict penalty higher, but it all probably depends very much on just how the pipelining works.
But adc/sbb/cmov obviously really care synchronously about the flags value, so if the condition code generation is done one cycle later in the pipeline, that would appear as a two-cycle latency for the instruction that uses it.
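To make that argument concrete, here is a toy timing model in Python. It is only a sketch of the hypothesis above, not a description of any real pipeline: the function name `chain_spacing` and the parameter `flag_delay` are made up for illustration, every ALU op is assumed to take one cycle, and each instruction in the chain depends on the one before it.

```python
def chain_spacing(n, consumes_flags, flag_delay):
    """Steady-state cycles between results in a chain of n dependent ALU ops.

    Hypothetical model: each op produces its register result one cycle
    after it starts, and its flags `flag_delay` cycles after the result.
    An op that consumes flags (adc/sbb-style) cannot start until the
    previous op's flags are available.
    """
    result_ready = 0   # cycle at which the previous register result is ready
    flags_ready = 0    # cycle at which the previous flags are ready
    prev_result = 0
    for _ in range(n):
        if consumes_flags:
            start = max(result_ready, flags_ready)
        else:
            start = result_ready
        prev_result = result_ready
        result_ready = start + 1                  # single-cycle ALU
        flags_ready = result_ready + flag_delay   # flags may lag the result
    return result_ready - prev_result


# Plain add chain: only the register result matters.
print(chain_spacing(100, consumes_flags=False, flag_delay=1))  # 1
# adc-style chain with flags generated one cycle late.
print(chain_spacing(100, consumes_flags=True, flag_delay=1))   # 2
# adc-style chain if flags arrived together with the result.
print(chain_spacing(100, consumes_flags=True, flag_delay=0))   # 1
```

In this model the ALU itself is always single-cycle. The flag-consuming chain only looks like it has two-cycle latency because each op waits for flags that show up a cycle after the result.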
Now, the odd case out is "setcc", which is documented to be a single-cycle latency instruction, so maybe it's about the three-input thing rather than the placement of eflags generation in the pipeline. That said, "setcc" is simple enough to possibly be done using some special bypass since it doesn't need an ALU or anything like that - the result coming in the same cycle as eflags is generated doesn't sound impossible.
Linus