Article: AMD's Jaguar Microarchitecture
By: SHK (nomail.delete@this.mail.com), April 2, 2014 5:45 am
Room: Moderated Discussions
> cmov on Jaguar is decoded as 1 uop, w/ 1 cycle latency, can be executed on either ALU pipe.
> Reference AMD's optimization table:
> http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/AMD64_16h_InstrLatency_1.1.xlsx
> Or search out Agner Fog's excellent optimization manuals.
>
> -rupley
> speaking only for myself
Yes, on all the AMD processors I'm aware of, cmovcc has 1-cycle latency, which is the "natural" latency for a cmov/select type of instruction.
What I find strange is that on all of Intel's CPUs (except P4, which was way worse) cmovcc takes 2 cycles. Agner's explanation is that all instructions with more than 2 sources are cracked into two uops, and cmov has 3 sources (something like dst:=select(src1,src2,eflags)).
That seems reasonable, but it doesn't explain why AMD's CPUs don't have this problem. My 2 cents is that cmovcc could be implemented with only 2 register sources (dst:=cmov(src1,eflags)), and if the condition fails it is turned into a no-op. But I have no proof of this.
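To make the two readings concrete, here is a small C sketch of the architectural behaviour only. The function names are made up for illustration, and it says nothing about how either vendor's hardware actually implements cmovcc; it just shows that a 3-input select and a 2-input "write or do nothing" give the same visible result.

```c
#include <assert.h>
#include <stdint.h>

/* Reading A: three inputs, dst := select(old_dst, src, flags).
   The old destination value is an explicit source operand. */
static uint64_t cmov_select(uint64_t old_dst, uint64_t src, int cond)
{
    return cond ? src : old_dst;
}

/* Reading B: two inputs, dst := cmov(src, flags).
   If the condition fails, nothing is written (acts as a no-op). */
static void cmov_predicated(uint64_t *dst, uint64_t src, int cond)
{
    if (cond)
        *dst = src;
}

/* Returns 1 if both readings leave the same value in dst for a
   failing and a passing condition, 0 otherwise. */
static int models_agree(uint64_t dst0, uint64_t src)
{
    for (int cond = 0; cond <= 1; cond++) {
        uint64_t a = cmov_select(dst0, src, cond);
        uint64_t b = dst0;
        cmov_predicated(&b, src, cond);
        if (a != b)
            return 0;
    }
    return 1;
}
```

For example, models_agree(1, 2) returns 1: both readings keep 1 when the condition fails and produce 2 when it passes. The difference is only in how many values the operation has to read, which is the part I can't prove either way.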