Core uarch instruction latencies

By: David Kanter (dkanter.delete@this.realworldtech.com), June 2, 2006 12:20 am
Room: Moderated Discussions
I ran across a very interesting post at Ace's Hardware. While I usually don't like to cross-post information, I think there are some folks here who would find this interesting. The OP is a fellow named BLL, and just as a warning, some of this stuff may come out a little misaligned:

Sorry for this long post, but I think it contains some interesting information about Intel's NGMA architecture for assembly programmers.

- These thoughts are based on an EVEREST Instruction Latency dump. If you don't believe in software measurements, wait for the official Intel guide and hope it will be more detailed than the current one. ;) (You can create such a dump in EVEREST by right-clicking on the bottom status bar of the EVEREST main window -> CPU Debug -> Instruction Latency Dump. It works fully in the trial version too.)
- In this dump latency means the time that it takes for the next dependent same-type instruction to start. Throughput means the time that it takes for the next independent same-type instruction to start:



Lat:  add rax, rax      TP:  add rax, rax
      add rax, rax           add rbx, rbx
      add rax, rax           add rcx, rcx
      ...


- These values are measured with long chains of instructions (~6000), so they are sustained rates; peak values can be higher.
- Some instructions never depend on a previous one: they use different source and destination register sets or have a memory operand, so it is not always possible to measure the instruction latency.
- If the TP value is less than 1, more than one same-type instruction can start in the same clock cycle.
- For some x87 instruction combinations (and for some SSE ones in 32b mode) the 8 registers aren't enough to measure the instruction throughput.
- It is a measurement, not a constant table, so some values are not rounded.
- Keep in mind that even though instruction latency and throughput are important, they may not directly reflect CPU performance.
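To make the latency/throughput definitions above concrete, here is a rough sketch in Python. It assumes a very simple cost model (a dependent chain costs n times the latency, an independent chain costs n times the throughput value) and ignores startup and drain effects. The function names are made up for illustration; the numbers are the ADD r64, r64 figures and one ns/cycle pair from the dump.

```python
# Simple cost model for the two kinds of chains described above.
# Assumption: n dependent instructions cost n * latency cycles, and
# n independent instructions cost n * throughput cycles. Real hardware
# adds startup and drain effects that this sketch ignores.

def chain_cycles(n, cycles_per_instruction):
    """Estimated cycles for a chain of n same-type instructions."""
    return n * cycles_per_instruction

def instructions_per_clock(tp_cycles):
    """Sustained instructions per clock implied by a TP value."""
    return 1.0 / tp_cycles

def implied_clock_ghz(ns, cycles):
    """Clock frequency implied by one 'ns = cycles' pair from the dump."""
    return cycles / ns

n = 6000  # approximate chain length mentioned in the post
dependent = chain_cycles(n, 1.0)     # ADD r64, r64 latency: 1.0c
independent = chain_cycles(n, 0.33)  # ADD r64, r64 throughput: 0.33c

print(f"dependent chain:   ~{dependent:.0f} cycles")
print(f"independent chain: ~{independent:.0f} cycles")
print(f"ADD per clock:     ~{instructions_per_clock(0.33):.1f}")
print(f"implied clock:     ~{implied_clock_ghz(0.75, 2.0):.2f} GHz")
```

Since the ns and cycle columns are both rounded, the implied clock shifts a little depending on which row you pick, so treat it as an estimate only.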
About NGMA:
-- It's a bit surprising that the TP of the simplest integer instructions is 0.33. It was expected to be 0.25, but that doesn't seem to be possible for same-type instructions;
-- NGMA inherited the 1-clock LEA's of the P6 descendants;
-- NGMA can do 2 shifts per clock as opposed to Yonah's 1;
-- NGMA can do 1 rotation per clock just as Yonah;
-- NGMA has a 3lat/1tp 32b 2- and 3-operand IMUL as opposed to Yonah's 4lat/1tp. The 32b 1-operand (I)MUL remained 4 and 5 cycles long. The 64b 2- and 3-operand IMUL is 5l/2tp; the 1-operand (I)MUL (which generates 128b results) is 7l/7tp;
-- NGMA inherited the very fast 32b (I)DIV from Dothan and Yonah. The 64b DIV is just 31 clocks - awesome! (On Prescott a 64b DIV takes ~107 clk.)
-- NGMA x87 capabilities are very similar to the P6 descendants: 3l/1tp FADD, 5l/2tp FMUL. NGMA can handle just 1 FXCH per clock as opposed to Yonah's 3.
-- FDIV and FSQRT latencies are exactly the same as the ancient PPro's.
-- NGMA inherited from P6 the handling of all the special cases (where the results are trivial) - so this new core can recognize most speedup possibilities. (Netburst could win almost nothing from special cases.)
-- NGMA has a fantastic SSEn implementation; there isn't any difference between the packed and scalar versions' latency and throughput (except for the horizontal ADD/SUB in SSE3), and even the slow DIV and SQRT act similarly.
-- NGMA supports the handling of all the special cases for DIV/SQRTSS|PS|SD|PD too.
-- It seems -- at least in this revision -- that NGMA doesn't support the fused multiply-add.
-- NGMA can do 3 xmm register moves per clock with a latency of 1. "MOVAPS/MOVAPD/MOVDQA reg, reg" are as free as FXCH used to be in the good old days.
-- NGMA can multiply double values with 1tp - MULPD is 5l/1tp as opposed to Yonah's 5l/4tp and Prescott's 7l/2tp.
-- NGMA can recognize the zeroing "XORPS/XORPD xmm, xmm" idiom too; these are always independent instructions, like the integer "XOR reg, reg".
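As a quick worked example of what the MULPD figures quoted above mean in practice, the Python sketch below converts each throughput value into double-precision multiplies per clock. It only assumes that a 128-bit MULPD operates on two doubles; the tp values are the ones given in the post, and the helper name is made up for illustration.

```python
# Packed double-precision multiply rate implied by the MULPD figures above.
# Assumption: one 128-bit MULPD processes two doubles; tp is cycles per
# instruction, as defined earlier in the post.
DOUBLES_PER_MULPD = 2

# Throughput values (cycles) as quoted in the post.
mulpd_tp = {"NGMA": 1.0, "Yonah": 4.0, "Prescott": 2.0}

def multiplies_per_clock(tp_cycles):
    """Double-precision multiplies started per clock for a given tp."""
    return DOUBLES_PER_MULPD / tp_cycles

rates = {cpu: multiplies_per_clock(tp) for cpu, tp in mulpd_tp.items()}
for cpu, rate in rates.items():
    print(f"{cpu}: {rate:.2f} double multiplies per clock")
```

These are per-clock rates only; the chips run at different clock speeds, so this is not a direct performance comparison.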



Family: 6 Modell: 0f Stepping: 4

I 0 X86 :NOP L: TP: 0.12ns= 0.33c

I 1 X86 :0x66 NOP L: TP: 0.12ns= 0.33c

I 2 X86 :2 0x66 NOP L: TP: 0.12ns= 0.33c

I 3 X86 :3 0x66 NOP L: TP: 0.12ns= 0.33c

I 4 X86 :4 0x66 NOP L: TP: 0.12ns= 0.33c

I 5 X86 :5 0x66 NOP L: TP: 0.14ns= 0.38c

I 6 X86 :6 0x66 NOP L: TP: 0.16ns= 0.43c

I 7 X86 :7 0x66 NOP L: TP: 0.19ns= 0.50c

I 8 X86 :8 0x66 NOP L: TP: 0.21ns= 0.56c

I 9 X86 :9 0x66 NOP L: TP: 0.23ns= 0.63c

I 10 X86 :MOV r8, imm8 L: 0.12ns= 0.3c TP: 0.12ns= 0.33c

I 11 X86 :MOV r16, imm16 L: 0.66ns= 1.8c TP: 0.66ns= 1.75c

I 12 X86 :MOV r32, imm32 L: 0.12ns= 0.3c TP: 0.12ns= 0.33c

I 13 AMD64 :MOV r64, imm64 L: 0.23ns= 0.6c TP: 0.23ns= 0.63c

I 14 X86 :MOV r8, r8 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c

I 15 X86 :MOV r16, r16 L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I 16 X86 :MOV r32, r32 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c

I 17 AMD64 :MOV r64, r64 L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I 18 X86 :MOV r8, [memr8] L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I 19 X86 :MOV r16, [memr16] L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I 20 X86 :MOV r32, [memr32] L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I 21 AMD64 :MOV r64, [memr64] L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I 22 X86 :MOV [memr8], r8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 23 X86 :MOV [memr16], r16 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 24 X86 :MOV [memr32], r32 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 25 AMD64 :MOV [memr64], r64 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 26 CMOV :CMOV r16, r16 L: 0.75ns= 2.0c TP: 0.37ns= 1.00c

I 27 CMOV :CMOV r32, r32 L: 0.75ns= 2.0c TP: 0.37ns= 1.00c

I 28 AMD64 :CMOV r64, r64 L: 0.75ns= 2.0c TP: 0.37ns= 1.00c

I 29 X86 :ADD r8, r8 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c

I 30 X86 :ADD r16, r16 L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I 31 X86 :ADD r32, r32 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c

I 32 AMD64 :ADD r64, r64 L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I 33 X86 :ADC r8, r8 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I 34 X86 :ADC r16, r16 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I 35 X86 :ADC r32, r32 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I 36 AMD64 :ADC r64, r64 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I 37 X86 :CMP r8, r8 L: TP: 0.12ns= 0.33c

I 38 X86 :CMP r16, r16 L: TP: 0.12ns= 0.33c

I 39 X86 :CMP r32, r32 L: TP: 0.12ns= 0.33c

I 40 AMD64 :CMP r64, r64 L: TP: 0.12ns= 0.33c

I 41 X86 :CMP r8_1, r8_2 L: TP: 0.12ns= 0.33c

I 42 X86 :CMP r16_1, r16_2 L: TP: 0.12ns= 0.33c

I 43 X86 :CMP r32_1, r32_2 L: TP: 0.12ns= 0.33c

I 44 AMD64 :CMP r64_1, r64_2 L: TP: 0.12ns= 0.33c

I 45 X86 :AND r8, r8 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c

I 46 X86 :AND r16, r16 L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I 47 X86 :AND r32, r32 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c

I 48 AMD64 :AND r64, r64 L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I 49 X86 :AND r8_1, r8_2 L: 0.37ns= 1.0c TP: 0.17ns= 0.47c

I 50 X86 :AND r16_1, r16_2 L: 0.37ns= 1.0c TP: 0.09ns= 0.23c

I 51 X86 :AND r32_1, r32_2 L: 0.37ns= 1.0c TP: 0.21ns= 0.55c

I 52 AMD64 :AND r64_1, r64_2 L: 0.37ns= 1.0c TP: 0.09ns= 0.23c

I 53 X86 :OR r8, r8 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c

I 54 X86 :OR r16, r16 L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I 55 X86 :OR r32, r32 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c

I 56 AMD64 :OR r64, r64 L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I 57 X86 :OR r8_1, r8_2 L: 0.37ns= 1.0c TP: 0.17ns= 0.47c

I 58 X86 :OR r16_1, r16_2 L: 0.37ns= 1.0c TP: 0.09ns= 0.23c

I 59 X86 :OR r32_1, r32_2 L: 0.37ns= 1.0c TP: 0.21ns= 0.55c

I 60 AMD64 :OR r64_1, r64_2 L: 0.37ns= 1.0c TP: 0.09ns= 0.23c

I 61 X86 :XOR r8, r8 L: 0.12ns= 0.3c TP: 0.12ns= 0.33c

I 62 X86 :XOR r16, r16 L: 0.12ns= 0.3c TP: 0.12ns= 0.33c

I 63 X86 :XOR r32, r32 L: 0.12ns= 0.3c TP: 0.12ns= 0.33c

I 64 AMD64 :XOR r64, r64 L: 0.12ns= 0.3c TP: 0.12ns= 0.33c

I 65 X86 :XOR r8_1, r8_2 L: 0.37ns= 1.0c TP: 0.17ns= 0.47c

I 66 X86 :XOR r16_1, r16_2 L: 0.37ns= 1.0c TP: 0.09ns= 0.23c

I 67 X86 :XOR r32_1, r32_2 L: 0.37ns= 1.0c TP: 0.21ns= 0.55c

I 68 AMD64 :XOR r64_1, r64_2 L: 0.37ns= 1.0c TP: 0.09ns= 0.23c

I 69 X86 :INC r8 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c

I 70 X86 :INC r16 L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I 71 X86 :INC r32 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c

I 72 AMD64 :INC r64 L: 0.37ns= 1.0c TP: 0.13ns= 0.34c

I 73 X86 :LEA r16, [r+r] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 74 X86 :LEA r32, [r+r] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 75 AMD64 :LEA r64, [r+r] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 76 X86 :LEA r16, [r+r+imm] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 77 X86 :LEA r32, [r+r+imm] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 78 AMD64 :LEA r64, [r+r+imm] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 79 X86 :LEA r16, [r+8*r] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 80 X86 :LEA r32, [r+8*r] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 81 AMD64 :LEA r64, [r+8*r] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 82 X86 :LEA r16, [r+8*r+imm] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 83 X86 :LEA r32, [r+8*r+imm] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 84 AMD64 :LEA r64, [r+8*r+imm] L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I 85 X86 :SHL r8, 1 L: 0.37ns= 1.0c TP: 0.17ns= 0.46c

I 86 X86 :SHL r16, 1 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I 87 X86 :SHL r32, 1 L: 0.37ns= 1.0c TP: 0.17ns= 0.46c

I 88 AMD64 :SHL r64, 1 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I 89 X86 :SHL r8, cl L: 0.37ns= 1.0c TP: 0.17ns= 0.45c

I 90 X86 :SHL r16, cl L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I 91 X86 :SHL r32, cl L: 0.37ns= 1.0c TP: 0.17ns= 0.45c

I 92 AMD64 :SHL r64, cl L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I 93 X86 :SHL r8, imm8 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I 94 X86 :SHL r16, imm8 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I 95 X86 :SHL r32, imm8 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I 96 AMD64 :SHL r64, imm8 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I 97 X86 :SHR r8, 1 L: 0.37ns= 1.0c TP: 0.17ns= 0.46c

I 98 X86 :SHR r16, 1 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I 99 X86 :SHR r32, 1 L: 0.37ns= 1.0c TP: 0.17ns= 0.46c

I100 AMD64 :SHR r64, 1 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I101 X86 :SHR r8, cl L: 0.37ns= 1.0c TP: 0.17ns= 0.45c

I102 X86 :SHR r16, cl L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I103 X86 :SHR r32, cl L: 0.37ns= 1.0c TP: 0.17ns= 0.45c

I104 AMD64 :SHR r64, cl L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I105 X86 :SHR r8, imm8 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I106 X86 :SHR r16, imm8 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I107 X86 :SHR r32, imm8 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I108 AMD64 :SHR r64, imm8 L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I109 X86 :ROL r8, 1 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I110 X86 :ROL r16, 1 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I111 X86 :ROL r32, 1 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I112 AMD64 :ROL r64, 1 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I113 X86 :ROL r8, cl L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I114 X86 :ROL r16, cl L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I115 X86 :ROL r32, cl L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I116 AMD64 :ROL r64, cl L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I117 X86 :ROL r8, imm8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I118 X86 :ROL r16, imm8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I119 X86 :ROL r32, imm8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I120 AMD64 :ROL r64, imm8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I121 X86 :ROR r8, 1 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I122 X86 :ROR r16, 1 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I123 X86 :ROR r32, 1 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I124 AMD64 :ROR r64, 1 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I125 X86 :ROR r8, cl L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I126 X86 :ROR r16, cl L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I127 X86 :ROR r32, cl L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I128 AMD64 :ROR r64, cl L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I129 X86 :ROR r8, imm8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I130 X86 :ROR r16, imm8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I131 X86 :ROR r32, imm8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I132 AMD64 :ROR r64, imm8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I133 X86 :RCL r8, 1 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I134 X86 :RCL r16, 1 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I135 X86 :RCL r32, 1 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I136 AMD64 :RCL r64, 1 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I137 X86 :RCL r8, cl L: 4.50ns= 12.0c TP: 3.75ns= 10.00c

I138 X86 :RCL r16, cl L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I139 X86 :RCL r32, cl L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I140 AMD64 :RCL r64, cl L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I141 X86 :RCL r8, imm8 L: 4.50ns= 12.0c TP: 3.75ns= 10.00c

I142 X86 :RCL r16, imm8 L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I143 X86 :RCL r32, imm8 L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I144 AMD64 :RCL r64, imm8 L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I145 X86 :RCR r8, 1 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I146 X86 :RCR r16, 1 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I147 X86 :RCR r32, 1 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I148 AMD64 :RCR r64, 1 L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I149 X86 :RCR r8, cl L: 4.59ns= 12.3c TP: 4.12ns= 11.00c

I150 X86 :RCR r16, cl L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I151 X86 :RCR r32, cl L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I152 AMD64 :RCR r64, cl L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I153 X86 :RCR r8, imm8 L: 4.59ns= 12.3c TP: 4.12ns= 11.00c

I154 X86 :RCR r16, imm8 L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I155 X86 :RCR r32, imm8 L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I156 AMD64 :RCR r64, imm8 L: 4.31ns= 11.5c TP: 3.75ns= 10.00c

I157 X86 :BSF r16, r16 L: 0.75ns= 2.0c TP: 0.37ns= 1.00c

I158 X86 :BSF r32, r32 L: 0.75ns= 2.0c TP: 0.37ns= 1.00c

I159 AMD64 :BSF r64, r64 L: 0.75ns= 2.0c TP: 0.37ns= 1.00c

I160 X86 :BSR r16, r16 L: 0.75ns= 2.0c TP: 0.37ns= 1.00c

I161 X86 :BSR r32, r32 L: 0.75ns= 2.0c TP: 0.37ns= 1.00c

I162 AMD64 :BSR r64, r64 L: 0.75ns= 2.0c TP: 0.37ns= 1.00c

I163 X86 :BSWAP r32 L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I164 AMD64 :BSWAP r64 L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I165 X86 :IMUL r16, r16 L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I166 X86 :IMUL r32, r32 L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I167 AMD64 :IMUL r64, r64 L: 1.87ns= 5.0c TP: 0.75ns= 2.00c

I168 X86 :IMUL r16, r16, imm8 L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I169 X86 :IMUL r32, r32, imm8 L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I170 AMD64 :IMUL r64, r64, imm8 L: 1.87ns= 5.0c TP: 0.75ns= 2.00c

I171 X86 :IMUL r16, r16, imm16 L: 1.12ns= 3.0c TP: 0.91ns= 2.42c

I172 X86 :IMUL r32, r32, imm32 L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I173 AMD64 :IMUL r64, r64, imm32 L: 1.87ns= 5.0c TP: 0.75ns= 2.00c

I174 X86 :IMUL r8 L: 1.50ns= 4.0c TP: 1.12ns= 3.00c

I175 X86 :IMUL r16 L: 1.81ns= 4.8c TP: 1.78ns= 4.75c

I176 X86 :IMUL r32 L: 1.81ns= 4.8c TP: 1.78ns= 4.75c

I177 AMD64 :IMUL r64 L: 2.62ns= 7.0c TP: 2.62ns= 7.00c

I178 X86 :MUL r8 L: 1.50ns= 4.0c TP: 1.12ns= 3.00c

I179 X86 :MUL r16 L: 1.81ns= 4.8c TP: 1.78ns= 4.75c

I180 X86 :MUL r32 L: 1.81ns= 4.8c TP: 1.78ns= 4.75c

I181 AMD64 :MUL r64 L: 2.62ns= 7.0c TP: 2.62ns= 7.00c

I182 X86 :IDIV r8 L: 6.78ns= 18.1c TP: 6.78ns= 18.08c

I183 X86 :IDIV r16 L: 6.37ns= 17.0c TP: 6.37ns= 17.00c

I184 X86 :IDIV r32 L: 6.37ns= 17.0c TP: 6.37ns= 17.00c

I185 AMD64 :IDIV r64 L: 15.37ns= 41.0c TP: 14.84ns= 39.58c

I186 X86 :DIV r8 L: 6.78ns= 18.1c TP: 6.78ns= 18.08c

I187 X86 :DIV r16 L: 6.37ns= 17.0c TP: 6.37ns= 17.00c

I188 X86 :DIV r32 L: 6.37ns= 17.0c TP: 6.37ns= 17.00c

I189 AMD64 :DIV r64 L: 11.75ns= 31.3c TP: 10.87ns= 29.00c

I190 X87 :FNOP L: TP: 0.37ns= 1.00c

I191 X87 :FXCH L: TP: 0.37ns= 1.00c

I192 X87 :FCHS L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I193 CMOV :FCMOV st, st(i) L: 0.75ns= 2.0c TP: 0.75ns= 2.00c

I194 X87 :FADD st(i), st L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I195 X87 :FADD st, st(i), FXCH st(i) L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I196 X87 :FMUL st(i), st L: 1.87ns= 5.0c TP: 0.66ns= 1.75c

I197 X87 :FMUL st, st(i), FXCH st(i) L: 1.87ns= 5.0c TP: 0.75ns= 2.00c

I198 X87 :FMUL+FADD st, st(i) L: 3.00ns= 8.0c TP:

I199 X87 :FMUL st(i), FADD st(i+1) L: 1.87ns= 5.0c TP:

I200 X87 :FDIV32 st(i), st L: 6.75ns= 18.0c TP: 6.37ns= 17.00c

I201 X87 :FDIV64 st(i), st L: 12.00ns= 32.0c TP: 11.62ns= 31.00c

I202 X87 :FDIV80 st(i), st L: 14.25ns= 38.0c TP: 13.87ns= 37.00c

I203 X87 :FDIV80 (0.0/x) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I204 X87 :FDIV80 (x/1.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I205 X87 :FDIV80 (x/2.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I206 X87 :FSQRT32 st L: 10.87ns= 29.0c TP: 10.50ns= 28.00c

I207 X87 :FSQRT64 st L: 21.75ns= 58.0c TP: 21.37ns= 57.00c

I208 X87 :FSQRT80 st L: 25.87ns= 69.0c TP: 25.50ns= 68.00c

I209 X87 :FSQRT80 (0.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I210 X87 :FSQRT80 (1.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I211 X87 :FCOM st(i) L: TP: 0.37ns= 1.00c

I212 CMOV :FCOMI st(i) L: TP: 0.37ns= 1.00c

I213 X87 :FDECSTP L: TP: 0.37ns= 1.00c

I214 X87 :FINCSTP L: TP: 0.37ns= 1.00c

I215 MMX :MOVD r32, mm L: TP: 0.12ns= 0.33c

I216 AMD64 :MOVD r64, mm L: TP: 0.12ns= 0.33c

I217 MMX :MOVQ mm, mm L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I218 MMX :PADDD mm, mm L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I219 MMX :PMULHW mm, mm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I220 MMX :PMADDWD mm, mm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I221 MMX :PSRLQ mm, mm L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I222 MMX :PUNPCKHDQ mm, mm L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I223 MMX :PACKSSDW mm, mm L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I234 SSE :PMOVMSKB r32, mm L: TP: 0.37ns= 1.00c

I235 SSE :PSHUFW mm, mm, im8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I236 SSE :PSADBW mm, mm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I237 SSE :MOVHLPS xmm, xmm L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I238 SSE :MOVHLPS xmm_1, xmm_2 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I239 SSE :MOVAPS xmm, xmm L: 0.37ns= 1.0c TP: 0.12ns= 0.33c

I240 SSE :MOVAPS xmm, [memr128] L: TP: 0.37ns= 1.00c

I241 SSE :MOVUPS xmm, [memr128 + 4] L: TP: 0.75ns= 2.00c

I242 SSE :UNPCKHPS xmm, xmm L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I243 SSE :COMISS xmm, xmm L: TP: 0.37ns= 1.00c

I244 SSE :CMPSS xmm, xmm, imm8 L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I245 SSE :CMPPS xmm, xmm, imm8 L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I246 SSE :MOVSS xmm, xmm L: 0.37ns= 1.0c TP: 0.12ns= 0.33c

I247 SSE :MOVSS xmm, [memr32] L: TP: 0.37ns= 1.00c

I248 SSE :ADDSS xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I249 SSE :ADDPS xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I250 SSE :MULSS xmm, xmm L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I251 SSE :MULPS xmm, xmm L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I252 SSE :MULSS+ADDSS xmm, xmm L: 2.62ns= 7.0c TP: 0.56ns= 1.50c

I253 SSE :MULPS+ADDPS xmm, xmm L: 2.62ns= 7.0c TP: 0.37ns= 1.00c

I254 SSE :MULSS xm1,xm1 ADDSS xm2,xm2 L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I255 SSE :MULPS xm1,xm1 ADDPS xm2,xm2 L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I256 SSE :MINSS xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I257 SSE :MINPS xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I258 SSE :RCPSS xmm, xmm L: 1.12ns= 3.0c TP: 0.75ns= 2.00c

I259 SSE :RCPPS xmm, xmm L: 1.12ns= 3.0c TP: 0.75ns= 2.00c

I260 SSE :ANDNPS xmm, xmm L: 0.37ns= 1.0c TP: 0.13ns= 0.33c

I261 SSE :ANDNPS xmm_1, xmm_2 L: 0.37ns= 1.0c TP: 0.10ns= 0.26c

I262 SSE :ANDPS xmm, xmm L: 0.37ns= 1.0c TP: 0.12ns= 0.33c

I263 SSE :ANDPS xmm_1, xmm_2 L: 0.37ns= 1.0c TP: 0.10ns= 0.26c

I264 SSE :ORPS xmm, xmm L: 0.37ns= 1.0c TP: 0.12ns= 0.33c

I265 SSE :ORPS xmm_1, xmm_2 L: 0.37ns= 1.0c TP: 0.10ns= 0.26c

I266 SSE :XORPS xmm, xmm L: 0.12ns= 0.3c TP: 0.12ns= 0.33c

I267 SSE :XORPS xmm_1, xmm_2 L: 0.37ns= 1.0c TP: 0.10ns= 0.26c

I268 SSE :SHUFPS xmm, xmm, im8 L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I269 SSE :DIVSS xmm, xmm L: 6.75ns= 18.0c TP: 6.37ns= 17.00c

I270 SSE :DIVPS xmm, xmm L: 6.75ns= 18.0c TP: 6.37ns= 17.00c

I271 SSE :DIVSS (0.0/x) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I272 SSE :DIVSS (x/1.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I273 SSE :DIVSS (x/2.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I274 SSE :DIVPS (0.0/x) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I275 SSE :DIVPS (x/1.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I276 SSE :DIVPS (x/2.0) L: 2.19ns= 5.8c TP: 1.87ns= 5.00c

I277 SSE :SQRTSS xmm, xmm L: 10.87ns= 29.0c TP: 10.50ns= 28.00c

I278 SSE :SQRTPS xmm, xmm L: 10.87ns= 29.0c TP: 10.50ns= 28.00c

I279 SSE :SQRTSS (0.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I280 SSE :SQRTSS (1.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I281 SSE :SQRTPS (0.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I282 SSE :SQRTPS (1.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I283 SSE2 :MOVAPD xmm, xmm L: 0.37ns= 1.0c TP: 0.12ns= 0.33c

I284 SSE2 :MOVAPD xmm, [memr128] L: TP: 0.37ns= 1.00c

I285 SSE2 :MOVUPD xmm, [memr128 + 4] L: TP: 0.47ns= 1.25c

I286 SSE2 :UNPCKHPD xmm, xmm L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I287 SSE2 :COMISD xmm, xmm L: TP: 0.37ns= 1.00c

I288 SSE2 :CMPSD xmm, xmm, imm8 L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I289 SSE2 :CMPPD xmm, xmm, imm8 L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I290 SSE2 :MOVQ xmm, xmm L: 0.37ns= 1.0c TP: 0.12ns= 0.33c

I291 SSE2 :MOVSD xmm, xmm L: 0.37ns= 1.0c TP: 0.12ns= 0.33c

I292 SSE2 :MOVSD xmm, [memr64] L: TP: 0.37ns= 1.00c

I293 SSE2 :ADDSD xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I294 SSE2 :ADDPD xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I295 SSE2 :MULSD xmm, xmm L: 1.87ns= 5.0c TP: 0.37ns= 1.00c

I296 SSE2 :MULPD xmm, xmm L: 1.87ns= 5.0c TP: 0.37ns= 1.00c

I297 SSE2 :MULSD+ADDSD xmm, xmm L: 3.00ns= 8.0c TP: 0.56ns= 1.50c

I298 SSE2 :MULPD+ADDPD xmm, xmm L: 3.00ns= 8.0c TP: 0.56ns= 1.50c

I299 SSE2 :MULSD xm1,xm1 ADDSD xm2,xm2 L: 1.87ns= 5.0c TP: 0.37ns= 1.00c

I300 SSE2 :MULPD xm1,xm1 ADDPD xm2,xm2 L: 1.87ns= 5.0c TP: 0.37ns= 1.00c

I301 SSE2 :MINSD xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I302 SSE2 :MINPD xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I303 SSE2 :ANDNPD xmm, xmm L: 0.37ns= 1.0c TP: 0.13ns= 0.34c

I304 SSE2 :ANDNPD xmm_1, xmm_2 L: 0.37ns= 1.0c TP: 0.10ns= 0.26c

I305 SSE2 :ANDPD xmm, xmm L: 0.37ns= 1.0c TP: 0.12ns= 0.33c

I306 SSE2 :ANDPD xmm_1, xmm_2 L: 0.37ns= 1.0c TP: 0.10ns= 0.26c

I307 SSE2 :ORPD xmm, xmm L: 0.37ns= 1.0c TP: 0.12ns= 0.33c

I308 SSE2 :ORPD xmm_1, xmm_2 L: 0.37ns= 1.0c TP: 0.10ns= 0.26c

I309 SSE2 :XORPD xmm, xmm L: 0.12ns= 0.3c TP: 0.12ns= 0.33c

I310 SSE2 :XORPD xmm_1, xmm_2 L: 0.37ns= 1.0c TP: 0.10ns= 0.26c

I311 SSE2 :SHUFPD xmm, xmm, im8 L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I312 SSE2 :DIVSD xmm, xmm L: 12.00ns= 32.0c TP: 11.62ns= 31.00c

I313 SSE2 :DIVPD xmm, xmm L: 12.00ns= 32.0c TP: 11.62ns= 31.00c

I314 SSE2 :DIVSD (0.0/x) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I315 SSE2 :DIVSD (x/1.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I316 SSE2 :DIVSD (x/2.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I317 SSE2 :DIVPD (0.0/x) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I318 SSE2 :DIVPD (x/1.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I319 SSE2 :DIVPD (x/2.0) L: 2.19ns= 5.8c TP: 1.87ns= 5.00c

I320 SSE2 :SQRTSD xmm, xmm L: 21.75ns= 58.0c TP: 21.37ns= 57.00c

I321 SSE2 :SQRTPD xmm, xmm L: 21.75ns= 58.0c TP: 21.37ns= 57.00c

I322 SSE2 :SQRTSD (0.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I323 SSE2 :SQRTSD (1.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I324 SSE2 :SQRTPD (0.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I325 SSE2 :SQRTPD (1.0) L: 2.25ns= 6.0c TP: 1.87ns= 5.00c

I326 SSE2 :MOVDQA xmm, xmm L: 0.37ns= 1.0c TP: 0.12ns= 0.33c

I327 SSE2 :PADDD xmm, xmm L: 0.37ns= 1.0c TP: 0.19ns= 0.50c

I328 SSE2 :PMULHW xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I329 SSE2 :PMADDWD xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I330 SSE2 :PSRLQ xmm, xmm L: 0.75ns= 2.0c TP: 0.37ns= 1.00c

I331 SSE2 :PSHUFD xmm, xmm L: 1.50ns= 4.0c TP: 0.37ns= 1.00c

I332 SSE2 :PUNPCKHDQ xmm, xmm L: 1.50ns= 4.0c TP: 0.75ns= 2.00c

I333 SSE2 :PACKSSDW xmm, xmm L: 1.50ns= 4.0c TP: 0.75ns= 2.00c

I334 SSE2 :PSADBW xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I335 SSE3 :ADDSUBPS xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I336 SSE3 :ADDSUBPD xmm, xmm L: 1.12ns= 3.0c TP: 0.37ns= 1.00c

I337 SSE3 :HADDPS xmm, xmm L: 3.37ns= 9.0c TP: 1.12ns= 3.00c

I338 SSE3 :HADDPD xmm, xmm L: 1.87ns= 5.0c TP: 0.75ns= 2.00c

I339 SSE3 :MOVDDUP xmm, xmm L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I340 SSE3 :MOVSHDUP xmm, xmm L: 0.37ns= 1.0c TP: 0.37ns= 1.00c

I341 SSE3 :LDDQU xmm, [memr128 + 4] L: TP: 0.47ns= 1.25c
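For anyone who wants to sort or filter these numbers, here is a small Python sketch that parses dump lines into records. The regular expression is only my guess at the layout based on the lines as they appear in this post, not an official EVEREST format, and the names `LINE_RE`, `VALUE_RE`, `cycles_or_none` and `parse_line` are made up for illustration. Missing latency or throughput fields come back as None.

```python
import re

# Layout assumed from the lines in this post:
#   I<index> <ISA> :<instruction> L: <ns>ns= <cycles>c TP: <ns>ns= <cycles>c
# Either the L: or the TP: field may be empty.
LINE_RE = re.compile(
    r"^I\s*(?P<index>\d+)\s+(?P<isa>\S+)\s*:(?P<instr>.*?)\s+"
    r"L:\s*(?P<lat>.*?)\s*TP:\s*(?P<tp>.*)$"
)
VALUE_RE = re.compile(r"(?P<ns>[\d.]+)ns=\s*(?P<cycles>[\d.]+)c")

def cycles_or_none(field):
    """Return the cycle count from an 'X.XXns= Y.Yc' field, or None if empty."""
    match = VALUE_RE.search(field)
    return float(match.group("cycles")) if match else None

def parse_line(line):
    """Parse one dump line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "index": int(match.group("index")),
        "isa": match.group("isa"),
        "instr": match.group("instr").strip(),
        "latency_cycles": cycles_or_none(match.group("lat")),
        "throughput_cycles": cycles_or_none(match.group("tp")),
    }

sample = [
    "I 0 X86 :NOP L: TP: 0.12ns= 0.33c",
    "I 29 X86 :ADD r8, r8 L: 0.37ns= 1.0c TP: 0.12ns= 0.32c",
    "I189 AMD64 :DIV r64 L: 11.75ns= 31.3c TP: 10.87ns= 29.00c",
    "I198 X87 :FMUL+FADD st, st(i) L: 3.00ns= 8.0c TP:",
]
rows = [parse_line(line) for line in sample]
for row in rows:
    print(row)
```

From there it is easy to, say, list every instruction whose throughput value is below 1 cycle.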


Original post:
http://aceshardware.com/forums/read_post.jsp?id=120057200&forumid=1