By: Patrick Chase (patrickjchase.delete@this.gmail.com), July 9, 2015 9:43 am
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on July 8, 2015 8:23 pm wrote:
> Inspired by this paper, I looked at the Geekbench3 Lua single core results (which
> I assume are basically an interpreter, and thus as good a proxy as one can hope
> for in figuring out this stuff). The results are very interesting.
> A8 gets 1787/1.4 = 1276 (score/frequency)
> Sandy Bridge gets 4269/3.4=1255
> Haswell gets 4325/3.3=1310
> Nehalem gets 2284/3.2=713
> (64-bit for everything except the Nehalem result, where I could not find a 64-bit Windows
> result. For some strange reason there are also no Broadwell 64-bit results yet; out of
> interest the 32-bit result is 2693/2.3 = 1171, which perhaps we can take as indicating a
> 10% penalty for 32-bit mode, giving us some feel for what a 64-bit Nehalem result might
> be.)
Dividing score by clock frequency is an invariably misleading exercise, because performance doesn't scale linearly with clock in the real world, for two reasons:
1. High-clocked designs need deeper pipelines, and therefore require more accurate speculation merely to hold their own. Put another way, a slow design can get the same per-MHz performance with a weaker branch predictor, because it "loses" fewer instructions on each mispredict (and in the limiting case, a slow enough design needs no prediction at all, since it can resolve the branch before the result is needed). P4 exhibited this phenomenon in spades - its predictors (branch, set, etc.) were actually quite competitive, but the microarchitecture had been pushed so aggressively that the results were problematic anyway.
2. Memory latency is (to first order) a constant amount of time, not a constant number of clocks.
You can address (2) by downclocking the faster design, but (1) is much harder to correct for as it requires a nonexistent uarch (a "short-pipe" variant of the faster design).
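To put illustrative numbers on (1) and (2) - every figure below is a made-up round value chosen for the sketch (assumed miss latency, pipeline depths, and issue width), not a measurement of any of the parts above:

    #include <stdio.h>

    int main(void) {
        /* (2): a DRAM miss costs roughly constant *time*, so the same
         * assumed ~80 ns miss costs far more cycles at a higher clock. */
        double miss_ns = 80.0;                  /* assumed miss latency */
        double slow_ghz = 1.4, fast_ghz = 3.4;  /* A8-ish vs SNB-ish    */
        printf("miss: %.0f cycles @ %.1f GHz, %.0f cycles @ %.1f GHz\n",
               miss_ns * slow_ghz, slow_ghz, miss_ns * fast_ghz, fast_ghz);

        /* (1): a mispredict flushes roughly pipeline-depth cycles of
         * in-flight work, so a deeper pipe loses more issue slots per
         * flush and needs a lower mispredict rate to break even per-MHz. */
        double slow_depth = 12.0, fast_depth = 20.0; /* assumed depths */
        double issue_width = 4.0;                    /* assumed width  */
        printf("slots lost per mispredict: %.0f vs %.0f\n",
               slow_depth * issue_width, fast_depth * issue_width);
        return 0;
    }

At those assumed numbers the fast design pays 272 cycles per miss versus 112, and loses 80 issue slots per mispredict versus 48 - which is exactly why dividing a score by GHz flatters the low-clocked part.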
> It's merely a hint, not a proof, but it suggests that the intuition is correct (that is,
> it gives the expected big jump in performance for an interpreter from Nehalem to Sandy
> Bridge).
I agree with the last part of that statement: the Nehalem->SB results are indeed as expected, and are useful inasmuch as they reflect designs that were optimized for and evaluated at similar clock rates. I also agree with the first part, though even "hint" may be overstating the usefulness of the data. That said, I don't think your broader intuition gains *any* support from this: across designs at very different clocks, dividing score by frequency is a meaningless exercise with meaningless results.
> It also suggests that whatever Apple is using for their branch predictor it's pretty
> impressive. Perhaps not at the TAGE-ITTAGE level of Haswell (especially when we consider
> that they can make up for a few more branch mispredictions when they can run wider)
More to the point, they can *tolerate* a lot more mispredictions (probably on the order of twice the rate) because they run at a much lower clock. I don't see any evidence either way about predictor quality here, just meaningless numbers.
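For anyone wondering why an interpreter makes such a good predictor stress test: essentially every VM instruction ends in a data-dependent dispatch (a switch or computed goto, which typically compiles to an indirect jump through a table), and that is exactly the case ITTAGE-class predictors target. A minimal generic sketch - not Lua's actual VM:

    #include <stdio.h>

    enum { OP_INC, OP_DEC, OP_HALT };

    /* Every iteration ends in an indirect branch (the switch dispatch)
     * whose target depends on the *data* (the next opcode). Predicting
     * it well requires correlating on long branch history, which is
     * what TAGE/ITTAGE-style predictors are built to do. */
    static int run(const unsigned char *pc) {
        int acc = 0;
        for (;;) {
            switch (*pc++) {
            case OP_INC: acc++; break;
            case OP_DEC: acc--; break;
            default:     return acc;   /* OP_HALT */
            }
        }
    }

    int main(void) {
        unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
        printf("%d\n", run(prog));     /* prints 1 */
        return 0;
    }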