By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), November 15, 2014 12:53 pm
Room: Moderated Discussions
Patrick Chase (patrickjchase.delete@this.gmail.com) on November 15, 2014 12:26 pm wrote:
>
> w.r.t. the memory wall, that's simply wrong. To a first order, all that matters is the number of "lost"
> instructions on a stall. A 4-wide CPU with a 50-cycle memory latency is basically identical in terms of
> "memory wall susceptibility" to a 2-wide CPU with a 100-cycle memory latency. Either way you "lose" 200
> issue slots on a miss
Amen. Preach it, brother.
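To spell out the arithmetic from the quote (a minimal sketch in C; the 4-wide/50-cycle and 2-wide/100-cycle machines are the hypothetical ones from Patrick's example):

    #include <stdio.h>

    /* To a first order, issue slots lost on a miss = issue width * miss
     * latency in cycles. */
    int main(void)
    {
        int wide_lost   = 4 * 50;    /* 4-wide CPU, 50-cycle miss  */
        int narrow_lost = 2 * 100;   /* 2-wide CPU, 100-cycle miss */

        printf("4-wide, 50-cycle miss:  %d lost issue slots\n", wide_lost);
        printf("2-wide, 100-cycle miss: %d lost issue slots\n", narrow_lost);
        return 0;
    }

Either way, 200 lost issue slots.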
The VLIW people (and yes, I saw this inside Transmeta) end up loving to ignore this, because they get to define "instruction" however they want. They'll count the individual sub-instructions when they want to look good and talk about how their IPC is fundamentally high, but then when counting the cost of stalls (and not just memory stalls, but also things like mispredicted branches) they'll suddenly count cycles or bundles or whatever, because obviously a branch mispredict isn't "20+ sub-instructions", it's just "7 cycles" or even "7 instructions". So it looks small, and it makes the short pipeline look like a great deal.
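A sketch of that counting game, with a made-up bundle width and penalty:

    #include <stdio.h>

    /* Hypothetical VLIW: 3 sub-instructions per bundle, one bundle
     * issued per cycle, 7-cycle mispredict penalty. */
    int main(void)
    {
        int ops_per_bundle = 3;
        int penalty_cycles = 7;

        int cost_in_bundles = penalty_cycles;                  /* "just 7"      */
        int cost_in_ops     = penalty_cycles * ops_per_bundle; /* 21 lost slots */

        printf("mispredict: %d bundles, or %d sub-instruction slots\n",
               cost_in_bundles, cost_in_ops);
        return 0;
    }

Same stall either way, and whichever number looks smaller is the one that gets quoted.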
Because the meaning of "instruction" is whatever you want it to be.
You don't tend to see quite the same kind of dishonesty from the superscalar/OoO people with traditional instructions, because they don't think in terms of some "inherent parallelism", but in terms of actual measured IPC - which has already taken the cost of mispredicted branches and memory stalls into account.
So in that camp, the main problem is that a lot of people seem to think that IPC can be measured and compared across different frequencies and different absolute performance. You see that here on RWT, where people are drooling over some high-IPC design running at 1GHz, and dismissing the much faster CPU that runs at three times the frequency but has somewhat lower IPC.
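Performance is IPC times clock frequency, so an IPC number means nothing without the clock attached. A minimal sketch of that scenario, with made-up numbers:

    #include <stdio.h>

    /* Instructions per second = IPC * frequency. Numbers hypothetical. */
    int main(void)
    {
        double brainiac_ipc  = 2.5, brainiac_hz  = 1e9; /* high IPC, 1GHz   */
        double speedster_ipc = 1.8, speedster_hz = 3e9; /* lower IPC, 3GHz  */

        printf("high-IPC @ 1GHz:  %.2e instructions/sec\n",
               brainiac_ipc * brainiac_hz);
        printf("lower-IPC @ 3GHz: %.2e instructions/sec\n",
               speedster_ipc * speedster_hz);
        return 0;
    }

The "somewhat lower IPC" part still finishes more than twice as much work per second.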
Of course, then you have the people who count speculative work in their IPC, and think that a code sequence that executes both sides of a conditional and picks the result with a conditional move or other predication is better than the conditional branch, because it looks better from an IPC standpoint: in the branch case the mispredicted work gets flushed and normally isn't counted at all, while the predicated version gets credit for every instruction it executes, useful or not.
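A hedged sketch of that accounting (the instruction and cycle counts are invented for illustration):

    #include <stdio.h>

    /* The predicated version executes both arms plus a select (12 retired
     * instructions); the branchy version executes only the taken arm
     * (6 retired instructions), and any mispredicted work is flushed, so
     * it never shows up in the retired count. Assume both finish in the
     * same 8 cycles. */
    int main(void)
    {
        double cycles             = 8.0;
        double predicated_retired = 12.0; /* both arms + cmov   */
        double branchy_retired    = 6.0;  /* the taken arm only */

        printf("predicated IPC: %.2f\n", predicated_retired / cycles);
        printf("branchy IPC:    %.2f\n", branchy_retired / cycles);
        return 0;
    }

Same useful work in the same time, but the predicated sequence reports twice the IPC.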
The Itanium people combined the speculative work problem with the VLIW problem.
Linus