Article: AMD's Mobile Strategy
By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), December 18, 2011 12:48 pm
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 12/18/11 wrote:
>
>Why post-Pentium 4?
Because P4 wanted simple instructions. Really.
>All pre-Pentium-M Intel CPUs starting from i486 and, may be, even from i386 prefer,
>as general code generation strategy, code that consists of minimal # of uOps.
Yes. But with the various decode rules added in. And in
reality, decode rules were much more important than the
uop count rules, because the uop counts were usually fairly
close.
So in practice, counting uops is totally secondary.
>So, I'd say that RISC-like code generation is most useful for P5 and P6.
No. For P5, the dual pairing rules were most important,
and had nothing to do with RISC-like decode. You wanted
to get the whole U/V pipe thing right, in order to actually
issue two instructions at a time.
For P6, the 3-1-1 pattern was the primary thing to keep in
mind, but it didn't care that deeply about the uops.
It's most important for P4. I'm not sure exactly why that
is the case, but I suspect it is because it ends up
relaxing scheduling in the trace cache or something. The
exact rules for how many trace cache entries are used in
P4 (and the alignment in the trace cache) are complex and
odd.
IOW, for P4 it's not just about number of uops, it's about
how they pack in that dang trace cache.
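Going back to the P6 decode point: here is a toy model of why instruction ordering can matter more than the raw uop count. It is only a sketch. The function name `decode_cycles`, the greedy slot-filling policy and the default template `(3, 1, 1)` (taken from the "3-1-1" shorthand above) are assumptions made for illustration, not a description of the real decoder, so plug in whatever template the manual for your chip gives.

```python
def decode_cycles(uop_counts, template=(3, 1, 1)):
    """Count decode cycles for a sequence of instructions.

    uop_counts: uops per instruction, in program order.
    template:   max uops each decoder slot accepts per cycle
                (hypothetical default, mirroring the "3-1-1" shorthand).

    Simplifications (assumptions): the first slot always takes the next
    instruction, however many uops it has, and a cycle's group ends as
    soon as an instruction does not fit the next simple slot.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        i += 1  # first (complex) slot takes the next instruction
        for cap in template[1:]:
            if i < n and uop_counts[i] <= cap:
                i += 1  # fits in a simple slot
            else:
                break   # must wait for the complex slot next cycle
    return cycles


# Same instructions, same 8 uops in total, different order.
good_order = [2, 1, 1, 2, 1, 1]  # each 2-uop instruction leads a group
bad_order = [1, 1, 2, 1, 1, 2]   # 2-uop instructions land on simple slots

print(decode_cycles(good_order))  # 2
print(decode_cycles(bad_order))   # 3
```

Both sequences contain exactly the same uops, but in this model the second ordering takes half again as many decode cycles, which is the sense in which the decode rules dominate the uop count.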
>Pentium4, on the other hand, handles load-op and load-op-store instructions no better and
>no worse than equivalent 2 or 3 instr. RISC-like sequences. So, I'd expect from
>Pentium4-oriented compiler to emit such sequences opportunistically in order to improve code density.
Look again. Seriously. Just read the Intel optimization
manual. It is very clear, and for the longest time made
it a rule that you shouldn't use load-op instructions, you
should do loads and separate ops.
And the reason they said that was that P6 didn't care all
that much (although it would have preferred the decode
rules - but Intel cared much more about P4 code than code
for old chips they no longer sold). But the P4 did care.
Really. Just compare the optimization manuals. It's
quite clear. These days Intel suggests you use as many of
the complex addressing modes and combined ops as possible,
but that's a 180-degree change from the P4 days.
The P4 also had its own additional rules (often around the
slow shifter - avoid constant shifts and multiplies, use
long sequences of "lea" instructions if at all possible).
And the P4 rules actually *really* sucked if the code was
not in the trace cache, because then the P4 would be almost
entirely decode-limited (only one instruction per cycle),
so the "long sequence of 'lea' instead of a shift" was
actually exactly the wrong thing to do.
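For anyone who hasn't seen the "lea instead of shift" trick, this is roughly what it looks like, modelled in Python rather than written as real assembly. The helper `lea` and the functions `times8_shift`, `times8_lea` and `times10` are made-up names for this sketch, and 32-bit wraparound is assumed.

```python
MASK32 = 0xFFFFFFFF  # assume 32-bit registers


def lea(base, index, scale, disp=0):
    """Model of x86 'lea disp(base,index,scale)': base + index*scale + disp."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only encodes scale factors 1, 2, 4 and 8")
    return (base + index * scale + disp) & MASK32


def times8_shift(x):
    # One instruction: shl $3, %eax
    return (x << 3) & MASK32


def times8_lea(x):
    # Three instructions, no shifter involved: double the value three times.
    t = lea(x, x, 1)     # 2x
    t = lea(t, t, 1)     # 4x
    return lea(t, t, 1)  # 8x


def times10(x):
    # Constant multiply done as two lea-style steps.
    t = lea(x, x, 4)     # 5x
    return lea(t, t, 1)  # 10x


print(times8_shift(5), times8_lea(5))  # 40 40
print(times10(7))                      # 70
```

The results are identical, but the lea version of the shift is three instructions instead of one. That is harmless when the uops are already sitting in the trace cache, and it costs three decode slots instead of one when they are not.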
So in reality, the P4 rules were much more subtle, but
they were impossible to really get right, so you had some
code (usually tight loops) that ran like a bat out of hell,
and anything that didn't fit in the trace cache (or took
any of the myriad of microfaults) was slower than molasses.
Linus