Article: AMD's Mobile Strategy
By: Michael S (already5chosen.delete@this.yahoo.com), December 18, 2011 4:01 pm
Room: Moderated Discussions
Linus Torvalds (torvalds@linux-foundation.org) on 12/18/11 wrote:
>
>Really. Just compare the optimization manuals. It's
>quite clear. These days Intel suggests you use as many of
>the complex addressing modes and combined ops as possible,
>but that's a 180-degree change from the P4 days.
>
I started with the Intel optimization manual from Nov 2006. It contains exactly two lines of text about Pentium 4-specific front-end optimization:
"Because µops are delivered from the trace cache in the common cases, decoding rules and code alignment are not required."
So I took an older manual, 24896609.pdf. I don't remember the exact date, but it is likely from 2003 Q2 or Q3.
The only suggestion that remotely resembles what you're saying is this one: "Avoid complex instructions that require more than 4 uops". Obviously, it has nothing to do with our discussion, because the complex instructions we are talking about consist of either 2 uops (load-op) or 4 uops (read-modify-write), so none of them exceeds the 4-uop limit.
Here is another rule from the same manual that suggests exactly the same "opportunistic complexization" that I mentioned in my previous post:
Assembly/Compiler Coding Rule 59. (M impact, ML generality) For arithmetic or logical operations that have source operand in memory and the destination operand is in a register, attempt a strategy that initially loads the memory operand to register followed by a register to register ALU operation. Next, attempt to remove redundant loads by identifying loads from the same memory location. Finally, combine the remaining loads with their corresponding ALU operations.
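To make the three steps of Rule 59 concrete, here is a minimal Python sketch over a toy instruction list. The tuple format `(op, dst, src)`, the temporaries `t0`, `t1`, ... and all function names are my own assumptions for illustration, not anything from the manual. It also assumes there are no stores or aliasing between the instructions, which a real compiler would have to check.

```python
from collections import Counter

def is_mem(operand):
    # Toy convention (an assumption): memory operands are written "[name]".
    return operand.startswith("[")

def split_loads(code):
    """Step 1: turn `op reg, [mem]` into `mov tmp, [mem]` + `op reg, tmp`."""
    out, n = [], 0
    for op, dst, src in code:
        if is_mem(src):
            tmp = f"t{n}"
            n += 1
            out.append(("mov", tmp, src))
            out.append((op, dst, tmp))
        else:
            out.append((op, dst, src))
    return out

def remove_redundant_loads(code):
    """Step 2: keep only the first load from each memory location."""
    first_tmp = {}  # memory operand -> temporary that already holds it
    rename = {}     # dropped temporary -> surviving temporary
    out = []
    for op, dst, src in code:
        if op == "mov" and is_mem(src):
            if src in first_tmp:
                rename[dst] = first_tmp[src]
                continue
            first_tmp[src] = dst
            out.append((op, dst, src))
        else:
            out.append((op, dst, rename.get(src, src)))
    return out

def fold_single_use_loads(code):
    """Step 3: fold a load back into its ALU op if its value is used once."""
    uses = Counter(src for _, _, src in code if not is_mem(src))
    loads = {dst: src for op, dst, src in code if op == "mov" and is_mem(src)}
    out = []
    for op, dst, src in code:
        if op == "mov" and is_mem(src) and uses[dst] == 1:
            continue  # this load is folded into its single consumer below
        if src in loads and uses[src] == 1:
            src = loads[src]
        out.append((op, dst, src))
    return out

def rule59(code):
    return fold_single_use_loads(remove_redundant_loads(split_loads(code)))
```

For example, `rule59([("add", "eax", "[x]"), ("sub", "ebx", "[y]"), ("xor", "ecx", "[x]")])` keeps a single explicit load of `[x]` shared by the `add` and the `xor`, while the one-off load of `[y]` ends up folded back into `sub ebx, [y]`. That is the point: the load-op form is the default, and the separate load survives only where it removes a redundant memory access.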
Or how about Rule 60?
Give preference to adding a register to memory (memory is destination) instead of adding memory to a register. Also, give preference to adding a register to memory over loading the memory, adding two registers and storing the result.
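The second sentence of Rule 60 compares two sequences that do the same work. A toy Python interpreter (my own sketch; the register names, the `counter` location and the `run` helper are assumptions for illustration) shows that they leave memory in the same state, while the explicit version needs three instructions and a scratch register. It does not model uops or timing, only the architectural result.

```python
def run(code, regs, mem):
    """Execute a toy (op, dst, src) list; "[name]" denotes a memory operand."""
    regs, mem = dict(regs), dict(mem)  # work on copies

    def read(x):
        return mem[x[1:-1]] if x.startswith("[") else regs[x]

    def write(x, value):
        if x.startswith("["):
            mem[x[1:-1]] = value
        else:
            regs[x] = value

    for op, dst, src in code:
        if op == "mov":
            write(dst, read(src))
        elif op == "add":
            write(dst, (read(dst) + read(src)) & 0xFFFFFFFF)  # 32-bit wraparound
        else:
            raise ValueError(f"unsupported op: {op}")
    return regs, mem

# Form preferred by Rule 60: add a register to memory (read-modify-write).
rmw = [("add", "[counter]", "eax")]

# Explicit alternative: load, add two registers, store.
explicit = [
    ("mov", "edx", "[counter]"),
    ("add", "edx", "eax"),
    ("mov", "[counter]", "edx"),
]
```

Running both with the same starting state gives the same value in `counter`; the only visible difference is that the explicit form overwrites `edx`, which is exactly the register pressure that the manuals keep mentioning.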
Then I downloaded the even older 24896607.pdf. This one predates the Pentium M, so it is pure P4.
Here I see even more obvious rules suggesting the same "opportunistic complexization":
Assembly/Compiler Coding Rule 58. (M impact, ML generality)
Instead of explicitly loading the memory operand into a register and then operating on it, reduce register pressure by using the memory operand directly, if that memory operand is not reused soon.
So far, it seems to me that your memory failed you with regard to the Intel optimization manuals.