Article: AMD's Mobile Strategy
By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), December 18, 2011 6:56 pm
Room: Moderated Discussions
Linus Torvalds (torvalds@linux-foundation.org) on 12/18/11 wrote:
>
>Or maybe it's from the horrible P4 "load after store"
>behavior [..]
To explain: if you keep things in memory, and operate
on them directly multiple times, this made-up sequence
works fine to "round up to even":
inc mem
and $~1,mem
even if it's not optimal, it has the advantage of not using
any registers etc, and it's simple (the above is the gas
syntax with "destination last", the DOS/Intel syntax is
"destination first"). The above could easily be the output
from a non-optimizing compiler, or for the "minimize code
size" case.
But the P4 has some bad scheduling cases where if you read
from a memory location just the right number of cycles after
you wrote to it, the write hasn't "completed" yet and
the read has been moved earlier, and you can get into
retry hell and what should take a cycle or two takes ten
or twenty. I forget the exact details, and the above code
example is meant to be just an example of the concept, not
necessarily the way to really trigger it.
So it turns out that for P4, for reasons totally unrelated
to instruction decode or uops, under those circumstances
you can be *much* better off writing that as
mov mem,reg
inc reg
and $~1,reg
mov reg,mem
and blowing a register on it. The above is usually a bit
faster even on other Intel CPUs (i.e. a couple of cycles),
but on P4 it can be an order-of-magnitude difference (a
couple of cycles vs a couple of tens of cycles).
(Never mind that "inc" is also deprecated - feel free to
replace it with an add mentally if it disturbs you)
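For readers who think in C rather than assembly, the two sequences above could be sketched roughly as follows. The function names are hypothetical, and whether a compiler actually emits memory-operand instructions for the first version depends on the compiler and its flags; the volatile qualifier only forces the two statements to stay separate read-modify-write accesses, it does not guarantee any particular instruction selection.

```c
/* Hypothetical helper names, not from the original post. */

/* Memory-operand form: two back-to-back read-modify-writes on *p,
   so the second access loads from a location that was just stored to. */
static void round_up_even_mem(volatile unsigned *p)
{
    *p += 1;      /* roughly: inc mem    */
    *p &= ~1u;    /* roughly: and $~1,mem */
}

/* Register form: one load, two register-only operations, one store. */
static void round_up_even_reg(unsigned *p)
{
    unsigned r = *p;  /* mov mem,reg  */
    r += 1;           /* inc reg      */
    r &= ~1u;         /* and $~1,reg  */
    *p = r;           /* mov reg,mem  */
}
```

Both functions compute the same value; the only difference is how many times the memory location is touched, which is the part that matters for the P4 load-after-store case.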
But who knows. I'm just speculating where the "don't use
mem-op instructions on P4" meme may have come from.
Linus