Rob Thorpe on 5/10/07 wrote:
>Still it's not that much of a problem, a code generator
>should be able to minimize the effect.

Sure, but the world would be a better place if you didn't
have to make uarch-specific optimizations, and could just
get code that works well for everybody. It looks like the
optimizations for Core 2 and Barcelona are generally going
to work well for both, though.

[ Using push/pop in prologues/epilogues ]
>Are you sure that doing it that way has a significant
>effect on x86-64 code? Generally what's done is that the
>stack pointer is adjusted once then the local variables
>are MOVed into place.

The optimization guide explicitly says that using push is now
preferable (and that's a change wrt previous AMD rules).

And yes, the size difference can be quite noticeable. A
push is a single-byte op (two for the regs that need REX),
while a "mov to stack" is 5 bytes or more.
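
To put rough numbers on that, here is a small Python sketch that
adds up the encoded bytes for saving registers with push versus
with mov to a stack slot. The register list (the System V x86-64
callee-saved set), the helper names and the starting offset are my
own assumptions for illustration, not something from the
optimization manual, and the stack pointer adjustment itself is
left out of the count.

```python
# Sketch: encoded size of saving registers with "push" vs "mov to stack".
# All names here are hypothetical helpers for illustration only.

# Callee-saved registers in the System V x86-64 ABI (assumed example set).
CALLEE_SAVED = ["rbx", "rbp", "r12", "r13", "r14", "r15"]

# Registers r8..r15 need a REX prefix to be encoded at all.
EXTENDED = {"r%d" % n for n in range(8, 16)}


def push_size(reg):
    """Bytes for 'push reg': one opcode byte, plus REX for r8..r15."""
    return 2 if reg in EXTENDED else 1


def mov_to_stack_size(disp):
    """Bytes for 'mov reg, disp(%rsp)' with a 64-bit register.

    REX.W + opcode + ModRM + SIB is 4 bytes, then the displacement:
    none for offset 0, 1 byte if it fits in a signed byte, else 4 bytes.
    """
    if disp == 0:
        return 4
    if -128 <= disp <= 127:
        return 5
    return 8


def save_area_sizes(regs, first_offset=8):
    """Return (push_bytes, mov_bytes) for saving 'regs'.

    first_offset is an assumed offset of the first save slot from %rsp;
    each following slot is 8 bytes higher.
    """
    push_total = sum(push_size(r) for r in regs)
    mov_total = sum(
        mov_to_stack_size(first_offset + 8 * i) for i in range(len(regs))
    )
    return push_total, mov_total


if __name__ == "__main__":
    pushes, movs = save_area_sizes(CALLEE_SAVED)
    print("push: %d bytes, mov: %d bytes" % (pushes, movs))
```

With those assumptions, saving all six registers comes to 10 bytes
of pushes against 30 bytes of movs, and the pops versus reloads in
each epilogue have the same sizes again.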

It adds up when every single function tends to do several
of these things both in the entry and exit path. And since
you often have multiple epilogues, it also makes things
like short conditional branches less effective.
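
The short-branch point can be sketched the same way. A conditional
branch with an 8-bit displacement is 2 bytes, while the 32-bit
displacement form is 6 bytes, so code that grows past the 8-bit
range costs extra bytes at each such branch. The helper name below
is hypothetical, and the displacement is assumed to be measured from
the end of the branch instruction, as the hardware does.

```python
# Sketch: pick the encoded size of an x86 conditional branch (jcc)
# from its displacement. Hypothetical helper, for illustration only.

def jcc_size(disp):
    """Bytes for a conditional branch with the given displacement.

    disp is relative to the end of the branch instruction.
    rel8 form: opcode + 1-byte displacement = 2 bytes.
    rel32 form: 2-byte opcode + 4-byte displacement = 6 bytes.
    """
    if -128 <= disp <= 127:
        return 2
    return 6


if __name__ == "__main__":
    for disp in (100, 127, 128):
        print(disp, jcc_size(disp))
```

So a branch over an epilogue that grows from under 128 bytes to
over it goes from 2 bytes to 6.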

Is it "significant"? I don't think it's a huge issue on its
own, but it's a part of making code denser and getting
better I$ behaviour. I may be crazy, but I think I$ density

>(Also, I wonder if the two things you mention are
>connected. To make push faster they will have redesigned
>the way stack related ops work including ret.)

Maybe. Or it might just be a branch prediction artifact,
where the prediction done by "ret" might be special-cased
into an earlier stage, and ends up conflicting with the
prediction done for the previous conditional branch.

There appears to be something magical about the single-byte
"ret" form, because the manual states that you can avoid
the problem with the 3-byte "ret $0" form. But yeah, that
one would probably have two stages of stack op logic,
so it could well be about that.

