By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), October 31, 2008 8:08 am
Room: Moderated Discussions
Howard Chu (hyc@symas.com) on 10/31/08 wrote:
>
>Given the flakiness of the tools, it would have been
>worthwhile to hack the codepath selection in each of the
>programs-under-test, to force identical codepaths on both
>platforms.
That really isn't very easy at all.
In fact, even if you make your CPU lie about cpuid (by
using virtualization, for example), or force the software
to ignore cpuid and always use the same code path, there
is really a bigger problem: timing-based path selection.
For example, David used "misses per kilo-instruction" as
a way to 'normalize' the numbers, but that's often not a
good normalization at all.
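(Just to be explicit about the metric - assuming the usual definition here:

    MPKI = cache misses / (instructions retired / 1000)

so anything that inflates the retired-instruction count without adding misses will drag the number down.)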
Why? Because rather than 'normalize' things for path
differences, it can cause seriously misleading values in
the face of anything that is timing-sensitive.
For example, let's assume that some of the benchmarks are
almost entirely limited by the graphics card (which is not
at all unlikely for the high-quality cases for some of the
games). What does that lead to?
It leads to the CPU being throttled, and while throttling,
you're going to get a very special code-path selection,
and not one that is at all dependent on the type of CPU.
Now, if the throttling ends up doing something that
isn't counted at all (for example, it might halt the CPU
waiting for an interrupt from the graphics card), you are
going to get numbers that are still largely "relevant". The
"misses-per-instruction" is still a valid number.
But quite often, throttling isn't that graceful. Yeah, the game
may end up doing AI while waiting for graphics, and just
generally doing something relevant. But it's also quite
possible that the waiting ends up being some kind of busy loop.
Now, the "busy loop" may be a really big one, like the
Windows idle loop, but it can be a fairly tight one as
well. Especially for a game that is single-threaded, and
doesn't care about multi-tasking (and many games do not),
I can well see the case of "graphics card is busy" being
a very tight busy loop that just reads a status register.
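As a rough sketch of what I mean - this is not taken from any actual
game or driver, and the register name and bit are made up purely for
illustration - such a loop could be as small as:

    /* Hypothetical sketch only: GPU_BUSY_BIT and the status-register
     * pointer are made-up placeholders, not any real driver interface. */
    #include <stdint.h>

    #define GPU_BUSY_BIT 0x1u

    void wait_for_gpu(volatile uint32_t *gpu_status_reg)
    {
        /* Spins at full speed: one load, one test and one branch per
         * iteration, touching essentially no new cache lines. */
        while (*gpu_status_reg & GPU_BUSY_BIT)
            ;
    }

Every iteration of that retires a handful of instructions and causes
essentially zero cache misses, for however long the card keeps the
CPU waiting.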
And if so, your "per instruction" values may be very
misleading indeed. Depending on just how long you wait for
the graphics card, your statistics may be swamped not by
the actual work you do, but by all the dead time.
That's true regardless of whether the loop is large or
small, but with a small loop the results can be even more
misleading, especially if you're looking at things like cache
misses per instruction - your numbers may be more indicative
of the loop than of the load you actually want to measure,
and a tight loop is likely to differ far more wildly from
the real load than a large one does.
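To put some completely made-up numbers on it: say the real game code
retires 10 billion instructions over the measurement window with 50
million cache misses, and the spin loop adds another 40 billion
essentially miss-free instructions of dead time:

    real work only:   50M misses / 10G instructions  =  5.0 MPKI
    work + spinning:  50M misses / 50G instructions  =  1.0 MPKI

Same workload, same cache behavior for the part you actually care
about, and the reported number moved by a factor of five purely as a
function of how long the CPU sat waiting.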
Things like that can really make your numbers meaningless,
and hard to compare across CPUs (not just across different
architectures, but even on the same microarchitecture
running at different speeds).
It's quite possible that the games David tested had no
such issues, but in general, I would suggest that if there
is any possibility of timing-related effects influencing the
end result, you should try to test otherwise identical
machines at different CPU speeds, to at least verify that
timing does not make a huge difference.
So it would be interesting to hear, for example, whether
the Intel Core 2 numbers (that seemed to be much more
reliable) were similar when running at 2.93GHz and when
running at (say) 1.86GHz.
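(To spell out why that's a useful cross-check: if a frame really is
GPU-bound, the wall time is set by the graphics card, so dropping the
clock from 2.93GHz to 1.86GHz leaves the real work's instruction count
alone but - to first order, ignoring that the real work itself now
takes a bit longer - cuts the number of spin iterations per frame to
something like 1.86/2.93, or roughly two thirds, of what it was. A
metric dominated by the real work barely moves; a metric dominated by
the spinning shifts visibly with nothing but the clock speed.)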
If they are similar, you can have much better confidence
that the numbers are meaningful. And if they are not, then
you know that what you're looking at isn't even tied to the
microarchitecture, so comparing two different uarchs using
those numbers is much less likely to be interesting.
Think of it as an inherent "error bar". How big is it?
Linus