By: David Kanter (dkanter.delete@this.realworldtech.com), November 3, 2008 9:52 am
Room: Moderated Discussions
Linus Torvalds (torvalds@linux-foundation.org) on 10/31/08 wrote:
---------------------------
>Howard Chu (hyc@symas.com) on 10/31/08 wrote:
>>
>>Given the flakiness of the tools, it would have been
>>worthwhile to hack the codepath selection in each of the
>>programs-under-test, to force identical codepaths on both
>>platforms.
>
>That really isn't very easy at all.
>
>In fact, even if you make your CPU lie about cpuid (by
>using virtualization, for example), or force the software
>to ignore cpuid and always use the same code path, there
>is really a bigger problem: timing-based path selection.
I would tend to agree. There was a great paper on TPC-C that showed that small changes in the timing of cache misses could result in pretty big OS scheduling differences. IIRC, in one run they injected a cache miss every 100 cycles (0, 100, 200...), while in a second run they injected a miss every 100 cycles offset by half a period (50, 150, 250...).
The overall difference in performance turned out to be around 5-10%.
Unfortunately I don't know how to avoid this problem, although as you later pointed out, detection is feasible.
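To make the timing-based selection problem concrete, here's a minimal sketch (entirely hypothetical - not taken from any of the programs under test, and copy_bytes/copy_memcpy/time_one are names I made up) of the kind of startup dispatch some libraries do: benchmark two candidate routines and use whichever was faster on this machine. Two different CPUs - or even two runs on the same CPU - can end up on different paths no matter what cpuid says:

/* Hypothetical startup dispatch based on measured timing rather
 * than cpuid. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N (1 << 20)
static unsigned char src[N], dst[N];

/* Candidate A: plain byte-copy loop. */
static void copy_bytes(unsigned char *d, const unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* Candidate B: the C library's memcpy. */
static void copy_memcpy(unsigned char *d, const unsigned char *s, size_t n)
{
    memcpy(d, s, n);
}

typedef void (*copy_fn)(unsigned char *, const unsigned char *, size_t);

/* Time 64 invocations of one candidate. */
static double time_one(copy_fn fn)
{
    clock_t t0 = clock();
    for (int i = 0; i < 64; i++)
        fn(dst, src, N);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    double ta = time_one(copy_bytes);
    double tb = time_one(copy_memcpy);
    /* The "winner" depends on cache sizes, clock speed, and whatever
     * else was running during the measurement - not on cpuid. */
    copy_fn chosen = (ta < tb) ? copy_bytes : copy_memcpy;
    printf("byte loop: %.3fs  memcpy: %.3fs -> using %s\n",
           ta, tb, chosen == copy_bytes ? "byte loop" : "memcpy");
    return 0;
}

Tuned math libraries and codecs do this kind of benchmarking at startup or install time, which is part of what makes forcing identical code paths so hard.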
>For example, David used "misses per kilo-instruction" as
>a way to 'normalize' the numbers, but that's often not a
>good normalization at all.
>
>Why? Because rather than 'normalize' things for path
>differences, it can cause seriously misleading values in
>the face of anything that is timing-sensitive.
>
>For example, let's assume that some of the benchmarks are
>almost entirely limited by the graphics card (which is not
>at all unlikely for the high-quality cases for some of the
>games). What does that lead to?
>
>It leads to the CPU being throttled, and while throttling,
>you're going to get a very special code-path selection,
>and not one that is at all dependent on the type of CPU.
>
>Now, if the throttling ends up doing something that
>isn't counted at all (for example, it might halt the CPU
>waiting for an interrupt from the graphics card), you are
>going to get numbers that are still largely "relevant". The
>"misses-per-instruction" is still a valid number.
So I counted non-halted clock cycles. How the CPU behaves when it's told to go idle is unclear - I'd hope it actually halts, in which case that idle time isn't counted, but it may not.
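Here's a minimal sketch of gathering those events programmatically. It uses the Linux perf_event interface purely as an illustration - this is not how the article's numbers were collected - but it shows the relationship between the counters: the hardware cycles event counts only unhalted core cycles, so a CPU that really halts while waiting stops accumulating them, while instructions and cache misses give you the denominator and numerator of MPKI.

/* Illustrative only: counting unhalted cycles, instructions, and
 * cache misses with Linux perf_event (not the tooling behind the
 * article's measurements). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_counter(unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.exclude_kernel = 1;    /* user-mode counts only */
    /* pid = 0, cpu = -1: count this thread on whatever CPU it runs. */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd_cyc  = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int fd_ins  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int fd_miss = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    if (fd_cyc < 0 || fd_ins < 0 || fd_miss < 0) {
        perror("perf_event_open");
        return 1;
    }

    /* ... run the workload to be measured here; the counters are
     * live from the moment perf_event_open() returns ... */

    long long cyc = 0, ins = 0, miss = 0;
    read(fd_cyc, &cyc, sizeof(cyc));
    read(fd_ins, &ins, sizeof(ins));
    read(fd_miss, &miss, sizeof(miss));

    printf("unhalted cycles %lld, instructions %lld, misses %lld\n",
           cyc, ins, miss);
    printf("MPKI = %.2f\n", 1000.0 * (double)miss / (double)ins);
    return 0;
}

A halted CPU stops the cycles counter, but as you note below, a busy-wait loop keeps every one of these counters ticking.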
>But quite often, throttling ends up being a busy loop. Yeah,
>the game may end up doing AI while waiting for graphics,
>and just generally doing something relevant. But it's also
>quite possible that the throttling ends up being some kind
>of busy loop.
>
>Now, the "busy loop" may be a really big one, like the
>Windows idle loop, but it can be a fairly tight one as
>well. Especially for a game that is single-threaded, and
>doesn't care about multi-tasking (and many games do not),
>I can well see the case of "graphics card is busy" being
>a very tight busy loop that just reads a status register.
>
>And if so, your "per instruction" values may be very
>misleading indeed. Depending on just how long you wait for
>the graphics card, your statistics may be swamped not by
>the actual work you do, but by all the dead time.
>
>That's true regardless of whether the loop is large or
>small, but with a small loop the results can be even more
>misleading, especially if looking at things like cache
>misses per instruction - your numbers may be more indicative
>of the loop than of the load you actually want to measure,
>and a tight loop is likely to be more wildly different from
>the real load than a large one.
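To make that tight-loop case concrete, a hypothetical wait loop (purely illustrative - not from any actual game or driver, and gpu_status/GPU_BUSY are invented names) might look like this:

/* Hypothetical GPU wait loop; the register mapping and busy flag
 * are purely illustrative. */
#include <stdint.h>

#define GPU_BUSY 0x1u

/* In a real driver or game this would be an MMIO mapping of the
 * card's status register. */
static volatile uint32_t *gpu_status;

static void wait_for_gpu(void)
{
    /* A few instructions per iteration and no real work: a long
     * wait here can retire far more instructions than the workload
     * being measured. */
    while (*gpu_status & GPU_BUSY)
        ;
}

A loop like that retires only a handful of instructions per iteration and essentially never misses once warm. If the game spends enough time there that the loop retires, say, ten times as many instructions as the real work, the measured misses-per-kilo-instruction for the whole run drops by roughly a factor of ten - even though the workload you actually care about hasn't changed at all.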
>
>Things like that can really make your numbers be
>meaningless, and hard to compare across CPUs (not just
>different architectures, but even with the same
>microarchitecture, just running at different speeds).
>
>It's quite possible that the games David tested had no
>such issues, but in general, I would suggest that if there
>is a possibility of timing-related measurement affecting the
>end result, you should try to test otherwise identical
>machines with different CPU speeds to at least verify that
>timing does not make a huge difference.
>
>So it would be interesting to hear, for example, whether
>the Intel Core 2 numbers (that seemed to be much more
>reliable) were similar when running at 2.93GHz and when
>running at (say) 1.86GHz.
>
>If they are similar, you can have much better confidence in
>the numbers being meaningful. And if they are not, then
>you know that what you're looking at isn't even tied to
>microarchitecture, so comparing two different uarcs using
>the numbers is now much less likely to be interesting.
>
>Think of it as an inherent "error bar". How big is it?
>
>Linus
---------------------------