Performance "speed limits"

By: Travis Downs, June 11, 2019 9:26 am
Room: Moderated Discussions
Thanks for all the corrections Paul! It is really amazing how many typos can slide past even with proofreading (frankly I hate proofreading stuff I wrote since I already know what's coming next, so I really start to skim - the only way it works is to come back a year later when I've forgotten most of it).

Paul A. Clayton on June 11, 2019 5:04 am wrote:
> Thank you, that was fun reading. Although it is not a bottleneck, mentioning branch prediction
> accuracy might have been appropriate (probably in the "Out of Order Limits" section).

Yes, I wanted to, but I became tired and it was very long already. Also that topic deserves a more detailed treatment. Still, I will add a small section on it. I think it is a bottleneck: given the fetch -> resolution delay of 14-16 cycles, the count of mispredicts puts an upper bound on the performance: you can't resolve M mispredicts any faster than 14*M cycles, and in fact some code that mispredicts *a lot* (think some types of compression algorithms, some simulations) gets bounded by that.

Where it gets complicated is when you are mispredicting a lot but not enough to get to the 14*M bound: you do substantial work in between the branches. In this case the limit doesn't compose in a straightforward way with the other limits (it is not independent) - it tends to slow you down even if it's not a strict limit, but not always! Maybe fodder for another post...
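
To make the simple case concrete (the hard 14*M bound, not the messy in-between case), here is a minimal sketch. The function names, the mispredict count and the uop count are all made up for illustration; only the 14-cycle penalty comes from the figures above, and the 4 uops/cycle width is an assumed value for a typical recent Intel core.

```python
def mispredict_bound_cycles(mispredicts, penalty_cycles=14):
    """Lower bound on runtime (in cycles) from branch mispredicts alone.

    Assumes every mispredict costs at least penalty_cycles and that the
    mispredicts are resolved one after another, with no overlap.
    """
    return mispredicts * penalty_cycles


def binding_limit(limits):
    """Return (name, cycles) of the largest lower bound, i.e. the one that binds."""
    return max(limits.items(), key=lambda kv: kv[1])


# Hypothetical workload: 1M mispredicts and 40M uops on a 4-wide pipeline.
limits = {
    "mispredicts": mispredict_bound_cycles(1_000_000),
    "pipeline width": 40_000_000 // 4,
}

name, cycles = binding_limit(limits)
print(f"bound by {name}: at least {cycles} cycles")
```

Taking the max only works when the limits are independent, which is exactly what stops being true once you do substantial work between mispredicts.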

> Also, technically,
> "prefetcher friendly access patterns" would not apply to "Memory and Cache Bandwidth" but to "Out
> of Order Limits" since prefetching does not reduce demand bandwidth (though DRAM page access clustering
> can increase achieved bandwidth) but avoids stalls related to window size.

It may interact with "out of order limits" but I don't think it actually belongs in that section, since it is not a part of the OoO machinery. Practically speaking many other limits also interact with the OoO stuff - and that whole section is a bit of an ugly duckling in some ways because it doesn't fit cleanly in the "X things/cycle" model of the other limits... you start to fall into mentally simulating the pipeline (which is great, but not the point of this simpler methodology).

BTW, prefetching, at least on Intel, is definitely also about increasing achieved bandwidth: there aren't enough fill buffers from the L1 to saturate the request buffers from the L2 outwards, so a big part of the prefetcher's role is to fill those L2->DRAM buffers that can't be filled by demand requests. Another bandwidth increasing role is to keep the occupancy time lower by fetching into outer cache levels, so the total sustainable bandwidth is higher.
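
A rough way to see both effects is Little's law: sustained bandwidth is roughly bytes in flight divided by latency. The sketch below is only an illustration; the buffer counts (10 and 16) and the latencies (80 ns and 20 ns) are assumed round numbers, not measurements, and the function name is hypothetical.

```python
def sustained_bandwidth_gbps(outstanding_lines, latency_ns, line_bytes=64):
    """Little's law estimate: bytes in flight / latency.

    Bytes per nanosecond is numerically the same as GB/s.
    """
    return outstanding_lines * line_bytes / latency_ns


# Assumed numbers, for illustration only.
L1_FILL_BUFFERS = 10      # demand misses in flight from the L1
L2_OUT_BUFFERS = 16       # requests in flight from the L2 outwards
DRAM_LATENCY_NS = 80.0    # occupancy time when a line comes from DRAM
L2_HIT_LATENCY_NS = 20.0  # occupancy time when the prefetcher got it into L2

# Demand requests alone are capped by the L1 fill buffers.
demand_only = sustained_bandwidth_gbps(L1_FILL_BUFFERS, DRAM_LATENCY_NS)

# The L2 prefetcher can keep all the outer buffers busy.
outer_buffers_full = sustained_bandwidth_gbps(L2_OUT_BUFFERS, DRAM_LATENCY_NS)

# Shorter fill buffer occupancy raises what the same 10 buffers can sustain.
lower_occupancy = sustained_bandwidth_gbps(L1_FILL_BUFFERS, L2_HIT_LATENCY_NS)

print(demand_only, outer_buffers_full, lower_occupancy)
```

With these made-up numbers the demand-only case is the lowest of the three, which is the point: the fill buffers, not the demand for data, are what cap the achieved bandwidth.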

AFAIK DRAM page access clustering is not a primary function of prefetching (at least the L2 driven prefetching we see in Intel): that would be the job of the memory controller, which orders requests to make best use of open pages. IIRC AMD had memory controller-driven prefetching, where that might apply?

You are right though that it can't reduce the bandwidth demand, but increasing the achieved bandwidth has the same effect in the end. So yeah, this limit is in a bit of a grey area: normally the other limits are hard limits and you need to reduce the use of the limited resource, but in the case of memory bandwidth you can also increase the effective bandwidth. Maybe I'll add a caveat about this.

> "These might look like important values. I even made a table, probably the only table in
> this whole post." might be a little better as "These might look like important values; I
> even made a table." (Technically, the image from Agner Fog's book was a table.), and

As it turns out I made a second table after I wrote that, so I've updated the text.

Thanks again for all the suggestions - I have incorporated all of them.