A final roundup of what we've learned

By: --- (---.delete@this.redheron.com), September 3, 2022 10:13 am
Room: Moderated Discussions
Overall this has been a somewhat frustrating experience for both of us. I think, even with goodwill on both sides, we've been frustrated by
- a lack of common terminology
- a lack of common background, so that we've each taken for granted things that seemed obvious to us (but were not), and then been forced to treat the other party as a child by trying to explain in baby language what we mean, and no one likes being spoken to in baby language.

Even so, I think the whole experience has been worthwhile. I've been forced to think much more clearly about the Cloudflare graphs, I've learned the sorts of things I wanted to learn about some other CPU families, and I have modified my BTB discussion quite substantially.
During this interaction I've written about a bunch of things (eg prefetching, or how the M1 BTB is populated) which are interesting and tangentially related to the graphs, but below I'm going to include my final analysis of the Cloudflare graphs.

The TL;DR is that
- Apple probably has a two level BTB (Maynard wrong)
- The 2nd level BTB is at least 8192 entries in size and this size has nothing to do with the second jump in the M1 graphs (Cloudflare wrong)
- At the low end for both Apple and AMD we see the results of aliasing indices which give the L1 BTBs a substantially smaller effective size when the successive traces are power-of-2 separated (neither Maynard nor Cloudflare drew attention to this important effect)

If you want to read along, open the Cloudflare article and look at both the M1 predicted jmp graph and the AMD EPYC 7642 predicted jmp graph. I'll try to include those below - who knows if the tags will work!

https://blog.cloudflare.com/content/images/2021/05/pasted-image-0--6-.png
https://blog.cloudflare.com/content/images/2021/05/pasted-image-0--15-.png

At this point, maybe it's worth discussing the Cloudflare results:
https://blog.cloudflare.com/branch-predictor/
I no longer have access to an M1, so I cannot independently verify these but assuming the code is what I think it is, let's proceed.

Note the big jumps for Apple at 3072 and 6144, and for AMD at 4096.
In both Apple cases (3072 traces that are each 64B long, and 6144 traces that are each 32B long) the total comes to 192KiB. But the AMD curves all jump at 4096.
To me this suggests that
- AMD's BTB (second level BTB) holds 4096 entries. Doesn't matter how long the traces are, things go bad once we exceed the BTB capacity and cannot accurately predict the next trace to fetch.
- 4096*64B=256KiB. This is much larger than the AMD L1I size. So AMD are successfully prefetching I-lines into the L1I. This could result from a simple sequential I-prefetcher (notice that three successive I-cache lines have been touched and prefetch the next one, like a data sequential prefetcher); or it could result from FDIP.
Note the block size/trace size is 64B but contains one jump, so one cache line (64B) is processed in what looks like about 3.5 cycles.
That means Address Prediction can possibly run ahead of Fetch and Decode (eg if some BTB accesses hit the faster BTB0 and BTB1, with the remaining accesses taking 3 cycles in BTB2), and runahead address prediction can be used for Fetch far in advance of when the instructions are decoded (so acting like a prefetch, like FDIP).

What about Apple? The Apple pattern is very different in that, at the high end, we start paying more cycles at the point where we exceed L1I, ie based on the length of the traces, not the number of traces. This suggests that
- the BTB infrastructure can hold at least 8192 traces
- for this type of code pattern Apple is not prefetching a line into the L1I, so that as soon as we exceed L1I size we pay the cost of an L2 access.
This in turn implies that Apple has neither a sequential I-fetch predictor (unsurprising; this is not a great design for an I-prefetcher) nor a fully decoupled fetch (which would be able to queue up the relevant fetch addresses in advance of cache access and thus have them act as FDIP).
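
The footprint arithmetic behind the high-end jumps can be checked with a few lines of Python. This is only a sketch of the reasoning above; the L1I capacities (192KiB for the M1 performance core, 32KiB for Zen 2) are values I am assuming rather than numbers taken from the Cloudflare article, and the helper name is made up.

```python
KIB = 1024

def footprint_kib(n_blocks, block_bytes):
    """Total code footprint of the test loop, in KiB."""
    return n_blocks * block_bytes / KIB

# Assumed L1I capacities (not stated in the post itself).
M1_L1I_KIB = 192
ZEN2_L1I_KIB = 32

apple_64 = footprint_kib(3072, 64)  # Apple jump with 64B blocks
apple_32 = footprint_kib(6144, 32)  # Apple jump with 32B blocks
amd_64 = footprint_kib(4096, 64)    # AMD jump with 64B blocks

print(f"Apple 3072 x 64B = {apple_64:.0f}KiB (L1I assumed {M1_L1I_KIB}KiB)")
print(f"Apple 6144 x 32B = {apple_32:.0f}KiB (L1I assumed {M1_L1I_KIB}KiB)")
print(f"AMD   4096 x 64B = {amd_64:.0f}KiB (L1I assumed {ZEN2_L1I_KIB}KiB)")
```

Both Apple jumps land on the same byte footprint, while the AMD jump is at a fixed block count whose footprint is far beyond the L1I.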

The second difference between Apple and AMD is that AMD's cost in cycles (ie cost per trace) is the same at the high end, regardless of the trace length. This again tells us that the cost is a trace-related cost, eg the cost of a mispredict then flush.
For Apple the cost at the high end is different for larger traces vs shorter traces. This tells us the cost is associated with the amount of data being moved.
Assume Apple is moving two lines of data from L2 to L1I, and the L2 access delay is about 14 cycles. Now assume 64B blocks. Then that 14 cycles is amortized over two fetches (one that missed to L2, one that was able to hit the successor line that was next-line-prefetched into L1I) and so the average cost per block is 14/2+ some BTB overhead.
Likewise the same case but with 32B blocks. Now that 14 cycles is amortized over four fetches, so the average cost per block is 14/4+ some BTB overhead. That essentially matches what we see.
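
As a sanity check on that amortization, here is a minimal sketch. The 14-cycle L2 delay and the two-lines-per-miss assumption come from the paragraph above; the 64B line size and the function name are my own assumptions, and the BTB overhead is left out.

```python
def avg_l2_cost_per_block(block_bytes, l2_latency=14, lines_per_miss=2, line_bytes=64):
    """Average L2 penalty per block when one miss brings in `lines_per_miss` lines."""
    blocks_per_miss = lines_per_miss * line_bytes // block_bytes
    return l2_latency / blocks_per_miss

cost_64 = avg_l2_cost_per_block(64)  # 14 cycles spread over 2 blocks
cost_32 = avg_l2_cost_per_block(32)  # 14 cycles spread over 4 blocks

print(f"64B blocks: {cost_64} cycles + BTB overhead")
print(f"32B blocks: {cost_32} cycles + BTB overhead")
```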

The second interesting question is what happens at the lower end.
For AMD we know that Zen2 has an L1 BTB of 512 entries and an L2 BTB of 7168 entries, according to https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram
I'm not sure why the 7168 BTB2 seems to give us an effective size (~4000) that's so low; maybe we are seeing terrible aliasing effects because the blocks all begin at power-of-2 address offsets, so we land up using only half the indices?
The same sort of thing seems to hold for the L1 BTB; we're hitting that and getting a throughput that looks like 1+some overhead, but only for about 256, not the full 512 entries.

Once again the Apple story is very different. Once again there isn't an obvious number of traces (ie number of blocks, to use the terminology of the graph) at which we jump, rather we jump from 1 cycle to 3 cycles when the total size of the loop (number of blocks * block size) exceeds 4096B.

So an initial hypothesis is that this is again a cache size issue.
As we will eventually see, again in great detail, Apple has a tiered set of Instruction "caches" depending on the exact details of the code. The simplest short loops are held in a loop buffer, while more complex loops (longer and/or requiring branch prediction because the branch outcomes are not constant), but which fit, are held in an L0I.

Suppose we are seeing performance within this L0I vs performance in the L1I.
+ This fits with the effect being linked to a constant loop size regardless of the block size.
+ On the other hand, I can't see any obvious reason why performance should drop when a loop exceeds the size of the L0I! The L0I loop buffer exists for energy reasons, not performance reasons. And we don't see any sort of data movement penalty (cost dependent on block length) like we saw when M1 was missing L1I and had to go out to L2.
So I think this hypothesis fails.

A different interpretation of the low end of the Cloudflare graph is that it reflects a cleaner version of what we saw with the AMD results: aliasing of address indices into the table, ie as the blocks get larger successive block addresses match at intermediate address bits, and so we use fewer and fewer sets.
This sort of thing can be fixed with a fancier BTB indexing hash, but maybe there isn't enough time for such a hash?
If this were my code, I'd try a few runs with different block sizes, like say 2^n-1, to try to avoid this sort of aliasing.
If this interpretation is correct, we can summarize that:
+ Apple's primary BTB (we will see that there are other specialist BTBs) is 1024 entries in size, and uses a very simple indexing hash that is easily broken by powers-of-two addresses (which admittedly are extremely fake and not present in real life!). AMD likewise probably doesn't care about such fake address aliasing possibilities.
+ with an L2 BTB that holds at least 8192 entries.
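
To make the aliasing interpretation concrete, here is a toy model. It assumes a direct-mapped 1024-entry table indexed by the address bits just above the 4B instruction alignment; that index function is purely hypothetical (nothing here is known about Apple's real hash), and the function name is made up. It counts how many distinct entries a loop can touch when its blocks sit at a fixed stride.

```python
def usable_entries(stride_bytes, entries=1024, insn_bytes=4, n_blocks=8192):
    """Distinct table indices touched by blocks placed every `stride_bytes`.

    Hypothetical index: drop the alignment bits, keep the next log2(entries) bits.
    """
    return len({(i * stride_bytes // insn_bytes) % entries for i in range(n_blocks)})

# Power-of-two block sizes, plus 60B (15 instructions, ie 2^n-1 instructions).
usable = {stride: usable_entries(stride) for stride in (4, 8, 16, 32, 64, 60)}

for stride, n in usable.items():
    print(f"{stride:2d}B blocks: {n:4d} usable entries, {n * stride}B of code before aliasing")
```

Under this model every power-of-two block size runs out of distinct indices at the same 4096B of code, which is the shape of the low-end M1 curves, while a block of 15 instructions reaches all 1024 entries.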

It is striking that Apple never mentions an L2 BTB in their BTB-related patents, but it's possible that this was considered an irrelevant distraction; or that the split from a single level medium-sized BTB to a larger, but split, BTB was implemented after the most recent BTB patent I have seen (2017).

One final question you may wonder about is why AMD, at the high end, was only able to use half their L2 BTB. If this is indexing related, why didn't we see the same sort of staggered pattern of graph jumps, like we saw at the low-end for both Apple and AMD?
My guess is that
- some sort of hash of address bits is being performed for the L2 BTB (we've already accepted that that doesn't have to be single cycle, probably for both Apple and AMD) but
- AMD is passing through the second lowest address bit unchanged and unhashed as the lowest index bit (perhaps used as an energy saver, so only one of two halves of a tag SRAM is woken up), and since that second lowest bit never changes (our addresses are always four-aligned) we only have access to half the indices. In more realistic code, fetch addresses would probably be randomly scattered at this 16bit granularity, so this would not be a real hardship to performance.
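
The guess in the last bullet can be illustrated with a toy index function. Everything in it is hypothetical: the 12-bit index width, the multiplicative hash and the function name are placeholders of mine, not anything known about AMD's design. The only property that matters is that address bit 1 is passed through unhashed as the lowest index bit.

```python
def l2_btb_index(addr, index_bits=12):
    """Toy L2 BTB index: hashed upper bits, address bit 1 passed through as bit 0."""
    hashed = ((addr >> 2) * 0x9E3779B1) & 0xFFFFFFFF   # placeholder 32-bit hash
    upper = hashed >> (32 - (index_bits - 1))          # index_bits - 1 hashed bits
    low = (addr >> 1) & 1                              # unhashed pass-through bit
    return (upper << 1) | low

# Branch addresses that are always 4-aligned, as in the test loop.
aligned = {l2_btb_index(a) for a in range(0, 1 << 18, 4)}
# Addresses scattered at 2-byte granularity, closer to ordinary x86 code.
scattered = {l2_btb_index(a) for a in range(0, 1 << 18, 2)}

print(f"4-aligned addresses reach {len(aligned)} of {1 << 12} indices")
print(f"2-byte-scattered addresses reach {len(scattered)} of {1 << 12} indices")
```

With 4-aligned addresses the pass-through bit is always zero, so only even indices are ever selected and at most half the table is reachable, whatever the hash does with the upper bits.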