or patents

By: Chester (lamchester.delete@this.gmail.com), August 30, 2022 7:29 pm
Room: Moderated Discussions
> > Even Sandy Bridge does [Fetch Address Predict] -> Queue of predicted addresses -> [Fetch cache Access]
> > -> Queue of Instructions -> [Decode], as branch latency doesn't substantially increase as the loop exceeds
> > L1i capacity. It so happens that Apple using the former method doesn't matter much - they have such a
> > honking huge L1i that the miss case is likely substantially rarer than on other architectures.
>
> What you say about Sandy Bridge is interesting, but with deep enough queues in various places (especially the
> obvious queue of *basic* Decoupled Fetch, between the cache access and decode) L2 access can be hidden.
> Not to mention that basic sequential or similar I-prefetching might be doing the job...
>
> It is very difficult to be quite sure what buffering queue is hiding latency. At the very least
> I'd want to be sure that the more obvious candidates suggested above were not in play.
>
> My guessed analysis (based on some Apple knowledge, but zero Intel knowledge)

I seriously suggest looking into Intel and AMD architecture. Their CPUs are extensively used in supercomputers and other performance-critical applications. So, their performance characteristics are very well studied, there's plenty of good microarchitecture documentation, and the CPUs themselves have extensive performance monitoring facilities.

If you look at the Hot Chips presentation for AMD's Piledriver, you can see [BTB] -> [Prediction Queue] -> [ICache] -> [Fetch Queue] -> [Decoders] drawn in the block diagram. Slide bullet points say "decoupled predict and fetch pipelines" and "prediction-directed instruction prefetch". It's pretty clear what's going on. Sometimes, they'll even disclose the size of certain queues. For example, for the fetch queue, AMD's Zen 2 optimization manual straight up states that after bytes are fetched from L1i, "the fetch unit sends these bytes to the decode unit through a 20 entry Instruction Byte Queue (IBQ), each entry holding 16 instruction bytes... the IBQ acts as a decoupling queue between the fetch/branch-predict unit and the decode unit."
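To make the decoupling concrete, here's a toy model of such a queue. The 20-entry / 16-bytes-per-entry sizes are the ones AMD discloses for Zen 2's IBQ; everything else (the class, the stall behavior) is a hypothetical sketch, not AMD's actual design:

```python
from collections import deque

# Sizes from AMD's Zen 2 optimization manual; the rest is illustrative.
IBQ_ENTRIES = 20
BYTES_PER_ENTRY = 16

class InstructionByteQueue:
    """Toy decoupling queue between the fetch unit and the decoders."""
    def __init__(self):
        self.entries = deque()

    def push(self, chunk: bytes) -> bool:
        """Fetch side: enqueue up to 16 instruction bytes; False = fetch stalls."""
        if len(self.entries) >= IBQ_ENTRIES:
            return False          # queue full: fetch waits, decode keeps draining
        self.entries.append(chunk[:BYTES_PER_ENTRY])
        return True

    def pop(self):
        """Decode side: dequeue one entry, or None if empty (decode stalls)."""
        return self.entries.popleft() if self.entries else None

ibq = InstructionByteQueue()
for i in range(25):               # fetch tries to run ahead of decode
    ibq.push(bytes([i] * 16))
print(len(ibq.entries))           # fetch got 20 entries ahead before stalling
```

The point of the structure is exactly what the manual says: fetch can run ahead and absorb decode stalls, and decode can keep going when fetch briefly stalls.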

Hearing exactly which techniques are being used gets you a lot farther than reading patents, which use the densest language possible to describe what people have already been doing for the past 20 years (if not the past 30) in an attempt to get past an overworked patent office. Using those patents to claim some technique is present in a largely undocumented architecture is a tad less useful than saying nothing at all. If you do want to claim XYZ architecture does something, you need to back your claim with test results, unless the designers straight up said they were doing it.

> is that Apple have split I-Prefetch into two distinct parts
> - one is "long range" prefetch, which is triggered and governed essentially by calls and
> returns. Call history is used to drive this, but it will not kick in simply for very long
> single functions.

I don't see why call history is particularly different. In the end it's just a fetch target provided by the BPU. It can come via
- incrementing off the last fetch pointer, if no taken branch is predicted to be there
- the regular BTB, if there's a predicted taken branch
- the return stack, if there's a predicted taken branch that's a return
- a separate indirect target array, if there's an indirect branch (could be the regular BTB too depending on the exact microarchitecture)

The fetch unit doesn't particularly care how the BPU came up with the address. It just gets an address to look up in the L1i and possibly send a request to L2 for.
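The selection above can be sketched as a single function. The structure names (btb, return_stack, indirect_targets) and the 16-byte fetch block size are made up for illustration; real predictors are far more involved:

```python
FETCH_BYTES = 16   # assumed fetch block size

def next_fetch_address(pc, btb, return_stack, indirect_targets):
    """Hypothetical sketch of how a BPU might pick the next fetch address."""
    pred = btb.get(pc)                       # predicted taken branch in this block?
    if pred is None:
        return pc + FETCH_BYTES              # sequential: increment last fetch pointer
    kind, target = pred
    if kind == "return":
        return return_stack.pop()            # return address stack
    if kind == "indirect":
        return indirect_targets.get(pc, target)  # separate indirect target array
    return target                            # direct branch: BTB-supplied target

# The fetch unit just consumes the address, whatever its source:
btb = {0x1000: ("direct", 0x2000), 0x2000: ("return", None)}
stack = [0x1010]
print(hex(next_fetch_address(0x1000, btb, stack, {})))   # 0x2000, via BTB
print(hex(next_fetch_address(0x2000, btb, stack, {})))   # 0x1010, via return stack
print(hex(next_fetch_address(0x3000, btb, stack, {})))   # 0x3010, sequential
```

Whichever path produced the address, the output looks the same downstream, which is why a call/return-driven target isn't fundamentally special from the fetch unit's point of view.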

> In the academic literature this is similar to something called RDIP.
> - the second looks something like basic FDIP, driven by Fetch decoupled from Decode, so that Fetch
> executes a long way ahead of Decode and the queue between Fetch and Decode can cover a miss to
> L2. There probably is some slight augmentation here (like right after an RDIP jump, not just the
> target line but a few successor lines are also pulled in, but I know nothing of this)
> What this ultimately means is that, if the long branch loops are basically sequential
> I-cache accesses, an old-school sequential I-cache prefetcher will behave very differently
> from Apple's I-prefetcher because Apple does not seem to have such a sequential I-prefetcher
> generically (though probably does immediately after a long distance jump).
>
>
> The nice thing about the second decoupling queue is that it allows for more substantial hiding of more "random"
> bouncing around inside a footprint too large for L1 but fitting in L2, which is why I suspect Apple will gravitate
> towards it in time, but has not yet prioritized it.

Both would help hide L1i miss latency, in different ways.
- Queue between BPU and fetch: allows the fetch unit to queue up more L1i fill requests, following the instruction flow through taken branches. If deep enough, it can queue up enough requests to hide L2 latency, at which point taken branch throughput is only limited by BTB latency (you don't need to wait for the branch's instruction bytes to show up at the decoder to know where it goes).
- Queue between fetch and decode: insulates the decoder from fetch stalls. If you have more fetch than decode bandwidth (often the case), queued up bytes can keep feeding decode.
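The first effect (a deeper BPU-to-fetch queue hiding L1i miss latency) can be shown with a tiny cycle model. All the numbers here (miss rate, L2 latency, one block consumed per cycle) are assumed for illustration and don't describe any real chip:

```python
import random

def simulate(queue_depth, n_blocks=2000, miss_rate=0.1, l2_latency=14, seed=1):
    """Toy model: the BPU feeds one predicted fetch address per cycle into a
    queue of the given depth; fetch issues fill requests for queued addresses,
    and a block that misses L1i becomes ready l2_latency cycles after its
    request is issued. Decode consumes one ready block per cycle, in order.
    Returns total cycles to consume all blocks."""
    rng = random.Random(seed)
    misses = [rng.random() < miss_rate for _ in range(n_blocks)]
    ready_at = [None] * n_blocks   # cycle at which each block's bytes arrive
    issued = 0                     # blocks whose fill request has been issued
    consumed = 0                   # blocks decode has taken
    cycle = 0
    while consumed < n_blocks:
        # fetch issues requests for everything sitting in the prediction queue
        while issued < n_blocks and issued - consumed < queue_depth:
            ready_at[issued] = cycle + (l2_latency if misses[issued] else 1)
            issued += 1
        # decode consumes the next block once its bytes have arrived
        if ready_at[consumed] is not None and ready_at[consumed] <= cycle:
            consumed += 1
        cycle += 1
    return cycle

shallow = simulate(queue_depth=1)
deep = simulate(queue_depth=32)
print(shallow, deep)   # the deep queue overlaps the L2 hits, the shallow one can't
```

With a one-deep queue every miss is exposed serially; with a deep queue the fill requests are issued far enough ahead that most of the L2 latency overlaps with decode consuming earlier blocks.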

> Of course such a scheme would benefit smaller I-caches even
> more, but in the literature I don't get the feeling that any designs have implemented this yet.

Yeah, which is why AMD/Intel tend to benefit more from it. They can sustain pretty high IPC even when running code out of L2. Or in AMD's case, even out of L3. Apple doesn't need to do that because they have a much bigger L1i, and don't have to care as much about the L1i miss case. Apple's design is also better for energy efficiency, since fetching code from L2 involves data movement over a more power hungry path.