or patents

By: --- (---.delete@this.redheron.com), August 31, 2022 10:44 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on August 30, 2022 7:29 pm wrote:
> > > Even Sandy Bridge does [Fetch Address Predict] -> Queue of predicted addresses -> [Fetch cache Access]
> > > -> Queue of Instructions -> [Decode], as branch latency doesn't substantially increase as the loop exceeds
> > > L1i capacity. It so happens that Apple using the former method doesn't matter much - they have such a
> > > honking huge L1i that the miss case is likely substantially rarer than on other architectures.
> >
> > What you say about Sandy Bridge is interesting, but with
> > deep enough queues in various places (especially the
> > obvious queue of *basic* Decoupled Fetch, between the cache access and decode) L2 access can be hidden.
> > Not to mention that basic sequential or similar I-prefetching might be doing the job...
> >
> > It is very difficult to be quite sure what buffering queue is hiding latency. At the very least
> > I'd want to be sure that the more obvious candidates suggested above were not in play.
> >
> > My guessed analysis (based on some Apple knowledge, but zero Intel knowledge)
>
> I seriously suggest looking into Intel and AMD architecture. Their CPUs are extensively
> used in supercomputers and other performance critical applications. So, their performance
> characteristics are very well studied, there's plenty of good microarchitecture documentation,
> and the CPUs themselves have extensive performance monitoring facilities.
>
> If you look at the Hot Chips presentation for AMD's Piledriver, you can see [BTB] -> [Prediction Queue] ->
> [ICache] -> [Fetch Queue] -> [Decoders] drawn in the block diagram. Slide bullet points say "decoupled predict

Can you provide a link to these slides or a paper? I can find references to a Hot Chips presentation for Steamroller in 2012 and Bulldozer in 2011, but nothing for Piledriver.

I think what you are referring to is:
https://old.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.940-Bulldozer-White-AMD.pdf

That does seem to be what I am describing, yes. Thanks for the reference, I have included it in my PDF.

> and fetch pipelines" and "prediction-directed instruction prefetch". It's pretty clear what's going on. Sometimes,
> they'll even disclose the size of certain queues. For example, for the fetch queue, AMD's Zen 2 optimization
> manual straight up states that after bytes are fetched from L1i, "the fetch unit sends these bytes to the
> decode unit through a 20 entry Instruction Byte Queue (IBQ), each entry holding 16 instruction bytes....the
> IBQ acts as a decoupling queue between the fetch/branch-predict unit and the decode unit."
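
As a side note for anyone following along: that IBQ-style decoupling is easy to model. Here is a toy sketch; the only number taken from the quote above is the 20-entry x 16-byte shape, and everything else (relative fetch/decode rates, stall placement) is invented for illustration:

```python
from collections import deque

# Toy model of a decoupling fetch queue, shaped like the Zen 2 IBQ
# quoted above (20 entries, 16 instruction bytes each). The relative
# fetch/decode rates and the stall window are invented, not measured.
IBQ_ENTRIES = 20

def decode_stalls(fetch_stall_len):
    """Count cycles decode starves when fetch stalls once, for
    fetch_stall_len cycles (a stand-in for an L1i miss filled from L2)."""
    ibq = deque()
    stalls = 0
    for cycle in range(200):
        # Fetch side: delivers one 16B entry per cycle, except during
        # the stall window, and only while the queue has room.
        if not (100 <= cycle < 100 + fetch_stall_len):
            if len(ibq) < IBQ_ENTRIES:
                ibq.append("16B")
        # Decode side: drains one entry every other cycle, so in steady
        # state the queue runs full.
        if cycle % 2 == 0:
            if ibq:
                ibq.popleft()
            else:
                stalls += 1
    return stalls
```

In this toy setup a full 20-entry queue covers up to 40 cycles of fetch stall: a 30-cycle stall causes zero decode starvation, while a 60-cycle stall overruns the queue and decode starves.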
>
> Hearing exactly what techniques are being used gets you a lot farther than looking at patents using
> the densest language possible to describe what people have already been doing for the past 20 years
> (if not past 30 years) in an attempt to get past an overworked patent office. Then, using those patents
> to claim some technique's being used in a largely undocumented architecture is a tad less useful than
> saying nothing at all. If you do want to claim XYZ architecture does something, you need to back your
> claim with test results if the designers didn't straight up say they were doing it.

That's great. And when you send me the links to the talks and papers where Apple has disclosed this info, I'll be all over them. Until then, what do you suggest I do?

It's simply not true that everything can be found in papers and slides. Most recently I was asking about that IBM "Tagged Orientation Predictor" mentioned (but not described) in a slide. Rayla was kind enough to send a patent reference which makes the idea very clear.

Yes patents are an imperfect data source, there is no-one on earth who is denying that. But a certain group of people seem to find it a personal affront that some of us can extract data from them fairly easily, whereas others do not find this easy.
I don't get this. It's like being angry that someone else can speak Chinese and you can't – which I guess, come to think of it, does make a lot of people angry...


> > is that Apple have split I-Prefetch into two distinct parts
> > - one is "long range" prefetch, which is triggered and governed essentially by calls and
> > returns. Call history is used to drive this, but it will not kick in simply for very long
> > single functions.
>
> I don't see why call history is particularly different. In the end
> it's just a fetch target provided by the BPU. It can come via
> - incrementing off the last fetch pointer, if no taken branch is predicted to be there
> - the regular BTB, if there's a predicted taken branch
> - the return stack, if there's a predicted taken branch that's a return
> - a separate indirect target array, if there's an indirect branch (could
> be the regular BTB too depending on the exact microarchitecture)
>
> The fetch unit doesn't particularly care how the BPU came up with the address. It
> just gets an address to look up in the L1i and possibly send a request to L2 for.

I can't make head or tail of this. I said Prefetch, you are discussing Fetch.
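
To be concrete about the distinction: what Chester lists above is demand Fetch target selection, something like the sketch below (every name in it is mine, purely illustrative). Prefetch, which is what I was discussing, is about issuing fill requests ahead of this point.

```python
# Illustrative-only sketch of next-fetch-address selection; all names
# are invented. The point of the quoted list: the fetch unit simply
# consumes the resulting address, regardless of which predictor
# structure (fall-through, BTB, return stack, indirect array) produced it.
def next_fetch_address(pc, fetch_width, btb, return_stack, indirect_targets):
    entry = btb.get(pc)                 # predicted-taken branch in this block?
    if entry is None:
        return pc + fetch_width         # fall through: increment fetch pointer
    if entry["kind"] == "return":
        return return_stack.pop()       # target comes from the return stack
    if entry["kind"] == "indirect":
        return indirect_targets[pc]     # separate indirect target array
    return entry["target"]              # ordinary BTB target
```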


> Both would help hide L1i miss latency, in different ways.
> - Queue between BPU and fetch: allows fetch unit to queue up more L1i fill requests, following
> the instruction flow through taken branches. If deep enough, it can queue up enough requests
> to hide L2 latency, at which point taken branch throughput is only limited by BTB latency (you
> don't need the branch's instruction bytes to show up at the decoder to know where it goes).

Yes, that (multiple misses lined up) is a good point that I should add to my list.
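
The effect is easy to see in a toy model. Everything here is invented for illustration (the 12-cycle L2 latency included); the point is only the overlap:

```python
# Toy model of the point above: a deep queue of predicted fetch
# addresses lets the fetch unit keep several L1i fill requests in
# flight at once, overlapping their L2 latency. All numbers invented.
L2_LATENCY = 12

def cycles_to_fetch(n_blocks, max_outstanding):
    """Cycles to fetch n_blocks blocks that all miss L1i, with at most
    max_outstanding fills in flight at once."""
    cycle = issued = done = 0
    in_flight = []                      # completion cycles of pending fills
    while done < n_blocks:
        # Retire any fills whose data has arrived.
        still_pending = [t for t in in_flight if t > cycle]
        done += len(in_flight) - len(still_pending)
        in_flight = still_pending
        # The prediction queue supplies the next address, so fetch can
        # issue one more fill per cycle, up to the outstanding limit.
        if issued < n_blocks and len(in_flight) < max_outstanding:
            in_flight.append(cycle + L2_LATENCY)
            issued += 1
        cycle += 1
    return cycle
```

With only one fill in flight the misses serialize at roughly n times the L2 latency; with eight in flight the latencies overlap and the total approaches a single L2 latency plus the issue and drain time.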