or patents

By: Chester (lamchester.delete@this.gmail.com), August 29, 2022 11:37 pm
Room: Moderated Discussions
> POWER9 seems to do similar. I'm sure the PA6T PowerPC heritage link is purely coincidental. From OpenPOWER user
> manual: "As instructions are fetched, they are scanned for branches. Up to eight branches are simultaneously
> processed by the branch prediction logic that predicts both the direction and/or target of the branches, depending
> on the branch type." It does have some kind of L0 predictor (BTAC) which does predict addresses as well though,
> unclear how works but it is very fast so not like huge multi level target buffer of x86 CPUs.

Seems common to have both. For example, Cortex A72 seems to have a decoupled 64 entry L1 BTB. After that, it seems to have a second level BTB coupled to the L1i that can track up to 4096 branch targets, as long as the branches don't spill out of the 48 KB L1i. So if there's a branch every 16 bytes, it can track 3072 targets. Or 768 if there's a branch every 64 bytes, and so on.
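To make the arithmetic concrete, here's a rough sketch of how coupled BTB capacity scales with branch density. It assumes branches are evenly spaced through the L1i, which real code won't do, and the function name is just made up for illustration. The 48 KB and 4096 figures are the ones mentioned above.

```python
L1I_BYTES = 48 * 1024   # A72 L1i capacity, as stated above
MAX_TARGETS = 4096      # upper bound on second level BTB targets, as stated above


def trackable_targets(bytes_per_branch: int) -> int:
    """Targets the L1i-coupled BTB can hold, assuming evenly spaced branches."""
    if bytes_per_branch <= 0:
        raise ValueError("bytes_per_branch must be positive")
    # Only branches that are resident in L1i can have their targets tracked.
    branches_in_l1i = L1I_BYTES // bytes_per_branch
    return min(MAX_TARGETS, branches_in_l1i)


if __name__ == "__main__":
    for spacing in (8, 16, 64):
        print(f"one branch per {spacing} bytes: {trackable_targets(spacing)} targets")
```

With one branch every 16 bytes the L1i size is the limit (3072), while denser code would hit the 4096 target cap first.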

If a branch target comes out of the L1 BTB, there's a 1 cycle penalty. If it comes out of the second level BTB/L1i, there's a 2 cycle penalty.
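As a back-of-the-envelope illustration, the average taken branch penalty is a weighted mix of those two numbers. This sketch assumes every taken branch gets its target from one of the two levels (it ignores the case where neither has it), and the hit fraction is a hypothetical input, not a measured value.

```python
L1_BTB_PENALTY = 1   # cycles, target supplied by the 64 entry L1 BTB
L2_BTB_PENALTY = 2   # cycles, target supplied by the second level BTB/L1i


def avg_taken_branch_penalty(l1_btb_hit_fraction: float) -> float:
    """Average penalty in cycles per taken branch for a given L1 BTB hit fraction."""
    if not 0.0 <= l1_btb_hit_fraction <= 1.0:
        raise ValueError("hit fraction must be between 0 and 1")
    # Assumption: whatever misses the L1 BTB is covered by the second level.
    l2_fraction = 1.0 - l1_btb_hit_fraction
    return l1_btb_hit_fraction * L1_BTB_PENALTY + l2_fraction * L2_BTB_PENALTY


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.0):
        print(f"L1 BTB hit fraction {fraction}: {avg_taken_branch_penalty(fraction)} cycles")
```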

> It makes sense that Apple really likes a very large I$ if they are doing coupled fetch. And coupled
> fetch has real benefits, you don't have to predict the presence of a branch, and you don't have
> to predict target for direct branches (which should be the large majority).

You still have to predict the presence of a branch and predict that it's going to be taken. The advantage is you don't have to index into a separate BTB structure to fetch the branch target if you predict it's taken. You get the predicted branch target along with the L1i fetch, no other lookup needed.

The downside is you have no clue where to go next once you miss L1i. Branches are pretty common, so if there is a taken branch coming up, you won't be able to follow it. With a decoupled BTB, you can still index into that even if your L1i fetch missed, and prefetch far enough to cover L2 latency - assuming your branch predictor is reasonably accurate.

> This allow equivalent
> accuracy with smaller structures. Downside being you don't get I$ prefetch prediction from the same
> structure, and you couldn't avoid taken branch bubbles with high frequency. But that doesn't mean
> you can't have a prefetch prediction from other structures, as perhaps POWER9 does.

Yeah, branch targets stored alongside L1i (coupled BTB) tend to take more than one cycle to access.