Chips & Cheese analyzes Tachyum’s Revised Prodigy Architecture

By: --- (---.delete@this.redheron.com), August 29, 2022 10:11 am
Room: Moderated Discussions
anon2 (anon.delete@this.anon.com) on August 28, 2022 3:14 pm wrote:
> --- (---.delete@this.redheron.com) on August 28, 2022 2:24 pm wrote:
> > Anon (no.delete@this.spam.com) on August 27, 2022 9:54 pm wrote:
> > > Nobod (Nobod.delete@this.nospam.com) on August 27, 2022 9:21 am wrote:
> > > > Chips & Cheese analyzes Tachyum’s Revised Prodigy Architecture
> > > >
> > > > The new architecture is more traditional and more likely to work. Unfortunately it is trying to address
> > > > both HPC and datacenter server markets, but isn’t better than the alternatives at either job.
> > >
> > > Did anyone noticed the front end? Some days ago we were discussing about variable length instructions
> > > and too much fetch width, well, Tachyum thinks it is a good idea to have instruction 4 or
> > > 8 bytes wide, at 8 wide decode the instruction fetch is... 128 bytes, yep, 64 bytes would
> > > be enough in the worst case, but they think 128 bytes per cycle is better, I won't say that
> > > what Tachyum thinks matter at all, but I think this was one interesting point.
> > >
> >
> > In and of itself that is not too startling. Apple's Fetch width is probably
> > 16 instructions (at maximum), so 64B, and can straddle two cache lines.
>
> How probable would you say this is? What do you base it on?
>
> > Of course that's hooked up to what's meant to be a non-tiny,
> > impressive core, not a weirdly unbalanced design.
> >
> > It wouldn't be absolutely crazy if you're trying to save energy (I wouldn't roll my eyes if
> > I learned that Apple's small core likewise can Fetch up to 16 instructions a cycle -- might
> > as well get as much useful as you can in one gulp, then sleep Fetch for two or three cycles);
>
> That seems like the opposite of good energy efficiency to me. I doubt the small
> core would do that and also surprised about the big core if that is true of it.
>

You clearly have no clue about the lengths Apple goes to to save energy in Fetch.
The I-cache SRAMs are designed very differently from the D-cache SRAMs because they are accessed in a very different pattern. The TLB lookup and the way are remembered across cycles (as is the tag comparison) so that the TLB lookup and the way "calculation/prediction/whatever" don't have to be repeated.
As always I suspect you are using a mental model of what Fetch looked like on a 2000s Intel processor and assuming that's the world.
Remember that, as a basic example,
(a) Apple Fetch prediction predicts the trace width, not just the trace address, so that superfluous extra instructions are not loaded.
(b) IF you know the trace width, then why not load as much as you can from a cache line? You have to pay the word line and sense amp costs (along with the various logic costs) for the first load, so why not amortize those fixed costs over as much data as possible?
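
To put toy numbers on the amortization argument in (b), here is a minimal sketch. Every energy value in it is invented for illustration (arbitrary units, not measured Apple data), and the function name is hypothetical. It only shows how a fixed per-access cost shrinks per instruction as one access returns more useful instructions.

```python
# Toy model only: all energy numbers are made-up illustrative units.
FIXED_COST = 10.0     # hypothetical per-access cost (word line, sense amps, control logic)
PER_BYTE_COST = 0.1   # hypothetical incremental cost per byte actually read out
INSTR_BYTES = 4       # fixed-width 4-byte instructions


def energy_per_instruction(instrs_per_access: int) -> float:
    """Average fetch energy per instruction when one access returns this many instructions."""
    total = FIXED_COST + PER_BYTE_COST * INSTR_BYTES * instrs_per_access
    return total / instrs_per_access


if __name__ == "__main__":
    for n in (1, 4, 8, 16):
        print(n, round(energy_per_instruction(n), 3))
```

With these assumed constants the per-instruction cost falls steadily as the access gets wider, which is the whole point: as long as the predicted trace width says the instructions are useful, a wide gulp followed by idle cycles is cheaper than several narrow accesses.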

And don't get me started on the cascade that loops propagate through (from loop buffer to [small] trace cache to L0 I-cache, each using slightly more energy, but allowing for slightly more sophisticated loops, and all able to avoid the primary costs of I-cache lookup and much of branch/fetch prediction).

Here's a sample of the lengths they go to to reduce I-fetch energy:

(Some of the *precise* details in the older patents are no longer relevant, but all build on each other)

(2010) https://patents.google.com/patent/US8914580B2 Reducing cache power consumption for sequential accesses

(2013) https://patents.google.com/patent/US9311098B2 Mechanism for reducing cache power consumption using cache way prediction

(2013) https://patents.google.com/patent/US10901484B2 Fetch predition [sic] circuit for reducing power consumption in a processor

(2016) https://patents.google.com/patent/US10203959B1 Subroutine power optimiztion [sic]


BTW, in spite of Apple Fetch being so advanced, there actually remains a lot they can still do! While they were early adopters of Decoupled Fetch (as has, I think, *everyone* nowadays, eventually), they only have "first stage" decoupling, with the pipeline looking like
[Fetch Address Predict] -> [Fetch cache Access] -> Queue of Instructions -> [Decode].
They have not adopted the next step (neither has anyone else yet?), as suggested in Glenn Reinman's thesis, of
[Fetch Address Predict] -> Queue of predicted addresses -> [Fetch cache Access] -> Queue of Instructions -> [Decode].

Doing this would allow them to use a substantially larger L2 Fetch Predictor that took 2 or 3 cycles and covered most of L2, not just L1. Its extra latency would be absorbed invisibly by the first queue, and it could also be used as the basis of some I-prefetching.
My guess is we will see this in time, once serious optimization of Apple Silicon Macs starts and substantially larger I-footprints become a concern for some use cases.
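
For anyone who wants to see why a queue of predicted addresses can hide a slow predictor lookup, here is a rough sketch as a bounded-buffer timing model. It is a toy, not a description of any real core: the function name, the latency lists, and the rule that the predictor only starts on an address once a queue slot is free are all my assumptions. A queue depth of 1 stands in for the "first stage only" design, and a deeper queue stands in for the second pipeline above.

```python
def total_cycles(pred_latency, fetch_latency, queue_depth):
    """Cycles until the last fetch completes.

    pred_latency[i]:  cycles the address predictor needs for fetch group i
    fetch_latency[i]: cycles the fetch cache access needs for fetch group i
    queue_depth:      capacity of the predicted-address queue (>= 1)
    """
    assert queue_depth >= 1 and len(pred_latency) == len(fetch_latency)
    pred_done, fetch_start, fetch_done = [], [], []
    for i, (p_lat, f_lat) in enumerate(zip(pred_latency, fetch_latency)):
        # The predictor handles one address at a time, in order.
        p_start = pred_done[i - 1] if i > 0 else 0
        # It also needs a free queue slot, freed when fetch dequeues group i - depth.
        if i >= queue_depth:
            p_start = max(p_start, fetch_start[i - queue_depth])
        pred_done.append(p_start + p_lat)
        # Fetch needs both the predicted address and the previous access finished.
        f_start = max(pred_done[i], fetch_done[i - 1] if i > 0 else 0)
        fetch_start.append(f_start)
        fetch_done.append(f_start + f_lat)
    return fetch_done[-1]


# Hypothetical trace: one 3-cycle (slow, "L2 predictor") lookup and one slow fetch access.
PRED = [1, 1, 1, 1, 3, 1, 1, 1]
FETCH = [1, 4, 1, 1, 1, 1, 1, 1]

if __name__ == "__main__":
    print("depth 1:", total_cycles(PRED, FETCH, 1))
    print("depth 4:", total_cycles(PRED, FETCH, 4))
```

In this made-up trace the deeper queue lets the predictor run ahead while Fetch is busy with the slow access, so the later 3-cycle lookup no longer shows up as a bubble. Note the caveat built into the model: the queue can only hide predictor latency when the predictor has had a chance to get ahead of Fetch.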