By: Marcus (m.delete@this.bitsnbites.eu), August 6, 2022 4:36 am
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on August 5, 2022 1:17 pm wrote:
> Brett (ggtgp.delete@this.yahoo.com) on August 3, 2022 4:31 pm wrote:
> > --- (---.delete@this.redheron.com) on August 3, 2022 2:55 pm wrote:
> > > Adrian (a.delete@this.acm.org) on August 3, 2022 11:33 am wrote:
> > >
> > > What is the problem you are trying to solve? You want to shrink the average size
> > > of an instruction from 4 bytes to, I don't know, 3.2 bytes? Why is that worth doing?
> > > Or, if worth doing, not worth doing via something like CodePack instead?
> > >
> > > If you insist on stacks (even 8 stacks) you give up on register renaming and all that implies
> > > for extreme OoO (hundreds of instructions in size), extreme width, and extreme speculation.
> > > And if your answer is "well I'll implement the top few registers of the stacks as renamed
> > > registers", well, then, what exactly are we simplifying by not using registers directly?
> > >
> > > I understand the impulse; we're all irritated by the ISA bits that are burned by having
> > > to specify each of multiple registers. But ultimately, that's what computing IS -- specifying
> > > moving data. In a way, your viewpoint is like people still insisting that computing is
> > > about adds and multiplies, and everything else is "overhead". NO, it's NOT!
> > > There is some specialized computing that is about adds and multiplies, yes. But
> > > general computing is as much or more about moving data around (with everything
> > > that implies about specifying addresses) as it is about adds and multiplies.
> > >
> > > Instruction Fetch and Prefetch are SOLVED problems. We know the numbers.
> > > Even with a 32KB cache, essentially perfect Prefetch or equivalents (like a perfect L1I) are
> > > worth about 25% (details vary depending on exact details of core width, L2 parameters, etc).
> > > Existing practical prefetchers like RDIP get you about 20%.
> > > Experimental prefetchers like D-JOLT (easy to implement
> > > if, like Apple, you already have an RDIP-like infrastructure)
> > > get you a few percent closer to perfection. Even
> > > better is something like an Entangled I-prefetcher, but that requires building a new infrastructure.
> > > Point is we know how to avoid basically *all* the costs of I-prefetch
> > > - Decoupled Fetch (check for Apple, ARM Ltd, and I expect recent x86)
> > > - Decoupled Address generation (next step, no-one is doing
> > > this yet, not even Apple, but it's "not hard" (hah!))
> > > - something like D-JOLT to handle "long-distance" Prefetch.
> > >
> > > What remains a problem is not the presence (or not) of instructions in the L1I; it is
> > > the presence (or not) of control flow data relevant to those new instructions...
> > > That's why IBM z/ devotes utterly insane amounts of area to their L2 BTB. And it's why I think Decoupled
> > > Address generation is the obvious next step for Apple and then everyone else (Decoupled Address generation
> > > allows for the occasional access of an L2 BTB taking 2 or 3 cycles rather than single cycle).
> > > But making the ISA denser helps with none of this!
> >
> > I would make this proposal 8 address registers and 4 stacks, as that is what is going to happen.
> > The goal is to split the rename in half, addressing and data, and this also would help in read/write ports.
> >
> > The ghost of the 68000 may live again. ;)
> > Except not sucky and brain damaged like the 68000. ;)
>
> How the New68k architecture will first show up as an x86 extension. ;)
>
Have you seen the 68080? It's a 64-bit 4-wide OoO 68k CPU with more registers and instructions than the 32-bit 68k line (68000-68060). It's implemented in an FPGA. Pretty impressive IMO.
http://www.apollo-core.com/index.htm?page=coding&tl=1
/Marcus