By: Brett (ggtgp.delete@this.yahoo.com), August 10, 2022 11:59 pm
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on August 5, 2022 1:17 pm wrote:
> Brett (ggtgp.delete@this.yahoo.com) on August 3, 2022 4:31 pm wrote:
> > --- (---.delete@this.redheron.com) on August 3, 2022 2:55 pm wrote:
> > > Adrian (a.delete@this.acm.org) on August 3, 2022 11:33 am wrote:
> > >
> > > What is the problem you are trying to solve? You want to shrink the average size
> > > of an instruction from 4 bytes to, I don't know, 3.2 bytes? Why is that worth doing?
> > > Or, if worth doing, not worth doing via something like CodePack instead?
> > >
> > > If you insist on stacks (even 8 stacks) you give up on register renaming and all that implies
> > > for extreme OoO (hundreds of instructions in size), extreme width, and extreme speculation.
> > > And if your answer is "well I'll implement the top few registers of the stacks as renamed
> > > registers", well, then, what exactly are we simplifying by not using registers directly?
> > >
> > > I understand the impulse; we're all irritated by the ISA bits that are burned by having
> > > to specify each of multiple registers. But ultimately, that's what computing IS -- specifying
> > > moving data. In a way, your viewpoint is like people still insisting that computing is
> > > about adds and multiplies, and everything else is "overhead". NO, it's NOT!
> > > There is some specialized computing that is about adds and multiplies, yes. But
> > > general computing is as much or more about moving data around (with everything
> > > that implies about specifying addresses) as it is about adds and multiplies.
> > >
> > > Instruction Fetch and Prefetch are SOLVED problems. We know the numbers.
> > > Even with a 32KB cache, essentially perfect Prefetch or equivalents (like a perfect L1I) are
> > > worth about 25% (details vary depending on exact details of core width, L2 parameters, etc).
> > > Existing practical prefetchers like RDIP get you about 20%.
> > > Experimental prefetchers like D-JOLT (easy to implement
> > > if, like Apple, you already have an RDIP-like infrastructure)
> > > get you a few percent closer to perfection. Even
> > > better is something like an Entangled I-prefetcher, but that requires building a new infrastructure.
> > > Point is we know how to avoid basically *all* the costs of I-prefetch
> > > - Decoupled Fetch (check for Apple, ARM Ltd, and I expect recent x86)
> > > - Decoupled Address generation (next step, no-one is doing
> > > this yet, not even Apple, but it's "not hard" (hah!))
> > > - something like D-JOLT to handle "long-distance" Prefetch.
> > >
> > > What remains a problem is not the presence (or not) of instructions in the L1I; it is
> > > the presence (or not) of control flow data relevant to those new instructions...
> > > That's why IBM z/ devotes utterly insane amounts of area to their L2 BTB. And it's why I think Decoupled
> > > Address generation is the obvious next step for Apple and then everyone else (Decoupled Address generation
> > > allows for the occasional access of an L2 BTB taking 2 or 3 cycles rather than single cycle).
> > > But making the ISA denser helps with none of this!
> >
> > I would make this proposal 8 address registers and 4 stacks, as that is what is going to happen.
> > The goal is to split the rename in half, addressing and data, and this also would help in read/write ports.
> >
> > The ghost of the 68000 may live again. ;)
> > Except not sucky and brain damaged like the 68000. ;)
>
> How the New68k architecture will first show up as an x86 extension. ;)
>
> x86 has up to a 5% disadvantage to ARM64 due to having only 16 integer registers.
> Now x86 could just add more integer registers but a better option is
> available, adding single integer instructions to the vector unit.
>
> The vector unit already has float and double support that does not cause the clock slowdown of using the full
> vector unit. And by adding single integer instructions to the vector unit you get another 6 wide of read/write
> ports to use, plus a separate rename unit. The only hard work is widening the commit stage to 10 or so.
>
> X86 will leap back into the front of benchmarks ahead of the Apple M2.
>
> Yes these will be 6 byte instructions, but we are talking about leaf functions
> with loops, and so you will do 8 wide or so issue from the decode cache.
>
> This started as a troll, but the idea has grown on me, prove me wrong. ;)
Crickets. ;)
I have been pushing a revival of split register 68k style addressing for two decades now.
Back in the three wide generation I was rightly mocked as the complexity was not needed.
In the five wide generation the 68k itself was mocked, which was not my point.
Now crickets. ;)
Engineers are so predictable: you can’t talk because now this arch is in the cards.
The problem is getting Microsoft on board, as AMD and Intel will not agree on encoding.
Of course the ARM groups read this, and now they know, and can do the same. ;)
There is enough employee churn between these companies that they have no secrets.
At this point I would bet Apple is the first to do single ints in the vector unit, back to 68k style split register files for the Mac. ;)
And 10 wide execute to crush the competition. ;)
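For anyone who wants the numbers behind the density argument, here is a rough back-of-envelope sketch. It only uses figures already in this thread (32KB L1I, 4 vs 3.2 byte average instructions, 6 byte instructions at 8 wide); the function names are made up for illustration, and it says nothing about real hit rates or performance.

```python
from fractions import Fraction

# Figures quoted in the thread; the helper names below are hypothetical.
L1I_BYTES = 32 * 1024  # the 32KB L1I from the quoted reply


def instructions_in_cache(cache_bytes, avg_instr_bytes):
    """Count how many instructions of a given average size fit in the cache."""
    # Fraction keeps 3.2 exact, so the division does not pick up float error.
    return int(Fraction(cache_bytes) / Fraction(str(avg_instr_bytes)))


def fetch_bytes_per_cycle(issue_width, instr_bytes):
    """Bytes per cycle a byte-oriented fetch would need to feed this width."""
    return issue_width * instr_bytes


baseline = instructions_in_cache(L1I_BYTES, 4)    # fixed 4 byte instructions
denser = instructions_in_cache(L1I_BYTES, "3.2")  # the 3.2 byte average
wide_fetch = fetch_bytes_per_cycle(8, 6)          # 6 byte instructions, 8 wide

print(baseline, denser, wide_fetch)
```

So the denser encoding fits a quarter more instructions in the same L1I, and 8 wide issue of 6 byte instructions would need 48 bytes per cycle from a byte-oriented fetch, which is why the decode cache matters for this case.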