By: Brett (ggtgp.delete@this.yahoo.com), August 3, 2022 4:31 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on August 3, 2022 2:55 pm wrote:
> Adrian (a.delete@this.acm.org) on August 3, 2022 11:33 am wrote:
> > NoSpammer (no.delete@this.spam.com) on August 3, 2022 7:55 am wrote:
> > > Adrian (a.delete@this.acm.org) on August 3, 2022 2:19 am wrote:
> > > > Mark Roulo (nothanks.delete@this.xxx.com) on August 2, 2022 6:48 pm wrote:
> > > > >Stack or memory based architecture (you really want registers)
> > > >
> > > >
> > > > Nobody has proven yet whether encoding the operands implicitly with operations defined
> > > > on a stack is better or worse than encoding the operands using register numbers.
> > >
> > > So mainstream architectures that used to use stack-like structures like intel x87
> > > and archs using register windows being obsoleted are not a practical proof?
> >
> >
> > I have already said that the problem of 8087 and of all other now obsolete
> > ISAs is using only 1 operand stack, not unsing operand stacks.
> >
> > So what you claim to be a practical proof is certainly no proof at all.
> >
> >
> > >
> > > > The problem with the early stack-based instruction encodings, like Intel
> > > > 8087, was that they used ONLY ONE STACK, not that they were stack based.
> > >
> > > How would N+1 stacks improve performance for general purpose
> > > code and compilers? How about spillovers and function calls?
> >
> > The architectures with only 1 stack have been made obsolete
> > by the CPUs with multiple pipelined execution units.
> >
> > Any such CPU needs the interleaving of multiple chains of dependent instructions to keep
> > busy the execution units. An instruction stream using operand stacks cannot encode more
> > separate chains of dependent instructions than the number of operand stacks.
> >
> > That is why any revival of a stack-based ISA would need to use multiple operand
> > stacks, as many as the number of distinct chains of dependent instructions
> > that are needed to keep all the pipelined execution units filled.
> >
> >
> >
> >
> > >
> > > > In modern OoO CPUs, the register numbers used in the instruction encodings are no longer
> > > > addresses, but only means to specify the data dependencies between instructions. The same
> > > > data dependencies can be specified implicitly when the operand belong to a stack.
> > >
> > > I would correct you that the register numbers are used to
> > > specify EXPLICIT dependencies between instructions,
> > > which is very important if you want to parse dependencies quickly without additional indirection.
> >
> > Here you might touch a real problem of a stack-based ISA. I have not tried to implement
> > such a CPU, so I do not know how much complexity is added by the need to generate
> > the equivalent register numbers by keeping track of a stack position.
> >
> > It is possible that a stack-based ISA might need too much extra
> > area and/or it might have a lower clock frequency limit.
> >
> > However, I have already said this in my previous post, maybe a stack-based
> > ISA is not worthwhile, but in any case nobody has proven this yet.
> >
> > The ISAs with only 1 stack have been rightly abandoned, but the reason why that
> > has been done is not applicable for an ISA with multiple operand stacks.
> >
> >
> >
> >
> > >
> > > > Allowing each instruction to specify 1 of 4 operand stacks might be enough
> > > > to remove most false dependencies, while specifying 1 of 8 operand stacks
> > > > is pretty much certain to prevent the apparition of false dependencies.
> > >
> > > Sure if 8 registers only works kinda ok for most code, 8 stacks will do. Maybe you actually
> > > keep the top of the stack in the register and spill over the stack, oh look, x86.
> >
> > No, the number of operand stacks which are needed for a given throughput is much less
> > than the number of addressable registers that are needed for the same throughput.
> >
> > When compared with an ISA with 3 register addresses per instruction, an ISA with 8 operand stacks
> > is more like having at least 24 addressable registers, as they can have similar limits for the
> > number of distinct chains of dependent instructions that can be encoded. The exact equivalence
> > would vary from program to program. It is likely that are many cases when encoding 8 distinct
> > chains of dependent instructions would require much more than 32 addressable registers.
> >
> > The register renaming could be implemented in such a way so that only the sum
> > of the stack lengths of all stacks is limited by the number of registers in the
> > register file, without lower limits for the length of any individual stack.
> >
>
> What is the problem you are trying to solve? You want to shrink the average size
> of an instruction from 4 bytes to, I don't know, 3.2 bytes? Why is that worth doing?
> Or, if worth doing, not worth doing via something like CodePack instead?
>
> If you insist on stacks (even 8 stacks) you give up on register renaming and all that implies
> for extreme OoO (hundreds of instruction in size), extreme width, and extreme speculation.
> And if your answer is "well I'll implement the top few registers of the stacks as renamed
> registers", well, then, what exactly are we simplifying by not using registers directly?
>
> I understand the impulse; we're all irritated by the ISA bits that are burned by having
> to specify each of multiple registers. But ultimately, that's what computing IS -- specifying
> moving data. In a way, your viewpoint is like people still insisting that computing is
> about adds and multiplies, and everything else is "overhead". NO, it's NOT!
> There is some specialized computing that is about adds and multiplies, yes. But
> general computing is as much or more about moving data around (with everything
> that implies about specifying addresses) as it is about adds and multiplies.
>
> Instruction Fetch and Prefetch are SOLVED problems. We know the numbers.
> Even with a 32KB cache, essentially perfect Prefetch or equivalents (like a perfect L1I) are
> worth about 25% (details vary depending on exact details of core width, L2 parameters, etc).
> Existing practical prefetchers like RDIP get you about 20%. Experimental prefetchers like D-JOLT (easy to implement
> if, like Apple, you already have an RDIP-like infrastructure) get you a few percent closer to perfection. Even
> better is something like an Entangled I-prefetcher, but that requires building a new infrastructure.
> Point is we know how to avoid basically *all* the costs of I-prefetch
> - Decoupled Fetch (check for Apple, ARM Ltd, and I expect recent x86)
> - Decoupled Address generation (next step, no-one is doing
> this yet, not even Apple, but it's "not hard" (hah!)
> - something like D-JOLT to handle "long-distance" Prefetch.
>
> What remains a problem is not the presence (or not) of instructions in the L1I; it is
> the presence (or not) of control flow data relevant to those new instructions...
> That's why IBM z/ devotes utterly insane amounts of area to their L2 BTB. And it's why I think Decoupled
> Address generation is the obvious next step for Apple and then everyone else (Decoupled Address generation
> allows for the occasional access of an L2 BTB taking 2 or 3 cycles rather than single cycle).
> But making the ISA denser helps with none of this!
I would make this proposal 8 address registers and 4 stacks, as that is what is going to happen.
The goal is to split the rename in half, addressing and data, and this also would help in read/write ports.
The ghost of the 68000 may live again. ;)
Except not sucky and brain damaged like the 68000. ;)
> Adrian (a.delete@this.acm.org) on August 3, 2022 11:33 am wrote:
> > NoSpammer (no.delete@this.spam.com) on August 3, 2022 7:55 am wrote:
> > > Adrian (a.delete@this.acm.org) on August 3, 2022 2:19 am wrote:
> > > > Mark Roulo (nothanks.delete@this.xxx.com) on August 2, 2022 6:48 pm wrote:
> > > > >
> > > >
> > > >
> > > > Nobody has proven yet whether encoding the operands implicitly with operations defined
> > > > on a stack is better or worse than encoding the operands using register numbers.
> > >
> > > So mainstream architectures that used to use stack-like structures like intel x87
> > > and archs using register windows being obsoleted are not a practical proof?
> >
> >
> > I have already said that the problem of 8087 and of all other now obsolete
> > ISAs is using only 1 operand stack, not unsing operand stacks.
> >
> > So what you claim to be a practical proof is certainly no proof at all.
> >
> >
> > >
> > > > The problem with the early stack-based instruction encodings, like Intel
> > > > 8087, was that they used ONLY ONE STACK, not that they were stack based.
> > >
> > > How would N+1 stacks improve performance for general purpose
> > > code and compilers? How about spillovers and function calls?
> >
> > The architectures with only 1 stack have been made obsolete
> > by the CPUs with multiple pipelined execution units.
> >
> > Any such CPU needs the interleaving of multiple chains of dependent instructions to keep
> > busy the execution units. An instruction stream using operand stacks cannot encode more
> > separate chains of dependent instructions than the number of operand stacks.
> >
> > That is why any revival of a stack-based ISA would need to use multiple operand
> > stacks, as many as the number of distinct chains of dependent instructions
> > that are needed to keep all the pipelined execution units filled.
> >
> >
> >
> >
> > >
> > > > In modern OoO CPUs, the register numbers used in the instruction encodings are no longer
> > > > addresses, but only means to specify the data dependencies between instructions. The same
> > > > data dependencies can be specified implicitly when the operand belong to a stack.
> > >
> > > I would correct you that the register numbers are used to
> > > specify EXPLICIT dependencies between instructions,
> > > which is very important if you want to parse dependencies quickly without additional indirection.
> >
> > Here you might touch a real problem of a stack-based ISA. I have not tried to implement
> > such a CPU, so I do not know how much complexity is added by the need to generate
> > the equivalent register numbers by keeping track of a stack position.
> >
> > It is possible that a stack-based ISA might need too much extra
> > area and/or it might have a lower clock frequency limit.
> >
> > However, I have already said this in my previous post, maybe a stack-based
> > ISA is not worthwhile, but in any case nobody has proven this yet.
> >
> > The ISAs with only 1 stack have been rightly abandoned, but the reason why that
> > has been done is not applicable for an ISA with multiple operand stacks.
> >
> >
> >
> >
> > >
> > > > Allowing each instruction to specify 1 of 4 operand stacks might be enough
> > > > to remove most false dependencies, while specifying 1 of 8 operand stacks
> > > > is pretty much certain to prevent the apparition of false dependencies.
> > >
> > > Sure if 8 registers only works kinda ok for most code, 8 stacks will do. Maybe you actually
> > > keep the top of the stack in the register and spill over the stack, oh look, x86.
> >
> > No, the number of operand stacks which are needed for a given throughput is much less
> > than the number of addressable registers that are needed for the same throughput.
> >
> > When compared with an ISA with 3 register addresses per instruction, an ISA with 8 operand stacks
> > is more like having at least 24 addressable registers, as they can have similar limits for the
> > number of distinct chains of dependent instructions that can be encoded. The exact equivalence
> > would vary from program to program. It is likely that are many cases when encoding 8 distinct
> > chains of dependent instructions would require much more than 32 addressable registers.
> >
> > The register renaming could be implemented in such a way so that only the sum
> > of the stack lengths of all stacks is limited by the number of registers in the
> > register file, without lower limits for the length of any individual stack.
> >
>
> What is the problem you are trying to solve? You want to shrink the average size
> of an instruction from 4 bytes to, I don't know, 3.2 bytes? Why is that worth doing?
> Or, if worth doing, not worth doing via something like CodePack instead?
>
> If you insist on stacks (even 8 stacks) you give up on register renaming and all that implies
> for extreme OoO (hundreds of instruction in size), extreme width, and extreme speculation.
> And if your answer is "well I'll implement the top few registers of the stacks as renamed
> registers", well, then, what exactly are we simplifying by not using registers directly?
>
> I understand the impulse; we're all irritated by the ISA bits that are burned by having
> to specify each of multiple registers. But ultimately, that's what computing IS -- specifying
> moving data. In a way, your viewpoint is like people still insisting that computing is
> about adds and multiplies, and everything else is "overhead". NO, it's NOT!
> There is some specialized computing that is about adds and multiplies, yes. But
> general computing is as much or more about moving data around (with everything
> that implies about specifying addresses) as it is about adds and multiplies.
>
> Instruction Fetch and Prefetch are SOLVED problems. We know the numbers.
> Even with a 32KB cache, essentially perfect Prefetch or equivalents (like a perfect L1I) are
> worth about 25% (details vary depending on exact details of core width, L2 parameters, etc).
> Existing practical prefetchers like RDIP get you about 20%. Experimental prefetchers like D-JOLT (easy to implement
> if, like Apple, you already have an RDIP-like infrastructure) get you a few percent closer to perfection. Even
> better is something like an Entangled I-prefetcher, but that requires building a new infrastructure.
> Point is we know how to avoid basically *all* the costs of I-prefetch
> - Decoupled Fetch (check for Apple, ARM Ltd, and I expect recent x86)
> - Decoupled Address generation (next step, no-one is doing
> this yet, not even Apple, but it's "not hard" (hah!)
> - something like D-JOLT to handle "long-distance" Prefetch.
>
> What remains a problem is not the presence (or not) of instructions in the L1I; it is
> the presence (or not) of control flow data relevant to those new instructions...
> That's why IBM z/ devotes utterly insane amounts of area to their L2 BTB. And it's why I think Decoupled
> Address generation is the obvious next step for Apple and then everyone else (Decoupled Address generation
> allows for the occasional access of an L2 BTB taking 2 or 3 cycles rather than single cycle).
> But making the ISA denser helps with none of this!
I would make this proposal 8 address registers and 4 stacks, as that is what is going to happen.
The goal is to split the rename in half, addressing and data, and this also would help in read/write ports.
The ghost of the 68000 may live again. ;)
Except not sucky and brain damaged like the 68000. ;)