By: --- (---.delete@this.redheron.com), August 3, 2022 2:55 pm
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on August 3, 2022 11:33 am wrote:
> NoSpammer (no.delete@this.spam.com) on August 3, 2022 7:55 am wrote:
> > Adrian (a.delete@this.acm.org) on August 3, 2022 2:19 am wrote:
> > > Mark Roulo (nothanks.delete@this.xxx.com) on August 2, 2022 6:48 pm wrote:
> > > > Stack or memory based architecture (you really want registers)
> > >
> > >
> > > Nobody has proven yet whether encoding the operands implicitly with operations defined
> > > on a stack is better or worse than encoding the operands using register numbers.
> >
> > So mainstream architectures that used stack-like structures, like Intel x87,
> > and architectures using register windows, being obsoleted is not a practical proof?
>
>
> I have already said that the problem of the 8087 and of all other now obsolete
> ISAs is using only 1 operand stack, not using operand stacks.
>
> So what you claim to be a practical proof is certainly no proof at all.
>
>
> >
> > > The problem with the early stack-based instruction encodings, like Intel
> > > 8087, was that they used ONLY ONE STACK, not that they were stack based.
> >
> > How would N+1 stacks improve performance for general purpose
> > code and compilers? How about spillovers and function calls?
>
> The architectures with only 1 stack have been made obsolete
> by the CPUs with multiple pipelined execution units.
>
> Any such CPU needs the interleaving of multiple chains of dependent instructions to keep
> the execution units busy. An instruction stream using operand stacks cannot encode more
> separate chains of dependent instructions than the number of operand stacks.
>
> That is why any revival of a stack-based ISA would need to use multiple operand
> stacks, as many as the number of distinct chains of dependent instructions
> that are needed to keep all the pipelined execution units filled.
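[Aside from me, to make the counting argument concrete, in a made-up notation rather than anything Adrian actually proposed: to compute e = a*b + c*d, a one-stack encoding looks like "push a; push b; mul; push c; push d; mul; add", where every operation implicitly addresses the same top of stack, so a simple decoder sees a single serial chain through the stack unless it renames stack positions. With two stacks the products could be written "mul.s0" and "mul.s1", the independence of the two chains is explicit in the stack numbers, and one cross-stack add joins them at the end.]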
>
>
>
>
> >
> > > In modern OoO CPUs, the register numbers used in the instruction encodings are no longer
> > > addresses, but only a means to specify the data dependencies between instructions. The same
> > > data dependencies can be specified implicitly when the operands belong to a stack.
> >
> > I would correct you that the register numbers are used to
> > specify EXPLICIT dependencies between instructions,
> > which is very important if you want to parse dependencies quickly without additional indirection.
>
> Here you might be touching on a real problem of a stack-based ISA. I have not tried to implement
> such a CPU, so I do not know how much complexity is added by the need to generate
> the equivalent register numbers by keeping track of a stack position.
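[Aside from me: to make that bookkeeping concrete, here is a toy sketch, plain Python of my own rather than anything Adrian specified. A decode-time table tracks what each operand stack currently holds and hands out fresh physical names, which is exactly "generating the equivalent register numbers by keeping track of a stack position":

# Toy sketch only (my own illustration, not any real ISA): instructions
# name one of N operand stacks; decode tracks each stack's contents and
# hands out fresh physical names, recovering register-style dependencies.

def rename(program, num_stacks=8):
    stacks = [[] for _ in range(num_stacks)]   # physical names per stack
    next_name = 0
    renamed = []

    def fresh():
        nonlocal next_name
        name = "p%d" % next_name
        next_name += 1
        return name

    for ins in program:
        op, s = ins[0], ins[1]
        if op == "push":                       # load a value onto stack s
            dst = fresh()
            stacks[s].append(dst)
            renamed.append((op, [], dst))
        elif op == "move":                     # hypothetical inter-stack move
            dst_stack = ins[2]
            src = stacks[s].pop()
            dst = fresh()
            stacks[dst_stack].append(dst)
            renamed.append((op, [src], dst))
        else:                                  # binary op: pop two, push result
            srcs = [stacks[s].pop(), stacks[s].pop()]
            dst = fresh()
            stacks[s].append(dst)
            renamed.append((op, srcs, dst))
    return renamed

# e = a*b + c*d with the two products on different stacks: the renamed
# output shows the two muls with disjoint sources, i.e. two independent
# chains, and a cross-stack move feeds the final add.
prog = [("push", 0), ("push", 0), ("mul", 0),
        ("push", 1), ("push", 1), ("mul", 1),
        ("move", 1, 0), ("add", 0)]
for r in rename(prog):
    print(r)

Because every stack draws names from the same shared pool, only the sum of the stack depths is bounded by the pool size, which matches the renaming property claimed further down. What the sketch does not answer is the hardware cost of doing this per-stack tracking at full decode width, which is the real question here.]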
>
> It is possible that a stack-based ISA might need too much extra
> area and/or it might have a lower clock frequency limit.
>
> However, as I have already said in my previous post, maybe a stack-based
> ISA is not worthwhile, but in any case nobody has proven this yet.
>
> The ISAs with only 1 stack have been rightly abandoned, but the reason why that
> has been done is not applicable to an ISA with multiple operand stacks.
>
>
>
>
> >
> > > Allowing each instruction to specify 1 of 4 operand stacks might be enough
> > > to remove most false dependencies, while specifying 1 of 8 operand stacks
> > > is pretty much certain to prevent the appearance of false dependencies.
> >
> > Sure, if 8 registers only works kinda OK for most code, 8 stacks will do. Maybe you actually
> > keep the top of the stack in a register and spill over to the stack, oh look, x86.
>
> No, the number of operand stacks which are needed for a given throughput is much less
> than the number of addressable registers that are needed for the same throughput.
>
> When compared with an ISA with 3 register addresses per instruction, an ISA with 8 operand stacks
> is more like having at least 24 addressable registers, as they can have similar limits for the
> number of distinct chains of dependent instructions that can be encoded. The exact equivalence
> would vary from program to program. It is likely that there are many cases where encoding 8 distinct
> chains of dependent instructions would require many more than 32 addressable registers.
>
> The register renaming could be implemented in such a way that only the sum
> of the lengths of all the stacks is limited by the number of registers in the
> register file, with no lower limit on the length of any individual stack.
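[Back-of-envelope reading of that arithmetic, mine rather than anything stated above: in a 3-address encoding a chain in flight keeps roughly three architectural names live at a time (two sources plus a destination), so encoding 8 concurrent chains costs on the order of 8 x 3 = 24 named registers, whereas each operand stack carries its chain with no explicit names in the instruction at all.]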
>
What is the problem you are trying to solve? You want to shrink the average size of an instruction from 4 bytes to, I don't know, 3.2 bytes? Why is that worth doing? Or, if it is worth doing, why is it not better done via something like CodePack instead?
If you insist on stacks (even 8 stacks) you give up on register renaming and all that implies for extreme OoO (hundreds of instructions in flight), extreme width, and extreme speculation.
And if your answer is "well I'll implement the top few registers of the stacks as renamed registers", well, then, what exactly are we simplifying by not using registers directly?
I understand the impulse; we're all irritated by the ISA bits that are burned by having to specify each of multiple registers. But ultimately, that's what computing IS -- specifying how data moves. In a way, your viewpoint is like people still insisting that computing is about adds and multiplies, and everything else is "overhead". NO, it's NOT!
There is some specialized computing that is about adds and multiplies, yes. But general computing is as much or more about moving data around (with everything that implies about specifying addresses) as it is about adds and multiplies.
Instruction Fetch and Prefetch are SOLVED problems. We know the numbers.
Even with a 32KB cache, essentially perfect Prefetch or an equivalent (like a perfect L1I) is worth about 25% (the exact number varies with core width, L2 parameters, etc).
Existing practical prefetchers like RDIP get you about 20%. Experimental prefetchers like D-JOLT (easy to implement if, like Apple, you already have an RDIP-like infrastructure) get you a few percent closer to perfection. Even better is something like an Entangled I-prefetcher, but that requires building a new infrastructure.
The point is that we know how to avoid basically *all* the costs of I-prefetch:
- Decoupled Fetch (check for Apple, ARM Ltd, and I expect recent x86)
- Decoupled Address generation (the next step; no-one is doing this yet, not even Apple, but it's "not hard" (hah!))
- something like D-JOLT to handle "long-distance" Prefetch.
What remains a problem is not the presence (or not) of instructions in the L1I; it is the presence (or not) of control flow data relevant to those new instructions...
That's why IBM z/ devotes utterly insane amounts of area to its L2 BTB. And it's why I think Decoupled Address generation is the obvious next step for Apple and then everyone else (Decoupled Address generation allows the occasional L2 BTB access to take 2 or 3 cycles rather than a single cycle).
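For concreteness, here is a toy sketch of why the decoupling helps (Python, with an invented queue depth and invented latencies; these are not Apple's numbers or anyone else's): the branch predictor runs ahead and drops fetch addresses into a fetch-target queue, prefetch can fire the moment an address enters the queue, and an occasional slow multi-cycle BTB lookup on the prediction side just eats into the queue's occupancy instead of stalling fetch.

from collections import deque

FTQ_DEPTH = 16   # invented depth, purely for illustration

def simulate(n_cycles=60, slow_lookup_every=7, slow_penalty=2):
    ftq = deque()              # fetch-target queue: predicted fetch addresses
    predictor_stall = 0        # cycles left on a slow (say, L2 BTB) lookup
    next_addr = 0x1000
    fetched = 0
    fetch_stalls = 0

    for cycle in range(n_cycles):
        # Prediction side: up to 2 fetch addresses per cycle, so it runs
        # ahead of fetch whenever it is not stalled.
        if predictor_stall > 0:
            predictor_stall -= 1
        else:
            for _ in range(2):
                if len(ftq) < FTQ_DEPTH:
                    ftq.append(next_addr)   # a prefetch could fire right here
                    next_addr += 0x10
            if cycle > 0 and cycle % slow_lookup_every == 0:
                predictor_stall = slow_penalty   # pretend this one hit the L2 BTB

        # Fetch side: consumes one address per cycle, stalls only when empty.
        if ftq:
            ftq.popleft()
            fetched += 1
        else:
            fetch_stalls += 1

    return fetched, fetch_stalls

fetched, stalls = simulate()
print("fetched", fetched, "addresses; fetch stalled", stalls, "cycles")

Run it and the fetch side never stalls even though the predictor periodically loses 2 cycles: the run-ahead occupancy in the queue absorbs the slow lookups, which is exactly what makes a 2 or 3 cycle L2 BTB tolerable.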
But making the ISA denser helps with none of this!