By: anon2 (anon.delete@this.anon.com), August 4, 2022 7:51 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on August 4, 2022 7:33 pm wrote:
> anon2 (anon.delete@this.anon.com) on August 3, 2022 5:07 pm wrote:
> > Doug S (foo.delete@this.bar.bar) on August 3, 2022 11:44 am wrote:
> > > anon2 (anon.delete@this.anon.com) on August 3, 2022 12:34 am wrote:
> > > > gallier2 (gallier2.delete@this.gmx.de) on August 2, 2022 11:34 pm wrote:
> > > > > anon2 (anon.delete@this.anon.com) on August 2, 2022 10:41 pm wrote:
> > > > > > Doug S (foo.delete@this.bar.bar) on August 2, 2022 9:09 pm wrote:
> > > > > > > > Yes that was my point - this is a much better approach than forcing big immediates in the instruction
> > > > > > > > stream. Given that most 64-bit immediates are "easy" (many bits are zeroes, ones or some other
> > > > > > > > repeating pattern), you can often create a 64-bit immediate in just 2 instructions. So that
> > > > > > > > makes any instruction that emits a full 64-bit immediate even less useful.
> > > > > > >
> > > > > > >
> > > > > > > And unless you REALLY care about code density, you should expect implementations to
> > > > > > > be able to combine the 2 or 4 instructions building up a 64-bit immediate into a single
> > > > > > > instruction in the pipeline.
> > > > > >
> > > > > > Debatable. Having an internal operation size that can accommodate an operation with a 64-bit
> > > > > > immediate could have some implementation cost. The infrequency of such immediates means
> > > > > > it may not be worthwhile. But you could say that if such a thing did prove to be useful,
> > > > > > then the RISC approach does not prevent it from being done with instruction fusion.
> > > > >
> > > > > Is it really that infrequent? I remember vividly my gcc SPARC compiler on our Solaris servers generating
> > > > > a lot of 64-bit constants (so-called absolute addresses). The code was full of the 5-instruction
> > > > > sequence to generate a 64-bit constant (load immediate, shift, load immediate, or).
> > > > > It might be that the compiler we used was odd, but absolute addresses
> > > > > are generally 64 bits wide and therefore not that rare.
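(Interjecting with an illustration, in C rather than actual compiler output: roughly what that piecewise sequence computes is just splitting the constant and gluing it back together, whereas "easy" patterns can often be folded into one or two operations. Values here are made up.)

    #include <stdint.h>

    /* Illustration only, not real compiler output. */
    static uint64_t build_hard_constant(void) {
        uint64_t hi = 0x12345678u;    /* immediate for the upper half           */
        uint64_t lo = 0x9ABCDEF0u;    /* immediate for the lower half           */
        return (hi << 32) | lo;       /* shift + or to glue the halves together */
    }

    static uint64_t build_easy_constant(void) {
        /* runs of ones/zeroes or sign-extended values often need only 1-2 ops */
        return 0xFFFFFFFFFFFFF000ull;
    }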
> > > >
> > > > You're right for some ISAs and ABIs.
> > > >
> > > > Today things usually use relative addressing, which tends to mostly use small constants
> > > > with PC+imm or PC+index+imm addressing modes, and/or a doubly indirect kind of offset
> > > > table addressing scheme (which also only needs small constants for the tables).
> > > >
> > > > The way things are today, I doubt a new ISA or ABI would go back to requiring a lot of very large address
> > > > immediates. Having said that, x86-64 regularly uses something like imm32+PC, which is a largeish immediate,
> > > > and 2GB is really not a lot of addressability these days. Fortunately, the way most things are built,
> > > > most memory accesses occur either via pointers passed around interfaces, or within smaller units, so even
> > > > 32-bit is probably much larger than needed for the vast majority of memory accesses by dynamic count.
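(The "doubly indirect kind of offset table" scheme above is essentially the GOT-style pattern; a self-contained C sketch with made-up names, illustrative only: the full-width address lives in a data table reached with a small offset, not in the instruction stream.)

    /* Names and slot numbers are invented for illustration. */
    static long foo_storage = 42;

    enum { SLOT_FOO = 3 };              /* hypothetical slot assigned at link time */
    static void *offset_table[] = { 0, 0, 0, &foo_storage };

    static long load_foo(void) {
        long *p = offset_table[SLOT_FOO];   /* first load: pointer from the table */
        return *p;                          /* second load: the data itself       */
    }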
> > >
> > >
> > > But going back to internal addressing formats, relative addresses must be converted (the immed and
> > > index added to the base/PC) into absolute addresses before the load/store/jmp/etc. So presumably the
> > > internal format has the ability to carry at least as many addressing bits as the ISA supports - and
> > > that number is fairly large for a general purpose architecture in 2022 and increasing over time.
> > >
> >
> > Not presumably. Source data is not carried with the instruction/uop in many high performance designs. Even
> > in cases where it is, a uop format does not need to remain fixed through all stages of the pipeline; you do
> > not have to have latches and buses for all source data in the front end, only from issue to execute.
>
> This is an interesting idea and certainly feasible
Not really interesting, I think; it just is.
> (for example you could imagine that
> branches allocate a register from a hidden pool of "branch" registers,
Registers, when you're talking about hardware, are just a particular type of storage, and not a distinction that really matters IMO. If it's not an ISA register (i.e., it is "hidden" and just part of the implementation pipeline design) then it's just some storage with particular access properties. A lot of such data related to an instruction gets allocated as it goes through the pipeline.
> and this register
> holds the PC+offset that will be required by the branch at some later point).
"uops" as you think of them contain contain connections to many pieces of data in various structures, during many stages of pipeline. a uop probably does not carry its instruction address (PC) along with it (except in very early fetch stages if any). But it almost certainly does contain the necessary information to be able to find that PC. So that information exists. Whether it is duplicated and staged nearby the scheduler or branch unit when it comes to issue or execute is a matter about implementation detail of what the control and data paths of various structures look like, what is most power efficient.