By: Marcus (m.delete@this.bitsnbites.eu), August 16, 2022 6:12 am
Room: Moderated Discussions
Wilco (wilco.dijkstra.delete@this.ntlworld.com) on August 16, 2022 5:44 am wrote:
> Marcus (m.delete@this.bitsnbites.eu) on August 16, 2022 4:53 am wrote:
> > Doug S (foo.delete@this.bar.bar) on August 15, 2022 9:52 pm wrote:
> > > Wilco (wilco.dijkstra.delete@this.ntlworld.com) on August 15, 2022 2:53 pm wrote:
> > > > Fusion is typically used on 2 instructions at a time, so fusing 4-5 instructions would be significantly
> > > > more complex. You could do wider fetch, but there is no chance you would be able to beat the throughput
> > > > of the load/store unit without spending a ridiculous amount of hardware on fetching and fusion,
> > > > adding extra cycles for fetch/decode, not to mention significantly more power.
> > >
> > >
> > > Sure, but that's because idioms that lend themselves to fusion (i.e. to what would usually be one CISCy
> > > instruction) typically are paired instructions so that's where you get the most bang for your buck.
> > >
> > > If an implementation can fuse two load immediate instructions for a 32 bit quantity it doesn't
> > > seem that hard for it to check if there are two more load immediate instructions following
> > > it to create a 64 bit quantity. If that check can't be done in a particular implementation
> > > due to complexity/power/whatever you're only one cycle slower than you would be otherwise
> > > - still matching the latency of the best case constant table load from L1.
> > >
> > > Perhaps if implementations have difficulty fusing four load immediate instructions a little help could be
> > > given in the ISA. Create a new load immediate opcode that provides the low 16 bits of a 64 bit value and is
> > > defined to ALWAYS be followed by three more load immediate instructions providing the remaining 48 bits.
> > >
> >
> > That sounds dangerous - it could lead to different behavior on different
> > implementations, if the guarantee does not hold true (bad code).
> >
> > > Upon seeing the first instruction the decoder would know it can fuse that with the next three
> > > and simply pull the required 16 bits out of each without fully decoding them. An implementation
> > > would not be required to fuse them, of course, the first instruction is just a 'hint' providing
> > > information to allow it, but not obligate it, to fuse all four. Indeed, it is quite possible
> > > a "big core" might fuse all four while a "little core" fuses only pairs or not at all.
> > >
> > > Whether the cost of an additional opcode is worth it depends on how often you
> > > think you'll have 64 vs 32 bit immediates and how difficult it would be for the
> > > decoder to fuse four instruction without getting this new "hint" instruction.
> >
> > I have not made any fusion hardware before, so I don't know what goes as simple vs complex. However, my
> > current plan would be to construct a 64-bit constant using three consecutive 32-bit instructions: An initial
> > instruction that is a pure load 20-bit immediate (sign extend), and two following shift-left-22-and-insert-22-bit-immediate
> > instructions (or something along those lines). The fusion logic would need to match 6+5+5=16 bits against
> > a known pattern (i.e. the opcode fields) and ensure that three 5-bit fields are equal (i.e. the destination
> > register specifiers). It does not sound overly-complex, but I could be wrong.
>
> The complexity comes from having to align the instructions somehow - you'll need to recognize this pattern
> at different offsets in each fetch and realign instructions to decoders based on it. You'll need to correctly
> handle multiple overlapping matches and partial matches at fetch boundaries. Then each decoder will need
> wiring for 3 instructions rather than 1 and emit a wider micro-op with the extra immediate.
>
> My question is whether you have any actual data that suggests burning 27 bits of your encoding
> space on it and adding fusion is really worth it? Good ISA design is all about measuring frequencies
> of particular idioms and only adding instructions (or uarch optimizations like fusion) where
> the improved performance is worth the cost. Modern cores mostly fuse compare+branch since those
> instructions are extremely common. 64-bit integer immediates simply aren't common.
>
> Wilco
True, and I don't have a good answer yet. Since my work is just a hobby project, I don't have the resources, funding or time to do a complete analysis of every design decision. My method is mostly trial and error: I look at the code the compiler generates for different architectures, compare that to what I get for my own architecture, and do some fairly basic frequency measurements (compiling a bunch of programs and counting instructions). Since my GCC back-end isn't perfectly tuned, these measurements are not fully accurate either.
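To make "counting instructions" concrete, here is a minimal sketch of that kind of frequency measurement. It assumes objdump-style disassembly text in which each instruction line has the form "address:<TAB>hex bytes<TAB>mnemonic operands"; the function name, the sample mnemonics and the sample bytes are made up for illustration and are not taken from any real toolchain or ISA.

```python
from collections import Counter


def count_mnemonics(disassembly: str) -> Counter:
    """Count instruction mnemonics in objdump-style disassembly text.

    Assumed line format (an assumption, adjust to your tool's output):
        "<addr>:\t<hex bytes>\t<mnemonic> <operands>"
    Lines that do not match (symbol headers, blank lines) are skipped.
    """
    counts = Counter()
    for line in disassembly.splitlines():
        fields = line.split("\t")
        # Need address, bytes and instruction text; address field ends in ':'.
        if len(fields) < 3 or not fields[0].strip().endswith(":"):
            continue
        text = fields[2].strip()
        if text:
            counts[text.split()[0]] += 1
    return counts


# Hypothetical disassembly snippet, for illustration only.
SAMPLE = "\n".join([
    "00000000 <main>:",
    "   0:\t01 23 45 67\tldi   r1, #1234",
    "   4:\t89 ab cd ef\tadd   r2, r1, r1",
    "   8:\t01 23 45 68\tldi   r3, #5678",
    "   c:\t11 22 33 44\tbz    r2, 0 <main>",
])

if __name__ == "__main__":
    for mnemonic, n in count_mnemonics(SAMPLE).most_common():
        print(f"{mnemonic:8s} {n}")
```

Counting mnemonics alone says nothing about immediate widths; to estimate how common 64-bit immediates are, the operand field would have to be parsed as well.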
The encoding of immediate values (immediate loads, arithmetic ops, bitwise ops, branch offsets, address offsets, floating-point and integer, etc.) is certainly one of the trickier parts, and it's a moving target: if you tweak one instruction or encoding, other related instructions are affected and may become largely redundant. Still, it's an interesting challenge.
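As a rough illustration of the 20+22+22-bit split from the quoted plan above, the arithmetic can be sketched as follows. This only models the values involved, not an actual instruction encoding; the function names are hypothetical, and the exact semantics (sign-extended 20-bit load, then two shift-left-by-22-and-insert steps) are assumed from the description "or something along those lines".

```python
MASK64 = (1 << 64) - 1
MASK22 = (1 << 22) - 1
MASK20 = (1 << 20) - 1


def split_const(value: int) -> tuple[int, int, int]:
    """Split a 64-bit value into 20-bit top, 22-bit middle, 22-bit low parts."""
    value &= MASK64
    return (value >> 44) & MASK20, (value >> 22) & MASK22, value & MASK22


def load_imm20(imm20: int) -> int:
    """Model of the initial instruction: load a sign-extended 20-bit immediate."""
    imm20 &= MASK20
    if imm20 & (1 << 19):      # sign bit of the 20-bit field
        imm20 -= 1 << 20       # sign-extend
    return imm20 & MASK64


def shift_insert22(reg: int, imm22: int) -> int:
    """Model of the follow-up instruction: shift left 22, insert 22 bits."""
    return ((reg << 22) | (imm22 & MASK22)) & MASK64


def build_const(value: int) -> int:
    """Run the three-step sequence and return the resulting register value."""
    top, mid, low = split_const(value)
    reg = load_imm20(top)
    reg = shift_insert22(reg, mid)
    return shift_insert22(reg, low)


if __name__ == "__main__":
    for v in (0, 1, 0x7FFFFFFFFFFFFFFF, 0xDEADBEEFCAFEF00D, MASK64):
        assert build_const(v) == v
    print("all constants reconstructed")
```

Because 20 + 22 + 22 = 64, the sign-extension bits from the first step are shifted out entirely by the two following steps, so any 64-bit value can be reconstructed this way.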