By: Maynard Handley (name99.delete@this.name99.org), November 4, 2019 7:17 pm
Room: Moderated Discussions
Wilco (wilco.dijkstra.delete@this.ntlworld.com) on November 4, 2019 1:49 pm wrote:
> David Hess (davidwhess.delete@this.gmail.com) on November 4, 2019 10:05 am wrote:
> > j (invalid.delete@this.example.net) on November 3, 2019 11:30 pm wrote:
> > >
> > > Sorry, poor wording on my part. I didn't mean anything related to hypervisors, virtual
> > > machines or the like. What I meant was whatever microarchitectural tricks which are
> > > needed to avoid bottle-necking performance on writing/reading the flags register.
> > >
> > > AFAIK one common approach is to split up the flags register into several "virtual" registers (hence
> > > my use of the term "virtualizing") that can be handled separately. I don't know if they go all the way
> > > to having one such virtual register for each bit in the flags register, or if they are grouped.
> > >
> > > Another might be to duplicate the flags, so that each normal
> > > register would have an associated flags register
> > > containing the flags for the latest instruction that wrote to that register, and then instructions that
> > > depend on the flag register such as jumps would somehow need to pick up a dependency on the correct flags
> > > register. Or maybe this is pointless if you just rename the flags as any normal register?
> >
> > Maybe you are the person to ask. Why not implement a duplicate
> > set of narrow registers to hold all ALU flag
> > results? Some ISAs (Power?) implement multiple addressable flag registers but instead, extend this to a
> > full set in parallel with the register file removing the need to separately address them on stores.
> >
> > Not only would store addressing be free since it is just the register address, but
> > reads from the parallel register file holding the flags do not compete with regular
> > register file reads. For instance if the zero flag was saved even though it can be
> > computed at any time, then a test for zero does not require register file access.
>
> However you'd need 5 extra bits to specify which register produced the flags in any instruction consuming
> flags. And that is particularly problematic for branches. Then there is the correctness/security aspect
> of code relying on the flags across calls, so you'd need to clear them explicitly.
>
> Wilco
There seem to be several orthogonal issues here:
- have a well-defined set of flags and modify them all or nothing. Intel made mistakes here (x86 INC and DEC, for example, update the arithmetic flags but leave CF alone); ARM and IBM both have 4 flags that all change or don't.
- only set flags if you care about them. Again Intel is not great here; IBM did very nicely, and ARM likewise. (Though IBM wins with much nicer assembly syntax!)
- have multiple ISA-visible sets of flags. Obviously IBM wins here, both in thinking of the idea and in supporting it all the way, from providing 8 sets to having sets 0 and 1 serve as the independent defaults for INT and FP. And I'd love to say Yay IBM. BUT it is a depressing fact that in many, many years of writing PPC assembly (and closely examining the assembly generated by the compiler) I almost never saw any real use case for this...
There were a very few places (I remember, for example, some inner loops of the JPEG2000 codec) where one could calculate one set of flags outside a loop and keep those valid (and testable) inside the loop along with the default integer set 0 of flags. But so rare -- and probably not that big a deal given how wide machines were even back then, and the constraints on which instructions could execute simultaneously.
- ARM also wins over IBM in that they provide the kind of "test and test again in one instruction" specializations that look weird and non-orthogonal, but actually match what one almost always wants. This is, perhaps, their equivalent (in terms of value) of multiple flag sets with the ability to perform two comparisons in a cycle?
- finally, what about the bignum arithmetic problem? As long as you have a decent OoO machine with flag renaming (i.e. any ARM core of actual interest), how problematic is it? If the bignums are short, you fill the instruction queue with multiple independent streams of arithmetic (all using renamed registers and flags) and dataflow takes over. If you have really large bignums, so that we're talking about filling a ~60+ entry instruction queue with dependent instructions, yeah, in that case life is probably suboptimal, and you'll lag behind IBM and Intel.
I'm not sure that case is common enough (whether in HPC, in crypto, or in Mathematica) to design for. (Or whether, in any case that actually matters, you're doing lots of these via vectors, and what's ACTUALLY important is how your vector ISA handles carries...)
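To make the bignum point concrete, here is a minimal Python sketch of multi-limb addition. It is not from the thread: the names `add_limbs` and `MASK64` are made up for illustration, and 64-bit limbs are an assumption. What it shows is that each limb's add needs the carry out of the previous limb, which is the serial dependency an add-with-carry chain puts on the flags, while two separate bignum additions share nothing and can be interleaved freely by an OoO core.

```python
MASK64 = (1 << 64) - 1  # assumed limb width: 64 bits

def add_limbs(a, b):
    """Add two equal-length bignums stored as little-endian lists of
    64-bit limbs. Returns (sum_limbs, carry_out)."""
    assert len(a) == len(b)
    out = []
    carry = 0  # plays the role of the carry flag
    for x, y in zip(a, b):
        total = x + y + carry        # depends on the previous limb's carry
        out.append(total & MASK64)   # low 64 bits are the result limb
        carry = total >> 64          # 0 or 1, consumed by the next limb
    return out, carry

# Two short, independent additions: neither carry chain depends on the other.
s1, c1 = add_limbs([MASK64, MASK64], [1, 0])
s2, c2 = add_limbs([1, 2], [3, 4])
```

With short bignums the hardware sees many chains like `s1` and `s2` side by side; with one very long bignum there is a single chain, and every limb waits on the carry from the one before it.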
Given all this, it's hard to fault what ARM did with AArch64. MAYBE if you write a lot of clever FP code, there's value in having both an integer and an FP flag set (basically adopt IBM's set 0 and set 1 approach)? I don't have enough FP experience to comment (and there are other things in every FP design, including ARM's, that strike me as far more problematic and thus much higher priority).
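The "test and test again in one instruction" bullet above presumably refers to the AArch64 conditional-compare instructions (CCMP and friends); that reading is an assumption, since the post does not name them. A rough Python model of the semantics, with hypothetical helper names (`cmp_flags`, `ccmp`, `is_eq`, `both_equal`): if the condition holds on the current flags, the flags are replaced by those of a new compare, otherwise they are set to an immediate NZCV value chosen so that the final test fails.

```python
def cmp_flags(a, b, bits=64):
    """NZCV flags as a CMP a, b (i.e. a - b) would set them."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    res = (a - b) & mask
    n = res >> (bits - 1)               # sign bit of the result
    z = int(res == 0)                   # result is zero
    c = int(a >= b)                     # carry set means no borrow
    sa, sb = a >> (bits - 1), b >> (bits - 1)
    v = int(sa != sb and n != sa)       # signed overflow on subtract
    return (n, z, c, v)

def is_eq(flags):
    """The EQ condition: Z flag set."""
    return flags[1] == 1

def ccmp(flags, cond, a, b, nzcv):
    """Model of a conditional compare: compare again only if cond holds,
    otherwise force the flags to the supplied NZCV value."""
    return cmp_flags(a, b) if cond(flags) else nzcv

def both_equal(x, y, kx, ky):
    """x == kx && y == ky as one compare plus one conditional compare."""
    flags = cmp_flags(x, kx)
    flags = ccmp(flags, is_eq, y, ky, (0, 0, 0, 0))  # Z=0 if first test failed
    return is_eq(flags)
```

So a two-part condition ends in a single flag test and a single branch, without needing a second named flag set to hold the first result.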