By: anon2 (anon.delete@this.anon.com), April 27, 2017 10:07 am
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on April 27, 2017 8:05 am wrote:
> anon2 (anon.delete@this.anon.com) on April 27, 2017 5:57 am wrote:
> > anon (spam.delete.delete@this.this.spam.com) on April 27, 2017 4:08 am wrote:
> > > anon2 (anon.delete@this.anon.com) on April 26, 2017 6:00 pm wrote:
> > > > anon (spam.delete.delete@this.this.spam.com) on April 26, 2017 6:13 am wrote:
> > > > > anon2 (anon.delete@this.anon.com) on April 26, 2017 5:17 am wrote:
> > > > > > anon (spam.delete.delete@this.this.spam.com) on April 25, 2017 5:43 pm wrote:
> > > > > > > Paul A. Clayton (paaronclayton.delete@this.gmail.com) on April 24, 2017 9:34 am wrote:
> > > > > > > > anon (spam.delete.delete@this.this.spam.com) on April 24, 2017 1:00 am wrote:
> [snip]
> > > > > > > So if they get a group that does consist of 2 sub groups (it doesn't have to) then they can do
> > > > > > > rename in parallel. Now that obviously doesn't work if the sub groups depend on each other.
> > > > > >
> > > > > > What do you mean by a sub-group? POWER8 dispatches 6+2 groups in ST mode and 2*(3+1)
> > > > > > groups in SMT mode. SMT half groups do not depend on each other by definition.
> > > > > >
> > > > >
> > > > > The GCT (IBM's name for their weird version of an ROB) handles groups instead of single
> > > > > instructions. Each group (max 6+2 instr) consists of 1 or 2 sub groups (max 3+1 instr).
> > > > > You can get 2 subgroup groups even in SMT mode. Since they have to dispatch into the same
> > > > > slice they obviously need 2 cycles, but they only consume one entry in the GCT.
> > > >
> > > > Okay, I never heard of this sub group idea, but for ST it does not seem like a group or sub group is
> > > > restricted according to any GPR input or output, so I don't see how it looks different to rename.
> > > >
> > >
> > > One group dispatches in a single cycle in ST mode. Therefore not all instructions in it can
> > > be dispatched into the same slice.
> > > And that's when the fun starts. You could either wait
> > > for the result forwarding and kill dependency chains with latency or restrict the grouping.
> > > This ties into the ALU latency/issue ports problem, I can't rule out either option (forwarding
> > > without grouping restrictions or 2 cycle ALU with restrictions) for certain.
> >
> > So I still don't know how that looks different for rename. You have a mix of GPR inputs
> > and outputs being sent to each slice. How does that simplify rename for you?
> >
>
> Would you send instructions that depend on each other to different
> slices as much as possible because you like forwarding so much?
> Obviously not.
> If they don't you can rename in parallel.
1. "as much as possible"? Huh? Instructions go to slices not based on groups. Grouping does not solve anything.
2. Renaming still has to be synchronized otherwise how do instructions in the next cycle know how registers are mapped?
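To make point 2 concrete, here is a toy rename model (my own sketch, nothing to do with IBM's actual implementation; register names, free-list contents and group sizes are made up). Renaming in program order works because each instruction sees the mappings created just before it; two renamers each starting from a snapshot of the same RAT lose exactly those same-cycle mappings:

```python
# Toy model of register renaming, purely illustrative (register names,
# free-list contents and group sizes are invented, not POWER8's).

def rename_group(instrs, rat, free_list):
    """Rename a group in program order against a register alias table."""
    renamed = []
    for dst, srcs in instrs:
        # A source must see any mapping created earlier in this same group.
        phys_srcs = [rat[s] for s in srcs]
        phys_dst = free_list.pop(0)
        rat[dst] = phys_dst
        renamed.append((phys_dst, phys_srcs))
    return renamed

rat = {"r1": "p1", "r2": "p2"}
group = [("r3", ["r1"]),   # r3 <- f(r1)
         ("r4", ["r3"])]   # r4 <- g(r3): depends on the instruction above

# Sequential rename over a shared RAT gets the dependency right:
assert rename_group(group, dict(rat), ["p10", "p11"]) == [
    ("p10", ["p1"]), ("p11", ["p10"])]

# Two renamers each working from a snapshot of the same RAT do not:
# the second half never learns that r3 was remapped this cycle.
try:
    rename_group(group[1:], dict(rat), ["p11"])
except KeyError:
    pass  # stale RAT: the mapping for r3 is missing
```

Any split-renamer scheme has to forward those same-cycle mappings between the halves, which is exactly the synchronization I'm asking about.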
> > You talk as if these "sub groups" are the basis of how dispatch chooses which UQ to dispatch to,
> > but what I have seen says otherwise. For example POWER8 Core Microarchitecture whitepaper says
> >
> > "In ST mode, the two physical copies of the GPR and VSR have identical
> > contents. Instructions from the thread can be dispatched to either one of the
> > UniQueue halves (UQ0 or UQ1). Load balance across the two UniQueue
> > halves is maintained by dispatching alternate instructions of a given type to
> > alternating UniQueue halves."
> >
> > It chooses instructions only from one group per cycle, but there
> > is no notion of sub groups that go to each issue queue.
> >
> > > If I recall correctly a group always ends on a branch even in ST mode, could be that it just
> > > creates two sub groups then that can't dispatch in a single cycle. I'd have to look it up.
> > >
> > > > >
> > > > > > In SMT mode, presumably registers operate independently and rename structures operate in halves, but
> > > > > > nothing is bought for ST because it must still have the capacity to rename full dispatch width.
> > > > > >
> > > > >
> > > > > In SMT mode the slices work independently. Each got its own PRFs, its own renamer
> > > > > ("dispatcher"), its own scheduler ("issue queue") and its own execution units.
> > > > > Some special function stuff, the branch unit and front end are shared.
> > > >
> > > > Right. Nothing to do with grouping though. A non-grouping microarchitecture could do exactly the
> > > > same thing splitting the core in SMT and operating register files and renamers independently.
> > > >
> > >
> > > Yes in SMT, but not in ST.
> >
> > How does POWER8 with groups allow it in ST?
> >
>
> Grouping restrictions still apply in ST, that's the point.
The point is illogical because grouping restrictions do not allow slices to work independently in ST mode. Grouping does not buy anything.
>
> > > Do you honestly believe that IBM can
> > > just easily rename 8 instructions in a single cycle at 5 GHz?
> >
> > It can rename GPRs for 6 instructions per cycle.
> >
>
> And Intel is still at 4 on 14nm if you want to count like that. Seems strange, doesn't it?
No.
>
> > >
> > > > >
> > > > > In ST mode the PRFs have the same content so everything can execute on both slices. However, because
> > > > > of the way rename works, only one group can be renamed per cycle. That means if the group consists
> > > > > of only one sub group, which it does most of the time, you only get 3+1 rename.
> > > > >
> > > > > Yes on paper it's "up to" 6+2, but that limit is rarely reached.
> > > >
> > > > That's beside the point though, if the renamer *can* map 6 GPR instructions in
> > > > one cycle. That's my point. Grouping does not buy you anything of this sort.
> > > >
> > >
> > > It's not one renamer.
> >
> > What keeps the two renamers in sync when they are being used for the same architected registers in ST mode?
> >
>
> What keeps the PRFs in sync?
Answer my question first. You are saying two renamers somehow work with grouping to reduce the cost of rename. This doesn't follow.
>
> > >
> > > > >
> > > > > > When splitting the core in half, maybe the structures can operate more efficiently.
> > > > > > But that is not related to grouping but to splitting by thread.
> > > > > >
> > > > >
> > > > > Well if you want that many EUs the bypass network and all that would get extremely expensive.
> > > > > It makes perfect sense to do it this way for SMT, when the EUs are actually useful.
> > > > > For ST the overhead is huge compared to the actual minuscule speedup. Of course since
> > > > > all that hardware is already there you might as well use it, but if you were to design
> > > > > for ST performance this is probably the worst way of doing clustering.
> > > >
> > > > Nothing to do with grouping though.
> > > >
> > >
> > > Clustering and dispatch are coupled, dispatch and grouping are coupled.
> >
> > And execution is decoupled, by out of order issue queues. The out-of-order part only acts on uops.
> >
>
> How is it decoupled? One slice gets its uops from exactly one issue queue.
> Any uops in the same sub group end up in the same slice.
This assertion is wrong, though. The IBM documentation says instructions are dispatched from groups based on a load-balancing policy, not on static group formation.
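That policy reads as plain per-type alternation. A toy sketch of what the whitepaper describes (the instruction types and the sequence are invented for illustration, not taken from any IBM material):

```python
from collections import defaultdict

def dispatch_st_mode(instrs):
    """Alternate instructions of a given type between UQ0 and UQ1,
    as the whitepaper describes for ST-mode load balancing."""
    next_half = defaultdict(int)   # per-type toggle: which UQ gets the next one
    placement = []
    for op_type in instrs:
        half = next_half[op_type]
        placement.append((op_type, "UQ%d" % half))
        next_half[op_type] ^= 1    # alternate for the next instr of this type
    return placement

# Two fixed-point ops and a load land on different halves; the choice
# depends only on type history, not on any static "sub group".
assert dispatch_st_mode(["fx", "fx", "ld"]) == [
    ("fx", "UQ0"), ("fx", "UQ1"), ("ld", "UQ0")]
```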
>
> > >
> > > > >
> > > > > > Several people are asserting that grouping does something very helpful for rename
> > > > > > on POWER8. What is it exactly? Where does this information come from?
> > > > > >
> > > > >
> > > > > As mentioned above, it enables them to use two smaller renamers
> > > > > instead of one large and insanely expensive one.
> > > >
> > > > I still don't see how this follows.
> > > >
> > >
> > > See above.
> > >
> > > > >
> > > > > >
> > > > > > > And
> > > > > > > then a subgroup still doesn't have to consist of 3+1 instructions, it could be less. You could
> > > > > > > end up with 4 instructions total anyway. Hardly worth the effort, considering how many pipeline
> > > > > > > stages grouping costs. On top of that they could still do rename like that with a single PRF.
> > > > > > >
> > > > > > > And then it goes on. 1 fixed point, 1 ld/st, 1 ld, 2 FP DP and 1 vector pipeline per slice look great
> > > > > > > on paper until you realise that the issue width is 4 and they block each other
> > > > > >
> > > > > > Block each other? How? They cause cross-unit issue stalls or result hazards or something?
> > > > > >
> > > > >
> > > > > I'm not sure how bad the port restrictions are, IBM mostly shows it as one pipe each for fixed,
> > > > > FP/vector, ld and ld/st, but I don't know enough to confirm it. Issue width is definitely 4.
> > > > > Either there are some shared ports so e.g. load and the second ALU block each other or each FU got
> > > > > its own pipe with the 2 ALUs sharing one and only being able to execute one instruction every 2 cycles
> > > > > each (not all that unlikely considering the frequency). Dependent instructions seem to take 2 cycles
> > > > > so it's either caused by that or the front end weirdness, but I'm not willing to take a bet.
> > > > >
> > > > > Either way you end up with a lower average so the 3 wide dispatch is less
> > > > > of a problem, but it looks less and less like an 8 wide design. That's why
> > > > > I said actual 6 wide would be better for ST and not any more expensive.
> > > >
> > > > No it still looks like an 8 wide design because it can sustain 8 instructions per cycle
> > > > through the pipeline. "Width" terminology is a very blunt instrument, but we can say
> > > > for certain that it's not based around measuring achieved IPC on some codes.
> > > >
> > >
> > > It can sustain 8 instructions only if you have exactly 1 branch and 1 cryptography instruction.
> > > It can dispatch 2 branches to the branch unit, but it can only issue one per cycle. Same for CR.
> >
> > CR is actually condition register unit, not crypto. For some reason I thought those would
> > group as branch instructions when I wrote that, but that's probably not right.
> >
>
> Sorry, got confused for a second, it's been a while.
> I also thought CR would be fused with BR, this way it's even worse.
>
> Crypto and decimal block another port on the issue queues I think.
> Can't quite remember the details because they are shared between both slices.
> "EDIT" Didn't want to pretend I knew this, had to look it up. DF/crypto block a VSU port.
>
> > > The two slices can issue 4, but it can only dispatch 3 to each.
> > >
> > > Don't use CR? Too bad.
> > > Want more than 1 branch every 8 cycles? Too bad.
> > > Your branches aren't spaced by 7 instructions? Too bad, the group ends on a branch.
> > >
> > > In SMT it's not nearly as bad but you're still limited to 7 without CR.
> > > Due to branch spacing most code is inherently limited to about 6.
> > >
> > > Now the IPC is looking quite good, isn't it? The "theoretical maximum" is lower than you'd expect.
> >
> > Sure, but width is still not IPC.
> >
>
> Of course.
> My point was that while the IPC will obviously never be 8 that number is rather
> theoretical and simply not possible without writing very weird code.
That's what "width" has always been.
> POWER8 is still nowhere near as effective at using its width in ST as in SMT mode. That's due to
> the grouping, but only because of the grouping can it be so wide, and it is only that wide for SMT.
This assertion still isn't backed up in any way. You claim, with handwaving, that grouping helps wider decode and wider rename, but as far as I can see the reasons range from garbage to unfounded speculation.
>
> > >
> > > > >
> > > > > > > and even if they didn't
> > > > > > > the issue queue can only take 3 new instructions per cycle anyway. So you have to rely on the existence
> > > > > > > of enough 2 subgroup instructions (with enough instructions per subgroup) to get more than 3 instructions
> > > > > > > (+1 branch) issued per cycle. Even then you have to deal with all the forwarding.
> > > > > > > And now you know why POWER8 doesn't look all that hot in ST mode. Compare SMT2 with only a single thread
> > > > > > > running on it to make it fair in terms of caches with ST mode. It doesn't do all that much.
> > > > > >
> > > > > > POWER8 is clearly long in the tooth against Intel. IIRC it
> > > > > > was quite reasonable at single threaded perf against
> > > > > > similar Xeons at the time of release, but now is behind and
> > > > > > probably suffers a lot also from their much improved
> > > > > > turbo. With its high frequencies, I don't think it was ever an IPC winner there, despite its width.
> > > > > >
> > > > >
> > > > > Yes, e.g. in the Anandtech review it loses against Broadwell. In a fair comparison
> > > > > against Ivy Bridge it would've been fairly close, but it's still disappointing
> > > > > when taking into account the resources available per core.
> > > > >
> > > > > > But I don't know how much you can attribute to decode/dispatch restrictions or functional
> > > > > > unit issue limitations, because it really made very big gains in SMT mode to the
> > > > > > point where aggregate IPC should have been quite good at high clocks.
> > > > > >
> > > > >
> > > > > In ST it just can't really use all its execution resources due to
> > > > > these restrictions. In SMT, which it was built for, it's great.
> > > >
> > > > It was built for both.
> > > >
> > >
> > > Yes and no. The slices were a perfectly reasonable way of getting the width they wanted/needed for SMT.
> > > The weirdness on top of it was a reasonable way to make all the hardware
> > > they already needed for SMT at least somewhat useful in ST.
> > > But you wouldn't build it like that if the emphasis was
> > > on ST. Slightly narrower/fewer, but shared ressources
> > > would've been better for ST, but worse for SMT. If you get good enough ST performance this way why bother
> > > redesigning everything and handicapping SMT instead of "simply" improving upon POWER7 like they did?
> >
> > Yes and yes. You also wouldn't build it like this if there was no
> > interest in ST or no benefit for ST performance with this mode.
> >
>
> You seem to be ignoring the point.
> If there was no interest in ST but still in SMT then it would've been built exactly like this.
> Only the parts to make both slices available to a single thread would've been left out.
>
> Whereas if there was no interest in SMT (or at least not SMT8)
> then it sure as hell wouldn't have been built like this.
Huh? That's exactly my point. It is obviously built for both.
>
> > >
> > > > >
> > > > > > I think it's more likely that there is just very diminishing returns of such width for ST, combined
> > > > > > with weaknesses like bubbles in dependent ALU ops (IIRC this was improved but still had a bubble
> > > > > > somewhere, maybe had a cycle forwarding between halves), longer mispredict, longer and I think
> > > > > > more restrictive store forwarding, and all the other things that Intel does so well.
> > > > > >
> > > > >
> > > > > Yeah, it's either weird front end stuff and/or forwarding where that extra cycle comes from or 2 cycle ALUs.
> > > > >
> > > > > > >
> > > > > > > At this point one might think instruction grouping isn't really worth
> > > > > > > the effort and a slightly wider single slice might be better.
> > > > > > > Guess what they did on POWER9? Instruction grouping gone and all the pipeline stages for it.
> > > > > >
> > > > > > Well the POWER9 pipeline looks like a complete redesign. According to hotchips, it's 3 cycles
> > > > > > shorter pipeline before rename. POWER8 explicitly takes 2 stages for group formation. But that
> > > > > > seems hard to quantify exactly. Presumably grouping is done to make subsequent things easier.
> > > > > >
> > > > > > > They
> > > > > > > sacrificed 8 wide decode for that,
> > > > > >
> > > > > > Seems unlikely that removal of groups makes wide decode more difficult.
> > > > > >
> > > > >
> > > > > Well, grouping does some decode work; you can't drop those cycles without either losing
> > > > > that work or doing it in the other decode stages,
> > > > > which IBM seems to have decided against.
> > > >
> > > > Still doesn't follow. If that was indeed the case, then they did not "sacrifice" 8 way rename
> > > > for dropping of groups, they sacrificed it for making a shorter decode pipeline.
> > > >
> > >
> > > Should I have been more clear?
> > > Dropping grouping is not a goal in itself.
> > > They wanted a short pipeline so they dropped grouping. No grouping means more difficult decode.
> >
> > Completely disagree. You assert this and then you handwave about how grouping
> > "does some decode work" to justify it, which is just circular logic.
> >
> > > So either you add a pipeline stage, which contradicts your goal, or you accept narrower decode.
> >
> > You add a pipeline stage to make up for this alleged work done by grouping, after removing at
> > least 2 pipeline stages of it, and that somehow means that grouping made decode easier?
> >
>
> Grouping is based on instruction types.
> So grouping is done before decode and somehow it magically knows the instruction type already?
What are you talking about? Please don't try to make statements with rhetorical questions.
Grouping is done before the decode stages in the official pipeline description, yes. There is some predecode or early decode that helps branch prediction and group formation, so of course there is some amount of decode before that. It is trivial to find the fixed instruction types of PowerPC instructions; real decoding is turning those into uops.
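For example, spotting branches in a fixed-width ISA takes nothing more than the primary opcode field, the top six bits of each 32-bit word. A minimal sketch (the opcode values 18, 16 and 19 are the architected Power ISA primary opcodes for b, bc and the bclr/bcctr family; the rest is illustrative):

```python
def primary_opcode(word):
    """The primary opcode is the top six bits of a 32-bit Power instruction."""
    return (word >> 26) & 0x3F

# Architected Power ISA primary opcodes: 18 = b, 16 = bc, 19 = bclr/bcctr family.
BRANCH_OPCODES = {18, 16, 19}

def is_branch(word):
    return primary_opcode(word) in BRANCH_OPCODES

b_insn = (18 << 26) | 0x100    # an unconditional branch
add_insn = 31 << 26            # opcode 31 = the fixed-point XO-form ops
assert is_branch(b_insn)
assert not is_branch(add_insn)
```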
> Of course decoding gets easier when you know which instruction
> type to expect, branches are in fixed positions and so on.
None of that is enabled by groups. Quite possibly POWER9 does predecoding and keeps some metadata for branch prediction.
> Seriously, take the last slot as an example. What do you think is simpler, a decoder that must be
> able to decode any instruction or a decoder that only has to decode either a branch or nothing?
That's not an example, it's handwaving. The same work has to be done in the pipeline.
>
> Yes, most of the work done in the grouping stages is for grouping. It still helped decode a bit.
Still a baseless assertion.
> So the choice is either longer logical effort, meaning longer cycle time,
> or more stages for decode (although fewer than with grouping), or narrower decode.
> Why bother with the first two when 8 wide decode isn't needed anymore?
Your question is based on baseless assertions.
>
> > > Since they went with actual 6 wide rename instead of "8 but it's
> > > mostly 6" the 8 wide decode became a lot less useful anyway.
> >
> > No, it's 6 for GPRs, and one for condition registers. Well, two can play at such assertions,
> > so I'll say no, it's due to improved efficiency from less grouping and cracking. Possibly
> > also due to going generally a bit narrower and reducing SMT emphasis per core.
> >
>
> Do you mean POWER8 or 9?
8.
> 9 should be 6 including max 2 branches, 8 is 6
> + 2 branch or condition, so I'm not seeing 1 cr rename there either.
> Either way, what good would 8 wide decode do when you can only rename 6 anyway?
Err, are we on the same planet? It is 8-wide decode in order to decode 6 non-branch and 2 branch instructions. Branch instructions use a single renamed CR input of course, but given that CR goes to a separate issue queue from the 2 unified queues with their duplicated registers, and has explicit condition register moves and high latencies, it is almost certainly a different register file and renamer, so it is not involved with the 6-wide GPR rename.
> Makes sense that they dropped it.
Really? Last you said they "sacrificed" it as a necessary part of getting rid of grouping.
> anon2 (anon.delete@this.anon.com) on April 27, 2017 5:57 am wrote:
> > anon (spam.delete.delete@this.this.spam.com) on April 27, 2017 4:08 am wrote:
> > > anon2 (anon.delete@this.anon.com) on April 26, 2017 6:00 pm wrote:
> > > > anon (spam.delete.delete@this.this.spam.com) on April 26, 2017 6:13 am wrote:
> > > > > anon2 (anon.delete@this.anon.com) on April 26, 2017 5:17 am wrote:
> > > > > > anon (spam.delete.delete@this.this.spam.com) on April 25, 2017 5:43 pm wrote:
> > > > > > > Paul A. Clayton (paaronclayton.delete@this.gmail.com) on April 24, 2017 9:34 am wrote:
> > > > > > > > anon (spam.delete.delete@this.this.spam.com) on April 24, 2017 1:00 am wrote:
> [snip]
> > > > > > > So if they get a group that does consist of 2 sub groups (it doesn't have to) then they can do
> > > > > > > rename in parallel. Now that obviously doesn't work if the sub groups depend on each other.
> > > > > >
> > > > > > What do you mean by a sub-group? POWER8 dispatches 6+2 groups in ST mode and 2*(3+1)
> > > > > > groups in SMT mode. SMT half groups do not depend on each other by definition.
> > > > > >
> > > > >
> > > > > The GCT (IBM's name for their weird version of an ROB) handles groups instead of single
> > > > > instructions. Each group (max 6+2 instr) consists of 1 or 2 sub groups (max 3+1 instr).
> > > > > You can get 2 subgroup groups even in SMT mode. Since they have to dispatch into the same
> > > > > slice they obviously need 2 cycles, but they only consume one entry in the GCT.
> > > >
> > > > Okay I never heard of this sub group idea, but for ST, it does not seems like a group or sub group is
> > > > restricted according to any GPR input or output, so I don't see how it looks different to rename.
> > > >
> > >
> > > One group dispatches in a single cycle in ST mode. Therefore not all instructions in it can
> > > be dispatched into the same slice.
> > > And that's when the fun starts. You could either wait
> > > for the result forwarding and kill dependency chains with latency or restrict the grouping.
> > > This ties into the ALU latency/issue ports problem, I can't rule out either option (forwarding
> > > without grouping restrictions or 2 cycle ALU with restrictions) for certain.
> >
> > So I still don't know how that looks different for rename. You have a mix of GPR inputs
> > and outputs being sent to each slice. How does that simplify rename for you?
> >
>
> Would you send instructions that depend on each other to different
> slices as much as possible because you like forwarding so much?
> Obviously not.
> If they don't you can rename in parallel.
1. "as much as possible"? Huh? Instructions go to slices not based on groups. Grouping does not solve anything.
2. Renaming still has to be synchronized otherwise how do instructions in the next cycle know how registers are mapped?
> > You say like these "sub groups" are the basis of how dispatch chooses which UQ to dispatch to,
> > but what I have seen says otherwise. For example POWER8 Core Microarchitecture whitepaper says
> >
> > "In ST mode, the two physical copies of the GPR and VSR have identical
> > contents. Instructions from the thread can be dispatched to either one of the
> > UniQueue halves (UQ0 or UQ1). Load balance across the two UniQueue
> > halves is maintained by dispatching alternate instructions of a given type to
> > alternating UniQueue halves."
> >
> > It chooses instructions only from one group per cycle, but there
> > is no idea of sub groups that go to each issue queue.
> >
> > > If I recall correctly a group always ends on a branch even in ST mode, could be that it just
> > > creates two sub groups then that can't dispatch in a single cycle. I'd have to look it up.
> > >
> > > > >
> > > > > > In SMT mode, presumably registers operate independently and rename structures operate in halves, but
> > > > > > nothing is bought for ST because it must still have the capacity to rename full dispatch width.
> > > > > >
> > > > >
> > > > > In SMT mode the slices work independently. Each got its own PRFs, its own renamer
> > > > > ("dispatcher"), its own scheduler ("issue queue") and its own execution units.
> > > > > Some special function stuff, the branch unit and front end are shared.
> > > >
> > > > Right. Nothing to do with grouping though. A non-grouping microarchitecture could do exactly the
> > > > same thing splitting the core in SMT and operating register files and renamers independently.
> > > >
> > >
> > > Yes in SMT, but not in ST.
> >
> > How does POWER8 with groups allow it in ST?
> >
>
> Grouping restrictions still apply in ST, that's the point.
The point is illogical because grouping restrictions do not allow slices to work independently in ST mode. Grouping does not buy anything.
>
> > > Do you honestly believe that IBM can
> > > just easily rename 8 instructions in a single cycle at 5 GHz?
> >
> > It can rename GPRs for 6 instructions per cycle.
> >
>
> And Intel is still at 4 on 14nm if you want to count like that. Seems strange, doesn't it?
No.
>
> > >
> > > > >
> > > > > In ST mode the PRFs have the same content so everything can execute on both slices. However because
> > > > > of the way how rename works only one group can be renamed per cycle. That means if the group consists
> > > > > of only one sub group, which it does most of the time, you only get 3+1 rename.
> > > > >
> > > > > Yes on paper it's "up to" 6+2, but that limit is rarely reached.
> > > >
> > > > That's besides the point though, if the renamer *can* map 6 GPR instructions in
> > > > one cycle. That's my point. Grouping does not buy you anything of this sort.
> > > >
> > >
> > > It's not one renamer.
> >
> > What keeps the two renamers in synch when they are being used for the same architected registers in ST mode?
> >
>
> What keeps the PRFs in sync?
Answer my question first. You are saying two renamers somehow work with grouping to reduce cost of rename. This doesn't follow.
>
> > >
> > > > >
> > > > > > When splitting the core in half, maybe the structures can operate more efficiently.
> > > > > > But that is not related to grouping but to splitting by thread.
> > > > > >
> > > > >
> > > > > Well if you want that many EUs the bypass network and all that would get extremely expensive.
> > > > > It makes perfect to do it this way for SMT, when the EUs are actually useful.
> > > > > For ST the overhead is huge compared to the actual miniscule speedup. Of course since
> > > > > all that hardware is already there you might as well use it, but if you were to design
> > > > > for ST performance this is probably the worst way of doing clustering.
> > > >
> > > > Nothing to do with grouping though.
> > > >
> > >
> > > Clustering and dispatch are coupled, dispatch and grouping are coupled.
> >
> > And execution is decoupled, by out of order issue queues. Out of order part only acts on uops.
> >
>
> How is it decoupled? One slice gets its uops from exactly one issue queue.
> Any uops in the same sub group end up in the same slice.
This assertion is wrong though. The IBM documentation says instructions are dispatched from groups based on some other policy, not a static group formation.
>
> > >
> > > > >
> > > > > > Several people are asserting that grouping does something very helpful for rename
> > > > > > on POWER8. What is it exactly? Where does this information come from?
> > > > > >
> > > > >
> > > > > As mentioned above, it enables them to use two smaller renamers
> > > > > instead of one large and insanely expensive one.
> > > >
> > > > I still don't see how this follows.
> > > >
> > >
> > > See above.
> > >
> > > > >
> > > > > >
> > > > > > > And
> > > > > > > then a subgroup still doesn't have to consist of 3+1 instructions, it could be less. You could
> > > > > > > end up with 4 instructions total anyway. Hardly worth the effort, considering how many pipeline
> > > > > > > stages grouping costs. On top of that they could still do rename like that with a single PRF.
> > > > > > >
> > > > > > > And then it goes on. 1 fixed point, 1 ld/st, 1 ld, 2 FP DP and 1 vector pipeline per slice look great
> > > > > > > on paper until you realise that the issue width is 4 and they block each other
> > > > > >
> > > > > > Block each other? How? They cause cross-unit issue stalls or result hazards or something?
> > > > > >
> > > > >
> > > > > I'm not sure how bad the port restrictions are, IBM mostly shows it as one pipe each for fixed,
> > > > > FP/vector, ld and ld/st, but I don't know enough to confirm it. Issue width is definitely 4.
> > > > > Either there are some shared ports so e.g. load and the second ALU block each other or each FU got
> > > > > its own pipe with the 2 ALUs sharing one and only being able to execute one instruction every 2 cycles
> > > > > each (not all that unlikely considering the frequency). Dependent instructions seem to take 2 cycles
> > > > > so it's either caused by that or the front end weirdness, but I'm not willing to take a bet.
> > > > >
> > > > > Either way you end up with a lower average so the 3 wide dispatch is less
> > > > > of a problem, but it looks less and less like an 8 wide design. That's why
> > > > > I said actual 6 wide would be better for ST and not any more expensive.
> > > >
> > > > No it still looks like an 8 wide design because it can sustain 8 instructions per cycle
> > > > through the pipeline. "Width" terminology is a very blunt instrument, but we can say
> > > > for certain that it's not based around measuring achieved IPC on some codes.
> > > >
> > >
> > > It can sustain 8 instruction only if you have exactly 1 branch and 1 cryptography instruction.
> > > It can dispatch 2 branches to the branch unit, but it can only issue one per cycle. Same for CR.
> >
> > CR is actually condition register unit, not crypto. For some reason I thought those would
> > group as branch instructions when I wrote that, but that's probably not right.
> >
>
> Sorry, got confused for a second, it's been a while.
> I also thought CR would be fused with BR, this way it's even worse.
>
> Crypto and decimal block another port on the issue queues I think.
> Can't quite remember the details because they are shared between both slices.
> "EDIT" Didn't want to pretend I knew this, had to look it up. DF/crypto block a VSU port.
>
> > > The two slices can issue 4, but it can only dispatch 3 to each.
> > >
> > > Don't use CR? Too bad.
> > > Want more than 1 branch every 8 cycles? Too bad.
> > > Your branches aren't space by 7 instructions? Too bad, the group ends on a branch.
> > >
> > > In SMT it's not nearly as bad but you're still limited to 7 without CR.
> > > Due to branch spacing most code is inherently limited to about 6.
> > >
> > > Now the IPC is looking quite good, isn't it? The "theoretical maximum" is lower than you'd expect.
> >
> > Sure, but width is still not IPC.
> >
>
> Of course.
> My point was that while the IPC will obviously never be 8 that number is rather
> theoretical and simply not possible without writing very weird code.
That's what "width" has always been.
> POWER8 is still nowhere near as effective at using its width in ST as in SMT mode. That's due to
> the grouping but only because of the grouping it can be so wide and only for SMT it is so wide.
This assertion is still not backed up anyhow. You handwavingly claim that grouping helps wider decode and wider rename, but as far as I can see, the reasons range from garbage to unfounded speculation.
>
> > >
> > > > >
> > > > > > > and even if they didn't
> > > > > > > the issue queue can only take 3 new instructions per cycle anyway. So you have to rely on the existence
> > > > > > > of enough 2 subgroup instructions (with enough instructions per subgroup) to get more than 3 instructions
> > > > > > > (+1 branch) issued per cycle. Even then you have to deal with all the forwarding.
> > > > > > > And now you know why POWER8 doesn't look all that hot in ST mode. Compare SMT2 with only a single thread
> > > > > > > running on it to make it fair in terms of caches with ST mode. It doesn't do all that much.
> > > > > >
> > > > > > POWER8 is clearly long in the tooth against Intel. IIRC it
> > > > > > was quite reasonable at single threaded perf against
> > > > > > similar Xeons at the time of release, but now is behind and
> > > > > > probably suffers a lot also from their much improved
> > > > > > turbo. With its high frequencies, I don't think it was ever an IPC winner there, despite its width.
> > > > > >
> > > > >
> > > > > Yes, e.g. in the Anandtech review it loses against Broadwell. In a fair comparison
> > > > > against Ivy Bridge it would've been fairly close, but it's still disappointing
> > > > > when taking into account the resources available per core.
> > > > >
> > > > > > But I don't know how much you can attribute to decode/dispatch restrictions or functional
> > > > > > unit issue limitations, because it really made very big gains in SMT mode to the
> > > > > > point where aggregate IPC should have been quite good at high clocks.
> > > > > >
> > > > >
> > > > > In ST it just can't really use all its execution resources due to
> > > > > these restrictions. In SMT, which it was built for, it's great.
> > > >
> > > > It was built for both.
> > > >
> > >
> > > Yes and no. The slices were a perfectly reasonable way of getting the width they wanted/needed for SMT.
> > > The weirdness on top of it was a reasonable way to make all the hardware
> > > they already needed for SMT at least somewhat useful in ST.
> > > But you wouldn't build it like that if the emphasis was
> > > on ST. Slightly narrower/fewer, but shared, resources
> > > would've been better for ST, but worse for SMT. If you get good enough ST performance this way, why bother
> > > redesigning everything and handicapping SMT instead of "simply" improving upon POWER7 like they did?
> >
> > Yes and yes. You also wouldn't build it like this if there was no
> > interest in ST or no benefit for ST performance with this mode.
> >
>
> You seem to be ignoring the point.
> If there was no interest in ST but still in SMT then it would've been built exactly like this.
> Only the parts to make both slices available to a single thread would've been left out.
>
> Whereas if there was no interest in SMT (or at least not SMT8)
> then it sure as hell wouldn't have been built like this.
Huh? That's exactly my point. It is obviously built for both.
>
> > >
> > > > >
> > > > > > I think it's more likely that there is just very diminishing returns of such width for ST, combined
> > > > > > with weaknesses like bubbles in dependent ALU ops (IIRC this was improved but still had a bubble
> > > > > > somewhere, maybe had a cycle forwarding between halves), longer mispredict, longer and I think
> > > > > > more restrictive store forwarding, and all the other things that Intel does so well.
> > > > > >
> > > > >
> > > > > Yeah, that extra cycle comes from either weird front end stuff and/or forwarding, or from 2-cycle ALUs.
> > > > >
> > > > > > >
> > > > > > > At this point one might think instruction grouping isn't really worth
> > > > > > > the effort and a slightly wider single slice might be better.
> > > > > > > Guess what they did on POWER9? Instruction grouping gone and all the pipeline stages for it.
> > > > > >
> > > > > > Well the POWER9 pipeline looks like a complete redesign. According to hotchips, it's 3 cycles
> > > > > > shorter pipeline before rename. POWER8 explicitly takes 2 stages for group formation. But that
> > > > > > seems hard to quantify exactly. Presumably grouping is done to make subsequent things easier.
> > > > > >
> > > > > > > They
> > > > > > > sacrificed 8 wide decode for that,
> > > > > >
> > > > > > Seems unlikely that removal of groups makes wide decode more difficult.
> > > > > >
> > > > >
> > > > > Well, grouping does some decode work; you can't drop those cycles without either losing
> > > > > that work or doing it in the other decode stages,
> > > > > which IBM seems to have decided against.
> > > >
> > > > Still doesn't follow. If that was indeed the case, then they did not "sacrifice" 8 way rename
> > > > for dropping of groups, they sacrificed it for making a shorter decode pipeline.
> > > >
> > >
> > > Should I have been more clear?
> > > Dropping grouping is not a goal in itself.
> > > They wanted a short pipeline so they dropped grouping. No grouping means more difficult decode.
> >
> > Completely disagree. You assert this and then you handwave about how grouping
> > "does some decode work" to justify it, which is just circular logic.
> >
> > > So either you add a pipeline stage, which contradicts your goal, or you accept narrower decode.
> >
> > You add a pipeline stage to make up for this alleged work done by grouping which you removed
> > at least 2 pipeline stages from, and that somehow means that grouping made decode easier?
> >
>
> Grouping is based on instruction types.
> So grouping is done before decode and somehow it magically knows the instruction type already?
What are you talking about? Please don't try to make statements with rhetorical questions.
Grouping is done before the decode stages, per the official pipeline description, yes. There is some predecode or early decode that helps branch prediction and group formation, so of course some amount of decode happens before that. It's trivial to find the fixed instruction-type fields of PowerPC instructions; the real decoding is turning those into uops.
> Of course decoding gets easier when you know which instruction
> type to expect, branches are in fixed positions and so on.
None of that is enabled by groups. Quite possibly POWER9 also does predecoding and keeps some metadata for branch prediction.
> Seriously, take the last slot as an example. What do you think is simpler, a decoder that must be
> able to decode any instruction, or a decoder that only has to decode either a branch or nothing?
That's not an example, it's handwaving. The same work still has to be done somewhere in the pipeline.
>
> Yes, most of the work done in the grouping stages is for grouping. It still helped decode a bit.
Still a baseless assertion.
> So the choice is either longer logical effort, meaning longer cycle time,
> or more stages for decode, although fewer than with grouping, or narrower decode.
> Why bother with the first two when 8 wide decode isn't needed anymore?
Your question is based on baseless assertions.
>
> > > Since they went with actual 6 wide rename instead of "8 but it's
> > > mostly 6" the 8 wide decode became a lot less useful anyway.
> >
> > No, it's 6 for GPRs, and one for condition registers. Well, two can play at such assertions,
> > so I'll say no, it's due to improved efficiency from less grouping and cracking. Possibly
> > also due to going generally a bit narrower and reducing the SMT emphasis per core.
> >
>
> Do you mean POWER8 or 9?
8.
> 9 should be 6 including max 2 branches, 8 is 6
> + 2 branch or condition, so I'm not seeing 1 cr rename there either.
> Either way, what good would 8 wide decode do when you can only rename 6 anyway?
Err, are we on the same planet? It is 8-wide decode to decode 6 non-branch and 2 branch instructions. Branch instructions use a single input of a renamed CR of course, but given that they go to a separate issue queue from the 2 unified queues with their duplicated registers, and given the explicit condition register moves and high latencies, it is almost certainly a separate register file and renamer, so not involved with the 6-wide GPR rename.
> Makes sense that they dropped it.
Really? Last you said they "sacrificed" it as a necessary part of getting rid of grouping.