By: anon2 (anon.delete@this.anon.com), April 26, 2017 5:17 am
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on April 25, 2017 5:43 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on April 24, 2017 9:34 am wrote:
> > anon (spam.delete.delete@this.this.spam.com) on April 24, 2017 1:00 am wrote:
> > > Paul A. Clayton (paaronclayton.delete@this.gmail.com) on April 23, 2017 6:58 pm wrote:
> > > > Alpha 21264 (duplicated GPRs) was utter insane?! (I think not)
> > >
> > > I cut that sentence a bit.
> > > It's insane to use it to get wider rename. Obviously 80 registers are cheaper than 200,
> > > but nonetheless they didn't do it to get wider rename, they wanted lower latencies and
> > > fewer ports on the register files, which does offset the cost of duplicating it.
> > >
> > > Duplicating the PRFs but keeping the number of ports the same would be insane.
> > >
> > > Clustering is viable, I mean that's exactly what IBM is doing with the POWER9.
> >
> > I am not familiar with the POWER8 microarchitecture. (Sadly for mere enthusiasts, IBM has chosen to
> > put its Journal of Research and Development behind a pay wall. I have also been doing a lot less reading
> > and my to-be-read pile is becoming depressingly huge (and I am a slow reader!).) I would assume that
> > the advantages of a clustered design were exploited in POWER8. Clustering renaming and scheduling could
> > have similar advantages as well as fitting well with partitioning for multithreading.
> >
>
> I'm just going to put this here, don't feel like splitting up the thread even more.
>
> My problem with POWER8 is that this clustering really isn't for ST.
> Duplicating the PRF does absolutely nothing for rename. They get that from instruction grouping.
Get what?
> So if they get a group that does consist of 2 sub groups (it doesn't have to) then they can do
> rename in parallel. Now that obviously doesn't work if the sub groups depend on each other.
What do you mean by a sub-group? POWER8 dispatches one 6+2 group (up to six non-branch plus two branch instructions) in ST mode and two 3+1 groups in SMT mode. SMT half-groups do not depend on each other by definition.
In SMT mode, presumably the registers operate independently and the rename structures operate in halves, but nothing is gained for ST because the core must still be able to rename the full dispatch width.
When splitting the core in half, maybe the structures can operate more efficiently. But that comes from splitting by thread, not from grouping.
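To put rough numbers on that, here is a minimal sketch in Python. It assumes a deliberately simple model in which rename compares each instruction against every older instruction in the same dispatch group and never against another thread's group. The class and field names are made up for illustration and this is not a description of IBM's actual rename logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchShape:
    """Per-cycle dispatch shape; field names are illustrative, not IBM terminology."""
    groups: int       # groups dispatched per cycle
    non_branch: int   # non-branch slots per group
    branch: int       # branch slots per group

    @property
    def group_size(self) -> int:
        return self.non_branch + self.branch

    @property
    def width(self) -> int:
        # Total instructions that rename must accept per cycle.
        return self.groups * self.group_size

    @property
    def intra_group_pairs(self) -> int:
        # Assumption: each instruction is checked against every older
        # instruction in its own group, and never against another group.
        n = self.group_size
        return self.groups * n * (n - 1) // 2


ST = DispatchShape(groups=1, non_branch=6, branch=2)
SMT = DispatchShape(groups=2, non_branch=3, branch=1)

for name, shape in (("ST", ST), ("SMT", SMT)):
    print(name, "width:", shape.width, "pairs:", shape.intra_group_pairs)
```

Under this model both modes are 8 wide, but ST needs 28 pairwise checks against 12 for SMT. The saving comes purely from the two halves belonging to different threads, which is the point above: it says nothing about grouping helping ST rename.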
Several people are asserting that grouping does something very helpful for rename on POWER8. What is it exactly? Where does this information come from?
> And
> then a subgroup still doesn't have to consist of 3+1 instructions, it could be less. You could
> end up with 4 instructions total anyway. Hardly worth the effort, considering how many pipeline
> stages grouping costs. On top of that they could still do rename like that with a single PRF.
>
> And then it goes on. 1 fixed point, 1 ld/st, 1 ld, 2 FP DP and 1 vector pipeline per slice look great
> on paper until you realise that the issue width is 4 and they block each other
Block each other? How? Do they cause cross-unit issue stalls, result hazards, or something else?
> and even if they didn't
> the issue queue can only take 3 new instructions per cycle anyway. So you have to rely on the existence
> of enough 2 subgroup instructions (with enough instructions per subgroup) to get more than 3 instructions
> (+1 branch) issued per cycle. Even then you have to deal with all the forwarding.
> And now you know why POWER8 doesn't look all that hot in ST mode. Compare SMT2 with only a single thread
> running on it to make it fair in terms of caches with ST mode. It doesn't do all that much.
POWER8 is clearly long in the tooth against Intel. IIRC it was quite reasonable at single-threaded perf against similar Xeons at the time of release, but it is now behind and probably also suffers a lot from Intel's much improved turbo. With its high frequencies, I don't think it was ever an IPC winner there, despite its width.
But I don't know how much of that you can attribute to decode/dispatch restrictions or functional unit issue limitations, because it made very big gains in SMT mode, to the point where aggregate IPC should have been quite good at high clocks.
I think it's more likely that there are just sharply diminishing returns from such width for ST, combined with weaknesses like bubbles in dependent ALU ops (IIRC this was improved but still had a bubble somewhere, maybe a cycle for forwarding between halves), a longer mispredict penalty, longer and I think more restrictive store forwarding, and all the other things that Intel does so well.
>
> At this point one might think instruction grouping isn't really worth
> the effort and a slightly wider single slice might be better.
> Guess what they did on POWER9? Instruction grouping gone and all the pipeline stages for it.
Well, the POWER9 pipeline looks like a complete redesign. According to the Hot Chips presentation, the pipeline is 3 cycles shorter before rename. POWER8 explicitly takes 2 stages for group formation, but the exact cost of grouping seems hard to quantify. Presumably grouping is done to make subsequent things easier.
> They
> sacrificed 8 wide decode for that,
Seems unlikely that removal of groups makes wide decode more difficult.
[snip]
> On that note apparently every company has to try their hand at a speed demon,
> fail horribly and then deliver a great architecture on the next try. It seems
> strange that Intel, IBM and AMD all did the same in such a short timeframe.
I don't, but I doubt they just one day decided they should try such a thing. More likely, they reached similar conclusions based on similar data and technology projections.