By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), April 24, 2017 9:34 am
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on April 24, 2017 1:00 am wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on April 23, 2017 6:58 pm wrote:
> > Alpha 21264 (duplicated GPRs) was utter insane?! (I think not)
>
> I cut that sentence a bit.
> It's insane to use it to get wider rename. Obviously 80 registers are cheaper than 200,
> but nonetheless they didn't do it to get wider rename, they wanted lower latencies and
> fewer ports on the register files, which does offset the cost of duplicating it.
>
> Duplicating the PRFs but keeping the number of ports the same would be insane.
>
> Clustering is viable, I mean that's exactly what IBM is doing with the POWER9.
I am not familiar with the POWER8 microarchitecture. (Sadly for mere enthusiasts, IBM has chosen to put its Journal of Research and Development behind a paywall. I have also been doing a lot less reading, and my to-be-read pile is becoming depressingly huge, and I am a slow reader!) I would assume that the advantages of a clustered design were exploited in POWER8. Clustering renaming and scheduling could have similar advantages and would fit well with partitioning for multithreading.
If replication would provide a power/area/latency benefit, exploiting its presence for other mechanisms (to reduce the isolated implementation cost or increase benefits) seems reasonable (not "insane") even at a modest extra cost.
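As a rough illustration of the port-count argument in the quoted text, here is a minimal sketch assuming a first-order model in which register file cell area grows with the square of the total port count (each port adds roughly one wordline and one bitline). The function name, the model, and the port counts (a unified 80-entry file with 8 read and 6 write ports versus two 80-entry copies with 4 read and 6 write ports each) are assumptions for illustration, not figures from this thread.

```python
def regfile_area(entries, read_ports, write_ports, copies=1):
    """Relative area under an assumed first-order model.

    Each storage cell is taken to grow in both dimensions with the
    number of ports, so cell area scales as (total ports) ** 2.
    Units are arbitrary; only ratios are meaningful.
    """
    ports = read_ports + write_ports
    return copies * entries * ports ** 2


# Hypothetical configurations (port counts are assumptions).
unified = regfile_area(entries=80, read_ports=8, write_ports=6)
duplicated = regfile_area(entries=80, read_ports=4, write_ports=6, copies=2)
one_copy = regfile_area(entries=80, read_ports=4, write_ports=6)

ratio = duplicated / unified
print(f"unified:    {unified}")
print(f"duplicated: {duplicated} (each copy {one_copy})")
print(f"total area ratio duplicated/unified: {ratio:.3f}")
```

Under these assumptions the duplicated arrangement costs about 2% more total area than the unified one, while each copy is roughly half the size of the unified file, which is where a latency benefit would come from. Wire delay, bypass networks, and circuit style are all ignored here, so this is only a sanity check on the "duplication can pay for itself" argument.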
Side note: In terms of clustering, it is a bit disappointing that such is not extended to caches and assisted by software (for both registers and caches). Even with modest compiler changes and no changes in other software infrastructure, there is probably enough regularity in data use for software to provide some assistance in data location. The fact that cache banking seems to work so well hints that more extensive partitioning might not merely be practical but potentially beneficial.
Not being a hardware designer, I do not know how helpful cache clustering (with minimal or no replication) would be in terms of floorplan. Obviously reducing the physical distance in critical loops is desirable, but if such forces less critical loops to have a much greater latency penalty (or substantially hinders manufacturability or bloats area) then it might not be beneficial in that larger context.
Given that intra-block NUCA for largish L1 caches (where latency of, e.g., a quarter of a cache block is lower) has not been implemented, there are presumably reasons that such optimizations are rejected. I think such would provide some benefit even without software changes and would not necessarily hurt in the more general context (though it would constrain other implementation choices, e.g., way hit information/prediction would need to be available earlier). AMD's Bulldozer cache design seems ill-conceived (small L1 data caches with largish L2 caches with uniform latency between cores) to an outsider with almost no knowledge of hardware design, but it was presumably a conscious choice by professional hardware designers given a variety of constraints. Sadly, "What were they thinking?" (or "Why was this alternative rejected?") is rarely publicly documented, especially when choices do not work out well. (No one seems to have the free time (or permission?) to publish such information even years later. Although most of the lessons would just be common project management wisdom, fresh anecdotes would be entertaining and some technical education would be provided. This is a significant disadvantage of having an uncommon interest in a highly professionalized and competitive area. Popular interests can fund low-cost edu-tainment. Less popular but accessible topics at least have lower cost and skill requirements for developing documentation. Competition discourages sharing of information, since sharing can reduce competitive advantages, so some of the practical details of processor (and computer) architecture are less available. Perhaps I can take some comfort in that interest in processor architecture is still probably more easily satisfied than an interest in particle physics.☺)