By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), June 16, 2022 1:04 pm
Room: Moderated Discussions
Mark Roulo (nothanks.delete@this.xxx.com) on June 14, 2022 1:56 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on June 14, 2022 12:27 pm wrote:
[snip]
>> Redundancy can provide reasonable yields even with somewhat
>> high defect rates. Even for memory arrays, the locality
>> of defects/extreme variation can bias whether row/column spares or array spares are more attractive.
[snip]
> In theory one can make a design/implementation which is AMAZINGLY robust
> against manufacturing defects. Triply (or more) redundant everything.
>
> Though this would likely lead to worse performance.
Even the much more modest redundancy I mentioned (N+1 for many structures beyond memory arrays, possibly even N+2) would have had performance costs, and I also mentioned the possibility of sacrificing performance and power for manufacturability: increasing area will increase communication latency and the opportunity for clock skew (and possibly jitter?). Even the design effort will cost performance.
(Margin for variability also impacts yield. Frequency binning is one technique available, but — if I understand correctly — low- and intermediate-level design can increase tolerance for variability at the cost of area, power, performance, and design effort. Similarly, performance et al. can be tweaked by sacrificing yield. With high mask set costs, a low-volume product might generate more profit with higher performance but lower yield.)
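As a rough illustration of the kind of tradeoff I mean, here is a small sketch using a simple Poisson defect model; the block sizes and defect density are made up and not meant to reflect any actual process:

    import math

    def yield_no_spare(n_blocks, block_area_cm2, d0_per_cm2):
        # Poisson model: every one of n_blocks must be defect-free
        p_good = math.exp(-block_area_cm2 * d0_per_cm2)
        return p_good ** n_blocks

    def yield_one_spare(n_blocks, block_area_cm2, d0_per_cm2):
        # N+1 redundancy: build n_blocks+1, tolerate at most one defective block
        p_good = math.exp(-block_area_cm2 * d0_per_cm2)
        total = n_blocks + 1
        return p_good ** total + total * (1 - p_good) * p_good ** (total - 1)

    # Illustrative numbers only: 16 blocks of 0.05 cm^2 each, 0.2 defects/cm^2
    print(yield_no_spare(16, 0.05, 0.2))   # ~0.85
    print(yield_one_spare(16, 0.05, 0.2))  # ~0.99, for ~6% more block area

The spare buys yield, but the extra block also means extra area, wiring, and selection logic, which is where the latency/skew costs above come in.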
> In practice, folks are already providing redundancy where it is practical.
Where it is practical depends on a process' variability and defect rate and the design/market goals for the product. Non-foundry Intel had more opportunity to shift costs between the process, low-level design, and even (somewhat) product target. The accounting, as I implied, could be even less clearly segmented.
> Memory regions are a good example of this (standalone DRAM and FLASH as well as caches).
I was under the impression that column (row?) based redundancy was standard for SRAM arrays. Array-level redundancy was used by one of the Itanium implementations and presumably would make sense in some circumstances.
[snip]
> Notice that NVidia is losing a full 15% of their manufactured FLOPS just to allow the part to work.
>
> NVidia gets away with this because the margins on the A100 chips are very
> high. But this is probably a bad plan for laptop and desktop chips.
Even for lower-margin chips, a 15% area cost is not necessarily that important if a new process is effectively providing 80% more area (with possible integration benefits) along with performance and power gains.
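(As a rough worked number: if the new process offers about 1.8x the density, spending 15% of the die on redundancy still nets roughly 1.8 x 0.85 ≈ 1.5x the effective logic of the old process, before counting any frequency or power gains.)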
Obviously a GPU or sea-of-cores design can more easily exploit coarse-grained redundancy (which will have less impact on local latency). For Intel, leading in performance or energy efficiency is important for brand reasons, so finer-grained redundancy may be especially unattractive (and even in the personal computer processor market, binning can complicate power-performance-area-yield decisions). If one knows well in advance that all of one's product will be produced in a process whose defectivity and variability are well characterized, the redundancy and margin choices could presumably be tuned to that process.
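To put a rough number on the coarse-grained case, here is another sketch, assuming independent defects, a made-up 99% per-unit good probability, and unit counts only loosely patterned on an A100-class part with about 15% of units held in reserve:

    from math import comb

    def yield_at_least(n_physical, n_required, p_unit_good):
        # Probability that at least n_required of n_physical independent units are defect-free
        return sum(comb(n_physical, k)
                   * p_unit_good ** k
                   * (1 - p_unit_good) ** (n_physical - k)
                   for k in range(n_required, n_physical + 1))

    p = 0.99  # assumed per-unit probability of no killer defect
    print(yield_at_least(128, 128, p))  # require every unit good: ~0.28
    print(yield_at_least(128, 108, p))  # ~15% spare units: ~1.00

Even with a fairly optimistic per-unit defect probability, requiring every unit to be perfect is brutal, which is presumably why the coarse-grained spares are worth 15% of the FLOPS.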
(Radiation tolerance — and tolerance to other "environmental" factors, including temperature and voltage stability — would seem to also interact with design for manufacturing variability/defectivity. This implies that the tradeoffs might be significantly different for different products.)
Design effort and changing design orientation will also introduce risk, so one would not expect universal adoption of different methods even within one organization.
If defectivity/variation were unexpectedly high and persistent (possibly even as a result of trying to tweak a process to improve performance or energy efficiency), the designs already in hand would not be suited to the unexpected tradeoffs of this new cost structure.
If the information were available, the design space for manufacturability tradeoffs, with fixed and incremental costs and product value (performance, energy efficiency), might be a good PhD thesis topic, or more likely a long-term research interest.