By: David Hess (davidwhess.delete@this.gmail.com), September 17, 2022 9:03 am
Room: Moderated Discussions
anon2 (anon.delete@this.anon.com) on September 15, 2022 7:04 pm wrote:
>
> A very long time ago I recall some CPUs had a bios selection between write-back and write-through
> L1, possibly integrity was the reason.
The Pentium 2 and 3 had that. I thought it was for some obscure compatibility reason.
> More recently Intel used a "DCU 16kB mode" option in
> its Xeons. This changed the data cache unit from 32kB 8-way associative, to mirrored 16kB 4-way
> halves and ECC achieved with parity finding correct copy. This seems to have gone away in favor
> of an allegedly more robust L1D sram cell and they have no ECC on writeback L1.
The way I understand it, resistance to soft errors in an SRAM cell depends on the amount of stored charge, but a higher stored charge requires more power per write because the drivers have to overcome that charge. There are various ways to "unlock" the SRAM cell for the write, but they require more transistors and more area. In a CPU cache the access frequency is very high, so the extra charge that would make the SRAM resistant to soft errors has a high cost in power.
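For anyone unfamiliar with the mirrored mode quoted above, here is a rough sketch of the idea in Python: keep two copies of each word, each with its own parity bit, and on a read return whichever copy still passes parity. The class and function names are made up for illustration, and this is only a behavioral model under those assumptions, not a description of Intel's actual implementation.

```python
def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


class MirroredWord:
    """Hypothetical model: one data word stored twice, each copy with a parity bit."""

    def __init__(self, value: int) -> None:
        self.copies = [value, value]
        self.parities = [parity(value), parity(value)]

    def read(self) -> int:
        # Parity only detects an error; the mirror supplies the correction.
        for data, stored_parity in zip(self.copies, self.parities):
            if parity(data) == stored_parity:
                return data
        raise RuntimeError("both copies fail parity: uncorrectable")


word = MirroredWord(0xDEADBEEF)
word.copies[0] ^= 1 << 5          # simulate a single bit flip in the first copy
recovered = word.read()           # the second copy still passes parity
```

The cost is visible in the model: half the capacity is spent on the mirror, which matches the 32kB to 16kB reduction described in the quote.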
> I have no issue with this. Reliability is limited by chance of more than correctable bitflips,
> if 1 bitflip has very small chance then reliability can be fine. I'm no array designer but it
> does seem like at some point at the very high end of reliability, having ECC would be better
> than increasing bit reliability. But perhaps for Xeon reliability goal that is enough.
I was thinking that the performance loss from having to do a read-modify-write cycle eventually became high enough that the extra cost of implementing an SRAM cache with low-power writes, high stored charge, and a low soft error rate became a worthwhile tradeoff.
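To illustrate the read-modify-write cost, here is a minimal sketch that assumes the check bits cover a whole 8-byte word, so a single-byte store cannot update them without first reading the rest of the word. The `check_bits` function is a stand-in (an XOR of the bytes), not a real ECC code, and all names are hypothetical.

```python
WORD_BYTES = 8  # assumed width covered by one set of check bits


def check_bits(word: bytes) -> int:
    """Placeholder for a real ECC encoder: XOR of all bytes in the word."""
    result = 0
    for byte in word:
        result ^= byte
    return result


class ProtectedWord:
    """Hypothetical cache word whose check bits cover all WORD_BYTES bytes."""

    def __init__(self) -> None:
        self.data = bytearray(WORD_BYTES)
        self.check = check_bits(self.data)
        self.array_reads = 0
        self.array_writes = 0

    def store_full(self, new_word: bytes) -> None:
        # A full-width store can compute check bits from the new data alone.
        assert len(new_word) == WORD_BYTES
        self.data[:] = new_word
        self.check = check_bits(self.data)
        self.array_writes += 1

    def store_byte(self, offset: int, value: int) -> None:
        # Read: fetch the old word so the untouched bytes are known.
        old = bytes(self.data)
        self.array_reads += 1
        if check_bits(old) != self.check:
            raise RuntimeError("error detected in stored word")
        # Modify: merge the new byte into the old word.
        merged = bytearray(old)
        merged[offset] = value
        # Write: store the merged word with freshly computed check bits.
        self.data[:] = merged
        self.check = check_bits(merged)
        self.array_writes += 1


w = ProtectedWord()
w.store_full(bytes(range(WORD_BYTES)))   # one array write, no read
w.store_byte(3, 0xFF)                    # one extra array read plus one write
```

In this model every partial store costs an extra array read, which is the overhead a more robust cell without ECC would avoid.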
I do not know how it fits in here, but a high access frequency also increases the soft error rate. The same memory accesses performed at a slower rate produce fewer soft errors, so the L1 cache is particularly vulnerable.
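The quoted tradeoff between ECC and a more reliable bit can be put in rough numbers. The sketch below assumes independent bit flips with a per-bit probability `p` over some interval, a 64-bit unprotected word that fails on any flip, and a 72-bit SECDED word (64 data bits plus 8 check bits) that fails only on two or more flips. The value of `p` is made up for illustration and is not a measured rate.

```python
from math import comb


def prob_at_least(k: int, n_bits: int, p: float) -> float:
    """Probability that at least k of n_bits flip, each independently with probability p."""
    # Summing the binomial tail directly avoids cancellation when p is tiny.
    return sum(
        comb(n_bits, i) * p**i * (1.0 - p) ** (n_bits - i)
        for i in range(k, n_bits + 1)
    )


p = 1e-9  # hypothetical per-bit flip probability, for illustration only

fail_no_ecc = prob_at_least(1, 64, p)   # any flip corrupts an unprotected word
fail_secded = prob_at_least(2, 72, p)   # SECDED corrects one flip, fails on two

# Per-bit probability an unprotected cell would need to match the SECDED word.
# Uses the small-p approximation fail_no_ecc ~= 64 * p.
p_needed = fail_secded / 64

print(f"no ECC, 64 bits: {fail_no_ecc:.3e}")
print(f"SECDED, 72 bits: {fail_secded:.3e}")
print(f"per-bit p needed without ECC: {p_needed:.3e}")
```

Under these assumptions the unprotected failure probability scales with `p` while the SECDED one scales with `p` squared, so a sturdier cell has to become disproportionately better to keep up as the reliability target rises.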