By: anon2 (anon.delete@this.anon.com), September 15, 2022 7:04 pm
Room: Moderated Discussions
Everybody knows the data integrity problems with parity-protected write-back arrays. ECC for the L1 data cache has also long seemed to be a difficult problem that nobody has solved very well.
The options seem to be:
- A write-back L1D with parity, accepting the lack of correction.
- A write-through L1D with an ECC-protected L2.
- An expensive L1 ECC scheme.
A very long time ago I recall some CPUs had a BIOS selection between write-back and write-through L1; possibly integrity was the reason. More recently Intel offered a "DCU 16kB mode" option in its Xeons. This changed the data cache unit from 32kB 8-way associative to two mirrored 16kB 4-way halves, with correction achieved by using parity to identify the good copy. This seems to have gone away in favor of an allegedly more robust L1D SRAM cell, and they now have no ECC on the write-back L1.
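To make the mirrored-halves idea concrete, here is a toy Python sketch (my own reconstruction of the general idea, not Intel's actual implementation): every byte is written to both halves along with a parity bit, and on a read the copy whose parity still checks out is the one you trust.

def parity(byte):
    # Even parity over 8 data bits.
    p = 0
    for i in range(8):
        p ^= (byte >> i) & 1
    return p

class MirroredParityArray:
    def __init__(self, size):
        # Two mirrored copies; each entry holds (data, parity-at-write-time).
        self.copy_a = [(0, 0)] * size
        self.copy_b = [(0, 0)] * size

    def write(self, index, byte):
        entry = (byte, parity(byte))
        self.copy_a[index] = entry
        self.copy_b[index] = entry

    def read(self, index):
        data_a, par_a = self.copy_a[index]
        data_b, par_b = self.copy_b[index]
        if parity(data_a) == par_a:
            return data_a            # copy A is internally consistent
        if parity(data_b) == par_b:
            return data_b            # A is corrupt, fall back to B
        raise RuntimeError("uncorrectable: both copies failed parity")

The cost is obvious: half the usable capacity and associativity, which is presumably why it was an opt-in mode rather than the default.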
I have no issue with this. Reliability is limited by the chance of more bitflips than can be corrected; if a single bitflip already has a very small chance, then overall reliability can be fine. I'm no array designer, but it does seem like at some point at the very high end of reliability, adding ECC would be better than further increasing per-bit reliability. But perhaps for Xeon's reliability goal that is enough.
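As a back-of-the-envelope illustration of that argument (all numbers made up, just to show the shape of it): if the per-bit upset probability p is tiny, the chance of two or more flips landing in the same protected word is roughly the square of an already tiny number.

from math import comb

def p_at_least(k, n, p):
    # Probability of at least k upsets among n bits, each flipping independently with probability p.
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i)) for i in range(k, n + 1))

n = 64       # bits per protected word (assumed)
p = 1e-12    # per-bit upset probability over some interval (made-up figure)

print(p_at_least(1, n, p))   # ~6.4e-11: events that parity must at least detect
print(p_at_least(2, n, p))   # ~2.0e-21: events that even single-error correction could not fix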
If it's good enough for Xeon, it seems likely all other "normal" CPU designs have gone this way too. Exceptions would be certain high-reliability or rad-hard embedded parts, mainframes, and the like.
What is expensive about L1 ECC that is less costly in L2? Keep in mind you need write-through, so the L2 has to receive all the stores. Stores could be buffered and merged on the way to the L2, but surely they could also be buffered and merged on the way to the L1 in a write-back design. The L1 may see far more misses/refills than the L2, but if the ECC calculation is the expensive part, then the ECC bits could simply be shipped along with the refill data to the L1. I wonder what the really costly part is. Or is the answer that the benefit of a write-back L1 is just not very large? (But that would prompt the question of why others do not use a write-through design, if it does not hurt performance much.)
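For what it's worth, the encode itself is just a few wide XOR trees, so the raw calculation doesn't look like the expensive part to me. A minimal single-error-correcting Hamming encode/correct in Python, purely to show the shape of the computation (a real design would presumably use SEC-DED, e.g. 8 check bits over 64 data bits):

def hamming_encode(data_bits):
    # data_bits: list of 0/1. Returns a codeword with check bits at power-of-two positions.
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:   # smallest r with 2^r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)          # 1-indexed for convenience
    j = 0
    for pos in range(1, n + 1):   # data goes in the non-power-of-two positions
        if pos & (pos - 1):
            code[pos] = data_bits[j]
            j += 1
    for i in range(r):            # each check bit is an XOR over the positions it covers
        c = 1 << i
        for pos in range(1, n + 1):
            if (pos & c) and pos != c:
                code[c] ^= code[pos]
    return code[1:]

def hamming_correct(codeword):
    # Recompute the parities; a nonzero syndrome is the 1-based position of the flipped bit.
    n = len(codeword)
    code = [0] + list(codeword)
    syndrome = 0
    c = 1
    while c <= n:
        parity = 0
        for pos in range(1, n + 1):
            if pos & c:
                parity ^= code[pos]
        if parity:
            syndrome |= c
        c <<= 1
    if syndrome:
        code[syndrome] ^= 1
    return code[1:]

word = hamming_encode([1, 0, 1, 1] * 16)   # 64 data bits -> 71-bit codeword
word[10] ^= 1                              # inject a single-bit error
assert hamming_correct(word) == hamming_encode([1, 0, 1, 1] * 16)

That suggests the cost is somewhere else: read-modify-write for partial-word stores, latency in the load path, or extra array bits, rather than the XOR logic itself.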