Data integrity of L1 caches

By: Brett (ggtgp.delete@this.yahoo.com), September 16, 2022 12:12 pm
Room: Moderated Discussions
anon2 (anon.delete@this.anon.com) on September 16, 2022 9:04 am wrote:
> anon.1 (abc.delete@this.def.com) on September 16, 2022 6:51 am wrote:
> > anon2 (anon.delete@this.anon.com) on September 15, 2022 7:04 pm wrote:
> > > Everybody knows the data integrity problems with parity-protected write-back arrays. ECC also
> > > seems to be a difficult problem for the L1 data cache, one that nobody seems to have solved very well.
> > >
> > > The options seem to be:
> > > - A write-back L1D with parity and accept lack of correction.
> > > - A write-through L1D with ECC L2.
> > > - Expensive L1 ECC scheme.
> > >
> > > A very long time ago I recall some CPUs had a BIOS selection between write-back and write-through
> > > L1, possibly for integrity reasons. More recently Intel offered a "DCU 16kB mode" option in
> > > its Xeons. This changed the data cache unit from 32kB 8-way associative to mirrored 16kB 4-way
> > > halves, with correction achieved by using parity to find the correct copy. This seems to have gone
> > > away in favor of an allegedly more robust L1D SRAM cell, and they have no ECC on the write-back L1.
> > >
> > > I have no issue with this. Reliability is limited by the chance of more bit flips than can be
> > > corrected; if a single bit flip is sufficiently unlikely, reliability can be fine. I'm no array
> > > designer, but it does seem that at the very high end of reliability, having ECC would be better
> > > than increasing bit reliability. But perhaps for the Xeon reliability goal that is enough.
> > >
> > > If it's good enough for Xeon, it seems likely all other "normal" CPU designs have gone this way too.
> > > Exception would be certain highly reliable or rad hard embedded, and mainframes and the like.
> > >
> > > What is expensive about L1 ECC which is less costly in
> > > L2? Keep in mind you need write-through, so L2 has to
> > > receive all the stores. Stores could be buffered and merged
> > > along the way to the L2, but surely they could also
> > > be buffered and merged along the way to L1 in a write-back
> > > design. L1 may have a lot more misses / refills than
> > > L2, but if ECC calculation is the expensive part, then ECC
> > > bits should be shipped to L1. I wonder what is the
> > > really costly part? Or is the answer that the benefit of a write-back
> > > L1 is just not very large? (But that would prompt the question of
> > > why others do not use a write-through design, if it does not hurt performance much.)
> >
> > AMD claims to have ECC on L1D cache. I didn't look for Xeon assuming
> > you did check their documentation before stating what you did.
>
> I don't actually know what the recent couple of generations of Xeons do.
>
> >
> > "The AMD Family 17h processor contains a 32-Kbyte, 8-way set associative L1 data cache.
> > This is a write-back cache that supports two 128-bit loads and one 128-bit store per cycle.
> > In addition, the L1 cache is protected from bit errors through the use of ECC. There is
> > a hardware prefetcher that brings data into the L1 data cache to avoid misses."
> >
> > https://developer.amd.com/wordpress/media/2013/12/55723_SOG_Fam_17h_Processors_3.00.pdf
> >
>
> Interesting find. I wonder what strategy is used, there wasn't much else in there.

I am willing to bet there is a salvage bin for low-end consumer parts.
You ignore a bit line failure and let the ECC fix it.
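
To make the salvage idea concrete, here is a minimal Python sketch. It is my own illustration, not anything AMD documents: it assumes a toy SECDED code (8 data bits, 4 Hamming check bits, 1 overall parity bit), and the names encode, decode and read_with_stuck_bit are made up for the example. A stuck bit line is modeled as one column that always reads back the same value.

```python
# Toy SECDED code: 8 data bits + 4 Hamming check bits + 1 overall parity bit.
# Word width and layout are assumptions for illustration, not a real L1D format.
PARITY_POS = (1, 2, 4, 8)
DATA_POS = [p for p in range(1, 13) if p not in PARITY_POS]
N = 13  # index 0 holds the overall parity bit, 1..12 are Hamming positions


def encode(byte):
    """Return a 13-entry list of bits protecting one data byte."""
    code = [0] * N
    for i, pos in enumerate(DATA_POS):
        code[pos] = (byte >> i) & 1
    for p in PARITY_POS:
        # Each check bit covers every position whose index has bit p set.
        code[p] = sum(code[k] for k in range(1, N) if k & p and k != p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity enables double-error detection
    return code


def decode(code):
    """Return (byte, status); status is 'clean', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for k in range(1, N):
        if code[k]:
            syndrome ^= k
    odd = sum(code) % 2  # 1 if overall parity is violated
    if syndrome == 0 and not odd:
        status = "clean"
    elif odd and syndrome < N:
        code[syndrome] ^= 1  # single error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. two flipped bits
    byte = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return byte, status


def read_with_stuck_bit(code, column, stuck_value):
    """Model a failed bit line: one column always reads back stuck_value."""
    faulty = list(code)
    faulty[column] = stuck_value
    return faulty


if __name__ == "__main__":
    word = read_with_stuck_bit(encode(0xFF), column=5, stuck_value=0)
    print(decode(word))      # the stuck column alone is hidden by the ECC
    word[9] ^= 1             # a soft error on top of the stuck column
    print(decode(word)[1])   # now two bits are wrong in the same word
```

In this toy model every byte value reads back correctly with the stuck column, but the second flip is reported as uncorrectable: a salvaged part spends its correction budget on the hard fault, which is why I would expect this only in the low-end bin.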

Variants of this should be used throughout a design to maximize return on investment; it can be the difference between profit and loss on a flaky fab process, which all leading-edge processes are.

Many companies that think they are going to design a server-only chip are basically doomed by this. There are not enough working chips, and letting the flaky ones out ruins your reputation.
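
For comparison, the mirrored "DCU 16kB mode" idea quoted above can be sketched the same way. This is only my guess at how "parity finding correct copy" might work, not Intel's documented implementation; the class MirroredParityArray and its methods are hypothetical names.

```python
def parity(word):
    """Even-parity bit of an integer word."""
    return bin(word).count("1") & 1


class MirroredParityArray:
    """Two identical copies of each word, each with its own parity bit.

    Hypothetical model: detection comes from parity, correction comes
    from falling back to the mirror copy.
    """

    def __init__(self, n_words):
        self.copies = [[(0, 0)] * n_words for _ in range(2)]

    def write(self, index, word):
        entry = (word, parity(word))
        for copy in self.copies:
            copy[index] = entry

    def flip_bit(self, copy_id, index, bit):
        """Inject a single-bit upset into one copy (parity bit left alone)."""
        word, p = self.copies[copy_id][index]
        self.copies[copy_id][index] = (word ^ (1 << bit), p)

    def read(self, index):
        for copy_id, copy in enumerate(self.copies):
            word, p = copy[index]
            if parity(word) == p:
                # Use the first copy that passes parity and scrub the other one.
                self.copies[1 - copy_id][index] = (word, p)
                return word
        raise RuntimeError("both copies fail parity")


if __name__ == "__main__":
    arr = MirroredParityArray(4)
    arr.write(2, 0xDEADBEEF)
    arr.flip_bit(0, 2, 7)      # upset in the first copy only
    print(hex(arr.read(2)))    # served from the mirror, first copy repaired
```

The cost is obvious from the model: half the capacity is spent on the mirror, and parity only catches an odd number of flips per copy, so a double flip in one copy would pass unnoticed.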