By: Michael S (already5chosen.delete@this.yahoo.com), September 17, 2022 4:02 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on September 16, 2022 11:28 am wrote:
> anon2 (anon.delete@this.anon.com) on September 15, 2022 7:04 pm wrote:
> > Everybody knows the data integrity problems with parity protected write-back arrays. ECC has also
> > seemed to be a difficult problem for L1 data cache that seems like nobody has solved very well.
> >
>
> > What is expensive about L1 ECC which is less costly in
> > L2? Keep in mind you need write-through, so L2 has to
> > receive all the stores.
>
> ECC saves bits (relative to simpler options like replication) by performing math across
> the ENTIRE line.
I'd guess nobody tries to save *that* many bits.
More typically, L1D ECC, when used, is calculated over blocks of either 64 data bits + 8 check bits or 32 data bits + 7 check bits.
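To put rough numbers on that, here is a minimal sketch that counts check bits for a textbook Hamming SEC code plus one overall parity bit (SECDED). The 64-byte line size and the function names are my assumptions for illustration; real designs may use different codes.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SEC code plus one overall parity bit (SECDED)."""
    r = 0
    # Smallest r such that 2**r >= data_bits + r + 1 (single-error correction).
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one extra parity bit for double-error detection


LINE_BITS = 512  # assumption: 64-byte cache line


def check_bits_per_line(block_bits: int) -> int:
    """Total check bits for one line when each block is protected separately."""
    return (LINE_BITS // block_bits) * secded_check_bits(block_bits)


if __name__ == "__main__":
    for block in (32, 64, 512):
        print(block, secded_check_bits(block), check_bits_per_line(block))
```

Under these assumptions a 32-bit block needs 7 check bits (112 per line), a 64-bit block needs 8 (64 per line), and a single code over the whole 512-bit line needs 11. Per-block ECC costs more storage, but a store that covers a whole block does not need to read the rest of the line.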
> This means that each time you modify an element of L1, something (whether
> in the core itself or in the L1) will have to read the entire line (whether before or
> after the write) to calculate the new ECC value. That's a lot of extra power.
>
> L2 is different because the entire line is written at once
> from the L1 to the L2, so there's a one-time calculation.
>
> Now, could you avoid this by gathering writes in the store queue before sending them to L1?
> To some extent yes, but
> - you can't be sure that an entire line will be accumulated before there is some reason
> to push the line out to L1. So you still have to have the above machinery.
> - there are various constraints in the x86 memory model that limit the extent to which this
> sort of store gathering is permitted. I don't know the details, but my understanding is that
> the ARM memory model allows for much more aggressive gathering than is feasible under x86.
>
> I think (but don't trust this in the slightest!) that, for example, an unbroken sequence
> of stores to the same line by x86 can be gathered, but if an out of sequence store is
> then performed to a different line, the previous gathering has to terminate (and be
> pushed to L1 as a single unit) before a new gathering in that same line can begin.
> Conversely ARM does not care about gathering that bounces around between multiple lines in any order; the
> gathering in each line will continue till heuristics of whatever sort decide it's appropriate to write to
> L1. [Of course barriers or snoops modify behavior in both cases, but I'm describing the basic case.]
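A toy model of the two gathering policies as described in the quoted text, not of any real x86 or ARM core. It only counts how many merged writes reach L1 for a sequence of stores identified by cache line; the function names and the example sequence are hypothetical, and barriers, snoops, and buffer capacity are ignored.

```python
from itertools import groupby


def l1_writes_in_order(store_lines):
    """Gathering ends as soon as a store goes to a different line.

    Each unbroken run of stores to the same line becomes one L1 write.
    """
    return sum(1 for _ in groupby(store_lines))


def l1_writes_relaxed(store_lines):
    """Gathering continues per line no matter how stores interleave.

    Assumes every line is written to L1 exactly once, at the end.
    """
    return len(set(store_lines))


if __name__ == "__main__":
    stores = ["A", "A", "B", "A", "B", "B"]  # cache line touched by each store
    print(l1_writes_in_order(stores), l1_writes_relaxed(stores))
```

For the interleaved sequence above, the run-based policy produces 4 L1 writes and the relaxed policy produces 2, which is the sense in which the relaxed case leaves fewer partial-line writes that need an ECC update.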