By: dmcq (dmcq.delete@this.fano.co.uk), September 16, 2022 1:41 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on September 16, 2022 11:28 am wrote:
> anon2 (anon.delete@this.anon.com) on September 15, 2022 7:04 pm wrote:
> > Everybody knows the data integrity problems with parity protected write-back arrays. ECC has also
> > seemed to be a difficult problem for L1 data cache that seems like nobody has solved very well.
> >
>
> > What is expensive about L1 ECC which is less costly in
> > L2? Keep in mind you need write-through, so L2 has to
> > receive all the stores.
>
> ECC saves bits (relative to simpler options like replication) by performing math across
> the ENTIRE line. This means that each time you modify an element of L1, something (whether
> in the core itself or in the L1) will have to read the entire line (whether before or
> after the write) to calculate the new ECC value. That's a lot of extra power.
>
> L2 is different because the entire line is written at once
> from the L1 to the L2, so there's a one-time calculation.
>
> Now, could you avoid this by gathering writes in the store queue before sending them to L1?
> To some extent yes, but
> - you can't be sure that an entire line will be accumulated before there is some reason
> to push the line out to L1. So you still have to have the above machinery.
> - there are various constraints in the x86 memory model that limit the extent to which this
> sort of store gathering is permitted. I don't know the details, but my understanding is that
> the ARM memory model allows for much more aggressive gathering than is feasible under x86.
>
> I think (but don't trust this in the slightest!) that, for example, an unbroken sequence
> of stores to the same line by x86 can be gathered, but if an out of sequence store is
> then performed to a different line, the previous gathering has to terminate (and be a single
> unit to be pushed to L1) before a new gathering in that same line can begin.
> Conversely ARM does not care about gathering that bounces around between multiple lines in any order; the
> gathering in each line will continue till heuristics of whatever sort decide it's appropriate to write to
> L1. [Of course barriers or snoops modify behavior in both cases, but I'm describing the basic case.]
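Just to make the gathering policy you describe concrete, here is a toy model in C of the x86-ish rule as I read it — one gather open at a time, and a store to a different line flushes the current one first. This is purely illustrative; the names and structure are mine, not a statement of either architecture's actual rules.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SHIFT 6   /* 64-byte cache lines */

/* State for the single line currently being gathered. */
struct gather {
    uint64_t line;     /* line address of the open gather */
    bool     open;
};

static int flushes;    /* count of gathers pushed out to L1 */

static void flush(struct gather *g) {
    if (g->open) {
        flushes++;     /* the gathered line goes to L1 as one unit */
        g->open = false;
    }
}

/* x86-ish policy: an out-of-line store terminates the current gather
 * before a gather on the new line can begin. */
void store_x86ish(struct gather *g, uint64_t addr) {
    uint64_t line = addr >> LINE_SHIFT;
    if (g->open && g->line != line)
        flush(g);
    g->line = line;
    g->open = true;
}
```

The ARM-ish version would simply keep a table of several open gathers instead of flushing on a line change.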
I think even ARM suffers that problem. As far as I can see, because of lock-free programming one has to assume most of what you are saying about x86. I think it could be fixed if all lock-free programming used load-acquire and store-release for its loads and stores. That would probably hurt lock-free timing, unfortunately, but I think it would be worth it. Even better would have been dedicated lock-free load and store instructions for pointers, to make the whole business explicit and let the hardware take advantage of any difference.
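For what it's worth, here is roughly what I mean, sketched with C11 atomics (the names and the 42 payload are my own invention): the release store keeps the node's fields from being reordered past the pointer publication, and the acquire load pairs with it on the reader side.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { int payload; };

static _Atomic(struct node *) shared = NULL;

/* Writer: fill in the node, then publish the pointer with release
 * ordering so the payload is visible before the pointer is. */
void publish(struct node *n) {
    n->payload = 42;                     /* plain store */
    atomic_store_explicit(&shared, n, memory_order_release);
}

/* Reader: acquire load pairs with the release store above; if we see
 * the pointer, we are guaranteed to see the payload. */
int consume(void) {
    struct node *n = atomic_load_explicit(&shared, memory_order_acquire);
    return n ? n->payload : -1;          /* -1 = not yet published */
}
```

If all lock-free pointer traffic looked like this, the hardware would know exactly which stores must not be gathered past which others, instead of having to assume the worst everywhere.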
As far as I can see, the ARM server designs use ECC for the L1 data cache and parity for the L1 instruction cache. They also offer ECC as an option on a lot of their other designs, in particular their microcontrollers.
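To make the recompute cost from the quoted post concrete, here is a toy sketch. The XOR checksum here is just a stand-in for a real SECDED code, and the structure is my own invention, but it shows the shape of the problem: when the check value covers the whole 64-byte line, a one-byte store turns into a read of the entire protected granule.

```c
#include <stdint.h>

#define LINE_SIZE 64

struct line {
    uint8_t data[LINE_SIZE];
    uint8_t check;   /* check value covering the whole line */
};

/* Stand-in for ECC generation: touches every byte of the line. */
static uint8_t line_checksum(const uint8_t *d) {
    uint8_t c = 0;
    for (int i = 0; i < LINE_SIZE; i++)
        c ^= d[i];
    return c;
}

/* A single-byte store forces a full-line read to regenerate the
 * check bits -- this is the extra power the quoted post refers to. */
void store_byte(struct line *l, int off, uint8_t v) {
    l->data[off] = v;
    l->check = line_checksum(l->data);
}
```

A real linear code could be updated incrementally from the old and new data, but the store path still becomes a read-modify-write of the protected granule rather than a simple write, which is why write-through plus ECC only at L2 is attractive.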