By: Michael S (already5chosen.delete@this.yahoo.com), September 19, 2022 8:47 am
Room: Moderated Discussions
hobold (hobold.delete@this.vectorizer.org) on September 18, 2022 6:32 pm wrote:
> Anon (no.delete@this.spam.com) on September 18, 2022 12:54 pm wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on September 18, 2022 12:29 pm wrote:
> > > That's true, but sub-line ECC has a much higher area overhead.
> > > So maybe I'd state it as 'power or area, take your pick'...
> >
> > Does it really matter? I mean, if you only modify 8 bytes out of the 64 bytes of the cache line then
> > old_value XOR new_value should give you enough information to update the ECC bits of the entire cache
> > line, only parity would actually appear per 8 bytes or so to not increase read latency too much.
>
> If I am not mistaken, there used to be specific ECC codes that have a geometric interpretation.
> With such a code, I think, oldVal XOR newVal would effectively yield a mirror plane (well,
> hyperplane ... something one dimension smaller than the entire code space). And then
> oldCode mirrored on that plane should result in the correct newCode.
>
> I don't remember the name of these specific codes, and if they are still in active use today. But my
> nebulous memory says it was this construction based on hypercubes such that any corner making up a valid
> code is surrounded by an edge subgraph of invalid codes, such that ... well, one can think of the neighbourhood
> as three rings. The innermost ring is all the single bit errors, so those are correctable. The 2nd ring
> is all two bit errors, so those are detectable. But the outermost 3rd ring is already made up of direct
> neighbours of other valid codes, so those will lead to an unrecoverable data loss.
What you described in your last sentence is normally referred to as "Hamming distance between legal code words >= 4", and it is the defining property of all SECDED codes. (Distance 3 would only give you single-error correction; the second ring of detectable-but-uncorrectable double errors in your picture is exactly what the fourth unit of distance buys.)
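As a toy illustration of the geometry hobold described (not any vendor's actual implementation), here is an extended Hamming (8,4) SECDED code: three Hamming parity bits locate a single flipped bit, and an overall parity bit distinguishes single errors (odd overall parity, correctable) from double errors (even overall parity with a nonzero syndrome, detectable only).

```python
# Toy extended Hamming (8,4) SECDED sketch. Data bits d1..d4; Hamming parity
# bits p1,p2,p3 sit at positions 1,2,4 of the 7-bit word; position 0 holds an
# overall parity bit p0 that upgrades SEC to SECDED.

def encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7
    p0 = 0
    for b in word:
        p0 ^= b                           # overall parity over the 7-bit word
    return [p0] + word

def decode(code):
    p0, word = code[0], code[1:]
    syndrome = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= pos               # XOR of set-bit positions = error locator
    overall = p0
    for b in word:
        overall ^= b
    if syndrome == 0 and overall == 0:
        return "ok", word
    if overall == 1:                      # odd number of flips: assume one, correct it
        if syndrome:
            word[syndrome - 1] ^= 1
        return "corrected", word          # syndrome == 0 means p0 itself flipped
    return "double-error detected", None  # even flips, nonzero syndrome

code = encode([1, 0, 1, 1])
code[5] ^= 1                              # flip one bit in flight
status, fixed = decode(code)
```

A single flip anywhere in the 8 bits gets corrected; any two flips land in the "second ring" and come back as "double-error detected"; three flips can silently decode to a neighbouring valid code word, which is the unrecoverable case from the post above.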
As to real-world use, you probably would not find codes like that used for main-memory ECC in mid- to high-end servers. More likely, the codes used there try to take advantage of the non-uniform distribution of real-world errors (some memory chips are worse than others) in order to make correction more robust — symbol-oriented schemes along the lines of Chipkill, which can ride out the failure of a whole DRAM device.
But for relatively small internal SRAM arrays the expected error distribution is pretty close to uniform, so such a classic code should work fine. Without internal knowledge, I fully expect that classic Hamming codes, or something very similar, are used in practice for L2 cache ECC and, on devices that have it, for L1D ECC.
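The XOR-update idea from the quoted post falls out of linearity: for any linear code the check bits satisfy ecc(a ^ b) == ecc(a) ^ ecc(b), so the controller can fold old_value XOR new_value into the stored check bits without recomputing over the untouched bytes. A sketch with an arbitrary made-up linear check function (parity over masked bit groups — not any real controller's code):

```python
# Sketch: incremental ECC update via linearity. The six masks below are an
# arbitrary example; any GF(2)-linear check function behaves the same way.
import random

MASKS = [0x00000000FFFFFFFF, 0x0000FFFF0000FFFF,
         0x00FF00FF00FF00FF, 0x0F0F0F0F0F0F0F0F,
         0x3333333333333333, 0x5555555555555555]

def ecc(x):
    """Six check bits: parity of x under each mask (a linear function of x)."""
    bits = 0
    for i, m in enumerate(MASKS):
        bits |= (bin(x & m).count("1") & 1) << i
    return bits

old = random.getrandbits(64)
new = random.getrandbits(64)
# Update from the delta alone: stored ECC XOR ecc(old ^ new) == ECC of new value.
assert ecc(old) ^ ecc(old ^ new) == ecc(new)
```

Since parity((a ^ b) & m) == parity(a & m) ^ parity(b & m), the assertion holds for every pair of values; the same identity holds for the check-bit matrix of a real Hamming/SECDED code.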