CPU & Memory bit flips

By: rwessel (rwessel.delete@this.yahoo.com), March 4, 2021 12:27 pm
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on March 4, 2021 6:59 am wrote:
> zArchJon (Anon.delete@this.anon.com) on March 4, 2021 6:39 am wrote:
> > Ganon (anon.delete@this.gmail.com) on March 3, 2021 9:05 am wrote:
> > > A recent pair of papers from facebook emphasized the importance of checksum protection even
> > > within a single process:
> > >
> > > Facebook’s Tectonic Filesystem: Efficiency from Exascale
> > > https://www.usenix.org/system/files/fast21-pan.pdf
> > >
> > > "
> > > At Tectonic’s scale, with thousands of machines reading and writing a large amount of data every day,
> > > in-memory data corruption is a regular occurrence, a phenomenon observed in other large-scale systems
> > > [12,27]. We address this by enforcing checksum checks within and between process boundaries.
> > > "
> > >
> > > and
> > >
> > > Evolution of Development Priorities in Key-value Stores
> > > Serving Large-scale Applications: The RocksDB Experience
> > > https://www.usenix.org/system/files/fast21-dong.pdf
> > >
> > > "
> > > 11. CPU/memory corruption does happen, though very rarely,
> > > and sometimes cannot be handled by data replication. (§5)
> > >
> > > 12. Integrity protection must cover the entire system in order to prevent corrupted data (e.g.,
> > > caused by bitflips in CPU/memory) from being exposed to clients or other replicas; detecting
> > > corruption only when the data is at rest or being sent over the wire is insufficient. (§5)
> > > "
> > >
> > >
> > > -----
> > > Checksums & similar protect the data but what about the code (instructions)? The total data footprint
> > > of instructions is smaller so the bitflips are less likely in practice there. Does hw take special
> > > measures to protect instructions from corruption (more than it does for data)? What sw measures
> > > make sense to protect instructions (assuming we need to care about this as well)?
> >
> > Along with legacy software, this is one reason people pay the extra price for IBM Z systems.
> > All of the memory is protected with RAIM (Redundant Array of Independent Memory) which
> > not only allows for an entire DIMM failure, but also provides robust error detection and
> > correction on DIMMs. For more technical background, see:
> > https://www.ibm.com/community/z/wp-content/uploads/sites/14/2020/04/sysdevblog-4cde-ibm2520zenterprise2520raim.pdf
> >
> > Along with protecting the memory, all of the caches and directories are protected and the CPUs
> > even have robust error detection and transparent retry mechanisms when a soft-error is detected.
> > And transparent CPU sparing when the error does not go away after a threshold of retries.
> >
> > Even access to remote disks is protected using additional CRCs in the FICON layer
> > on top of Fiber Channel to detect bit errors above the physical protocol layer.
> >
>
> Yep they go in for checks in depth. I've wondered how they check for soft errors in the CPU. One could do parity
> or modulo checks for arithmetic operations but one would need to duplicate logic operations I think. And of
> course all the control would need checking too. Triple redundancy checks is the simplest way round that I can
> see but I don't think they are doing that. I'd be inclined to just run the CPUs every so often in a test mode
> with various extremes of the clock and power and say that's fine for the next while if they pass.


There's a wide-ranging literature on self-checking circuits, and they are widely used. Usually you can do considerably better than full duplication. Very often you can compute something parity-ish of the result (from the inputs) in parallel with the actual circuit, and then check the actual result against that. These are often referred to as "compressors". Applying that sort of thing to ALU-ish constructs is old hat, things like caches are fairly easy to protect with ECC, and busses can get simple parity. I'm sure the wide/OoO/dataflow-ish structures need their own special techniques, but I've not kept up.
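To make the "predict a small check value from the inputs, then compare" idea concrete, here is a minimal software sketch of a mod-3 residue check on a 32-bit adder. It is only a toy model under stated assumptions: the names (`residue3`, `checked_add`, `fault_mask`) are made up for illustration, the checker path is assumed to be fault-free, and nothing here describes how any particular CPU actually does it.

```python
# Toy model of a residue-checked adder (mod-3 residue code).
# All names are illustrative; this is not taken from any real design.

MASK = 0xFFFFFFFF  # model a 32-bit datapath


def residue3(x: int) -> int:
    """Small check value that is cheap to carry alongside the data."""
    return x % 3


def checked_add(a: int, b: int, fault_mask: int = 0) -> int:
    """Add two 32-bit values and verify the sum with a mod-3 residue.

    fault_mask simulates a soft error by XOR-ing bits into the output
    of the main adder. The checker path is assumed to be fault-free.
    """
    full = a + b
    carry_out = full >> 32                  # 0 or 1
    result = (full & MASK) ^ fault_mask     # main datapath, maybe corrupted

    # Checker path: predict the residue of the result from the inputs.
    # 2**32 % 3 == 1, so dropping the carry-out subtracts carry_out (mod 3).
    predicted = (residue3(a) + residue3(b) - carry_out) % 3

    if residue3(result) != predicted:
        raise RuntimeError("residue mismatch: soft error detected")
    return result


if __name__ == "__main__":
    print(checked_add(1, 2))                # clean add passes the check
    try:
        checked_add(1, 2, fault_mask=1 << 5)
    except RuntimeError as exc:
        print(exc)                          # single-bit flip is caught
```

A single flipped bit changes the result by plus or minus a power of two, which is never a multiple of 3, so this check catches every single-bit error in the sum. Some multi-bit errors slip through (flipping bits 0 and 1 of a zero result adds exactly 3), which is the usual trade against full duplication.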

That's usually coupled with some sort of checkpoint/rollback mechanism (similar in principle to how speculative execution is rolled back).

After a retry fails, you want enough hardware to push the architectural state (caches, registers) to memory, where it can be resumed by another processor. On modern Z, the hardware itself can move that state to a hot spare processor. If the system is out of spares, it punts the move to the OS as described above, which has been part of the architecture since S/360.
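The checkpoint, retry, and sparing sequence can be sketched in software as below. This is a toy model only: the function and class names, the retry threshold, and the dict used for "architectural state" are all hypothetical, and real hardware does this in logic rather than in code.

```python
# Toy model of checkpoint / retry / spare handoff.
# Every name and the retry threshold are hypothetical.
import copy


class SoftError(Exception):
    """Raised by a step when its self-check fails."""


def run_with_recovery(state, step, cores, max_retries=3):
    """Run step(core, state) with rollback, retry, and sparing.

    state : dict modelling architectural state; it acts as the
            checkpoint and is never modified here
    step  : callable that returns the new state or raises SoftError
    cores : list of core ids; the first is active, the rest are spares
    Returns (core_that_succeeded, new_state).
    """
    for core in cores:
        for _attempt in range(1 + max_retries):
            # Work on a copy so a failed attempt leaves the checkpoint intact.
            working = copy.deepcopy(state)
            try:
                return core, step(core, working)
            except SoftError:
                continue  # rollback: discard the copy and retry
        # Retry threshold exceeded on this core: fall through to the next spare.
    raise RuntimeError("out of spares: hand the checkpointed state to the OS")


if __name__ == "__main__":
    def core0_is_broken(core, st):
        if core == 0:
            raise SoftError()
        st["r1"] += 1
        return st

    print(run_with_recovery({"r1": 41}, core0_is_broken, cores=[0, 1]))
```

The one design point the sketch tries to show is that the failing attempt never touches the checkpoint, so either a retry on the same core or a restart on a spare begins from the same known-good state.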

Of course none of that helps if the core fails hard enough (so that you can't extract the checkpointed state from before the failure).