memory errors

By: Brendan, March 6, 2021 12:38 am
Room: Moderated Discussions
Etienne Lorrain on March 5, 2021 6:23 am wrote:
> Carson on March 5, 2021 2:31 am wrote:
> > Etienne Lorrain on March 4, 2021 6:58 am wrote:
> > > You cannot be "only initialising a page when it was allocated" because the BIOS does not allocate
> > > memory, it just tells you how much memory is available, and the concept of page is not known
> > > (no virtual memory at that point). Moreover the page size is decided by the OS, processors support
> > > different sizes. Moreover memory over 1 Megabyte was special, either EMS or XMS or HMA.
> > > And you could not modify DOS, proprietary software with no source.
> > > The real missing thing was (and still partly is) a proper DMA, able to send more than 64 Kbytes
> > > and able to access more than 1 Megabyte / 16 Mbytes with chipset recognition and NDA docs.
> >
> > These all seem like non-issues.
> >
> > For OSes which cannot handle uninitialized ECC, have a forward-compatibility flag somewhere
> > in the boot loader which means "I can handle uninitialized ECC". If it's not present, BIOS
> > clears memory before transferring to the boot loader. A chaining boot loader (LILO or whatever)
> > is required to do the same check on OSes it chain-loads. (By calling back into the BIOS which
> > has all the necessary code, so the chain loader doesn't suffer much code bloat.)
> >
> > For loaders which do support uninitialized ECC, a simple BIOS data structure (like the 0xE820 memory
> > map) describes the initialized parts, and there are BIOS calls to extend the initialized parts.
> >
> You make me feel old, that is bad on Friday -:)
> BIOS was developed at a time where Linux was a sin, you would be made immediately redundant
> if you installed it on a PC at work (because there was no anti-virus on Linux).
> Because DOS did not manage ECC at all, and Windows 3.1/95 started from
> DOS, the BIOS had to initialise all the memory it knew about.
> DOS did not know about EMS or XMS memory, nothing more than 1 Mbyte.
> EMS could be provided by an ISA card (i.e. something like PCI), and video/network card
> BIOS would be on the ISA/PCI card itself, their BIOS did not manage ECC either.
> > The loader asks the BIOS to initialize enough to hold whatever it's chain-loading,
> > and that loader does the same for its own bss and initial stack.
> >
> > Then the OS's early boot probes the hardware to find out
> > if it's capable of talking to the ECC hardware without
> > BIOS support. If not, before the BIOS is fully disabled, fall back to asking it to initialize everything.
> >
> > With all of those fallbacks implemented, the hopefully common case is that the OS does know how to drive
> > the ECC hardware and it initializes its own memory allocator with all unallocated memory flagged "ECC bad".
> > The first time that memory is allocated, it is initialized (which may be as simple as CLZERO).
> >
> > Since the OS is doing all of this, it understands the page size and all the necessary rules. The important
> > point is that the initialization is done lazily, after applications have started running.
> >
> > A more sophisticated OS might extend the idea of "potentially
> > uninitialized page" beyond the memory allocator
> > proper, and things like disk DMA which are going to overwrite
> > the whole page could elide the initialization.
> > Since this overlaps heavily with the zeroing required for security, it's not actually a major project.
> >
> > This all seems like a pretty straightforward SMOP.
> Nowadays things have changed, the EFI system is a lot more complicated, its whole behaviour/specification
> is under NDA (Non Disclosure Agreement), and secure boot will stop you implementing anything
> which is not approved by the manufacturer - or not supported by Windows 10.
> So basically none of your sophisticated ideas can be implemented, if ECC has to be supported
> it has to be done by the EFI BIOS (Windows 10 does not want to hear about ECC), and you get
> what Linus was complaining about: a machine-check exception with no way to know even the
> processor which triggered the ECC error or the address which caused the problem.
> That is what people call progress: a nice background image telling you how intelligent you
> are by having bought this top of the range brand of PC, during the whole Windows boot time.

BIOS is deprecated and it's not worth the hassle of trying to change it now. Newer operating systems (that would support a new "lazy ECC initialization" feature) can/should all use UEFI instead.

For UEFI, complexity isn't a problem, behavior that's under NDA isn't a problem (as the people who write those parts of firmware already signed the NDA), and Secure Boot isn't a problem. I suspect that you were thinking of a "crazed cowboy gives appropriate standards bodies the finger and tries to change the world all by themselves" scenario rather than a "competent professional lobbies the industry for changes to established (UEFI) standards" scenario.

More specifically, the Unified EFI Forum would probably only need to make 3 changes to the UEFI spec:

a) define a new value for the "Windows Subsystem type" field in PE32+ headers (e.g. "IMAGE_SUBSYSTEM_EFI_APPLICATION_ECC_AWARE") that's used by boot loaders (and shouldn't be used for UEFI applications - utilities, drivers, etc)

b) define a new value for "area type" in UEFI's memory map for "RAM that isn't usable until ECC is initialized for it". Existing/old software will ignore these areas because they won't know what the area type means, which is fine.

c) define a new file name for boot loaders on removable media - e.g. OS installers, etc (maybe "\EFI\BOOT\BOOTX64.ECC" instead of "\EFI\BOOT\BOOTX64.EFI")
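
To make (b) concrete, here is a rough sketch in C of how old and new software would read a memory map that contains the new area type. EfiConventionalMemory really is 7 in the UEFI spec and UEFI pages really are 4 KiB; the value used for the new type, the simplified "struct mem_region" and the "usable_bytes()" helper are all invented for illustration and are not part of any spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Real values: EfiConventionalMemory in EFI_MEMORY_TYPE, and the UEFI page size. */
#define EFI_CONVENTIONAL_MEMORY 7u
#define EFI_PAGE_SIZE 4096u

/* HYPOTHETICAL: made-up value for "RAM that isn't usable until ECC is
 * initialized for it". No such type exists in the UEFI spec today. */
#define HYP_EFI_ECC_UNINITIALIZED_MEMORY 0x70000001u

/* Simplified stand-in for a memory map entry (not the real EFI_MEMORY_DESCRIPTOR). */
struct mem_region {
    uint32_t type;
    uint64_t num_pages;
};

/* Sum the RAM a loader considers usable. An old loader (ecc_aware == 0) skips
 * the unknown type, exactly as it skips any other type it does not recognize.
 * An ECC-aware loader also counts the uninitialized areas, which it promises
 * to initialize itself before first use. */
uint64_t usable_bytes(const struct mem_region *map, size_t count, int ecc_aware)
{
    uint64_t total = 0;

    assert(map != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (map[i].type == EFI_CONVENTIONAL_MEMORY ||
            (ecc_aware && map[i].type == HYP_EFI_ECC_UNINITIALIZED_MEMORY)) {
            total += map[i].num_pages * EFI_PAGE_SIZE;
        }
    }
    return total;
}
```

The point is that nothing in the old code path has to change: the old loader just sees less RAM.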

With those changes in the UEFI spec, at power-on the firmware could initialize ECC for only the first 1 GiB of RAM. It could then do lazy initialization itself if UEFI software (drivers, utilities) tries to allocate more RAM than has been initialized. Finally, if/when something calls "ExitBootServices()", it can look at the boot loader's subsystem type and either (for IMAGE_SUBSYSTEM_EFI_APPLICATION) initialize ECC for the rest of RAM and update the memory map, or (for IMAGE_SUBSYSTEM_EFI_APPLICATION_ECC_AWARE) leave it uninitialized and let the OS worry about it.
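
As a toy model of that firmware flow (not real firmware code), something like the following C would do. IMAGE_SUBSYSTEM_EFI_APPLICATION is 10 in the PE/COFF spec; the ECC-aware subsystem value, the "fw_state" structure, the function names and the assumption that RAM is handed out linearly from the bottom are all made up to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EFI_PAGE_SIZE 4096u
#define IMAGE_SUBSYSTEM_EFI_APPLICATION 10u                   /* real PE/COFF value */
#define HYP_IMAGE_SUBSYSTEM_EFI_APPLICATION_ECC_AWARE 0x0ECCu /* HYPOTHETICAL value */

/* Toy firmware state. Assumes ECC is initialized from the start of RAM upwards
 * and that pages are allocated linearly, which real firmware does not do. */
struct fw_state {
    uint64_t total_pages;
    uint64_t initialized_pages;
    uint64_t allocated_pages;
};

/* Stand-in for the chipset-specific work of writing RAM so its ECC is valid. */
static void init_ecc_up_to(struct fw_state *fw, uint64_t pages)
{
    if (pages > fw->total_pages)
        pages = fw->total_pages;
    if (pages > fw->initialized_pages)
        fw->initialized_pages = pages;
}

/* Power-on: only the first 1 GiB gets initialized up front. */
void fw_power_on(struct fw_state *fw, uint64_t total_pages)
{
    assert(fw != NULL);
    fw->total_pages = total_pages;
    fw->initialized_pages = 0;
    fw->allocated_pages = 0;
    init_ecc_up_to(fw, (1ull << 30) / EFI_PAGE_SIZE);
}

/* Boot-time allocation: lazily extend the initialized area if needed. */
int fw_allocate_pages(struct fw_state *fw, uint64_t pages)
{
    assert(fw != NULL);
    if (pages > fw->total_pages - fw->allocated_pages)
        return -1; /* out of memory */
    fw->allocated_pages += pages;
    init_ecc_up_to(fw, fw->allocated_pages);
    return 0;
}

/* ExitBootServices: finish the job unless the loader said it is ECC aware.
 * Returns the number of pages left uninitialized for the OS to handle. */
uint64_t fw_exit_boot_services(struct fw_state *fw, uint16_t loader_subsystem)
{
    assert(fw != NULL);
    if (loader_subsystem != HYP_IMAGE_SUBSYSTEM_EFI_APPLICATION_ECC_AWARE)
        init_ecc_up_to(fw, fw->total_pages);
    return fw->total_pages - fw->initialized_pages;
}
```

For a 16 GiB machine, an old boot loader gets 0 uninitialized pages back (same as today), while an ECC-aware one is left with everything the firmware never needed.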

The end result is that all existing/old UEFI software (drivers, applications/utilities, boot loaders) would continue to work the same as it does now, and new "ECC aware" boot loaders would boot faster on new systems. However, new "ECC aware" boot loaders won't work (they will be ignored) on old computers that don't support them, so OS developers would need to provide a fallback boot loader for old/existing computers, which is relatively easy for them to do.
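
The fallback for removable media can be pictured like this, assuming the standard default boot path "\EFI\BOOT\BOOTX64.EFI" and a made-up ".ECC" sibling next to it. The "pick_loader()" function and the idea of passing in a plain list of file names are purely illustrative; real firmware walks a FAT file system instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LEGACY_LOADER  "\\EFI\\BOOT\\BOOTX64.EFI" /* real default removable-media path */
#define HYP_ECC_LOADER "\\EFI\\BOOT\\BOOTX64.ECC" /* HYPOTHETICAL new file name */

/* Choose which boot loader to start from a list of files found on the media.
 * New firmware prefers the ECC-aware loader; old firmware has never heard of
 * the new name, so it only ever finds the legacy one. Returns NULL if there
 * is nothing this firmware can boot. */
const char *pick_loader(const char *const *files, size_t count,
                        bool firmware_supports_ecc_loaders)
{
    const char *legacy = NULL;

    assert(files != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (firmware_supports_ecc_loaders && strcmp(files[i], HYP_ECC_LOADER) == 0)
            return files[i];
        if (strcmp(files[i], LEGACY_LOADER) == 0)
            legacy = files[i];
    }
    return legacy;
}
```

An installer that ships both files boots everywhere; one that ships only the new file is simply not seen by old firmware, which is why the fallback loader is needed.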

The only real problem is whether anyone (Unified EFI Forum members, firmware developers, Microsoft) cares enough to add support for it. I honestly doubt that they do, because the number of users who need very fast boot times on systems with ECC is almost zero. People who need extremely high availability use hot-spare fail-over (and avoid the need to reboot fully: hot-plug, things like "kexec()" for kernel updates), and the people who would benefit the most from fast boot times are people using small mobile devices (notebook, laptop) that don't have ECC. I'd even go a step further and say that the fact that it wasn't already added to UEFI over the last 20 years or so is proof that nobody cares enough to bother.

Note that (around 10 years ago now?) Microsoft pushed hard to improve boot times and this led to "fast boot" and "ultra fast boot" options being added to (UEFI) firmware; where (if these modes are enabled in firmware setup) the initialization of some devices is skipped by firmware, the "wait for user to ask to enter firmware setup" delay is disabled and a few other changes (no CSM) occur. It's reasonable to assume that (while Microsoft was trying to improve boot times and talking firmware developers into adopting a new "fast boot" feature) someone at Microsoft noticed how long ECC initialization was taking and decided not to bother doing anything about it.

- Brendan