By: Adrian (a.delete@this.acm.org), January 4, 2021 5:49 am
Room: Moderated Discussions
Dan Strother (dan.strother.delete@this.gmail.com) on January 3, 2021 7:00 pm wrote:
> Adrian (a.delete@this.acm.org) on January 3, 2021 1:34 pm wrote:
> > > > On Linux, there was the kernel option CONFIG_EDAC_AMD64_ERROR_INJECTION, which would
> > > > allow access to the testing features of the memory controller on older AMD CPUs.
> > > >
> > The code in /usr/src/linux/drivers/edac/amd64_edac_inj.c does not have any mention about compatibility
> > with a specific CPU family, so hopefully these functions continue to work on the current AMD CPUs.
> >
> > That option is off by default, so the kernel must be recompiled to activate it and I must search the Linux
> > documentation to discover what must be written into /sys/devices/system/edac/mc/ to inject errors.
> >
> > When I will have some spare time, I will test if this works.
>
> Note that even if the CPU supports ECC error injection, the
> BIOS may disable it. See this post for some context:
> https://hardwarecanucks.com/forum/threads/ecc-memory-amds-ryzen-a-deep-dive-comment-thread.75041/page-6#post-902700
>
> In that post, the poster is trying to use MemTest86 Pro (the paid PassMark version, not the free one)
> to inject errors on their Ryzen 3000 system with an ASRock Rack motherboard. They had to change the "Disable
> Memory Error Injection" option in the motherboard's BIOS to enable injection. Unfortunately, they weren't
> able to confirm that it was actually working - errors appeared to be injected, but then went unreported
> (even worse, ASRock Rack support then claimed that ECC reporting wasn't supported at all!).
>
> I found this some months ago during my Xeon vs Ryzen ECC research; there may be better posts now.
> It's anecdotes like this one that steered me away from Ryzen at the time. I wish I had come across
> your positive reports then, Adrian - I might have wound up with a Ryzen system instead.
>
> I also have a vague recollection that I found some comments suggesting that error injection was fused
> off on Ryzen 3000 CPUs (but had been supported on 2000 CPUs), but I'm unable to find this in my notes.
> The MemTest86 version history does have some interesting comments around injection support on AMD:
> https://www.memtest86.com/whats-new.html
>
> For example: "Added ECC detection/injection support for AMD Ryzen chipsets. Note that injection
> support is typically disabled by AMD, except for some CPUs which are engineering samples."
> And: "Added warning message when failing to inject ECC errors
> for Ryzen chipsets (due to being disabled in production)"
>
> Does AMD document any of this in their public datasheets? (I haven't tried looking yet..)
Yes, in the BIOS of my ASUS workstation motherboard there is also a setting to enable error injection, so I had to enable that.
I have also recompiled the Linux kernel with error injection enabled, so I now have 5 new entries in sysfs (/sys/devices/system/edac/mc/mc0/inject_*).
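To confirm which injection entries the rebuilt kernel actually exposes, something like the following Python sketch could be used. The function name `list_inject_entries` is my own invention for illustration. The default path assumes the first memory controller is registered as mc0, as in the path above. It only lists file names and makes no assumption about what must be written into them.

```python
from pathlib import Path


def list_inject_entries(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Return the sorted names of the inject_* entries under one memory controller.

    Returns an empty list when the directory does not exist, e.g. when the
    EDAC driver is not loaded or the kernel was built without error injection.
    """
    base = Path(mc_dir)
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.glob("inject_*"))


if __name__ == "__main__":
    entries = list_inject_entries()
    if entries:
        for name in entries:
            print(name)
    else:
        print("no inject_* entries found")
```

An empty result would suggest that the option was not compiled in or that the driver did not bind to the memory controller. It says nothing about whether the BIOS or the CPU permits injection.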
Nevertheless, I have not run the test yet: the Linux documentation is out of date, so I first need to read the source code of the AMD EDAC driver, when I have time.
On old kernels, e.g. 4.9 or older, the ECC documentation was in:
/usr/src/linux/Documentation/edac.txt
On new kernels, e.g. 4.14 or newer, the ECC documentation is in:
/usr/src/linux/Documentation/admin-guide/ras.rst
Additional documentation is in:
/usr/src/linux/Documentation/ABI/testing/sysfs-devices-edac
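Since the location depends on the kernel version, a small helper can report which of these files exist in a given source tree. This is only a sketch: `find_edac_docs` is a hypothetical name, and the default root /usr/src/linux simply follows the paths listed above.

```python
from pathlib import Path

# Relative paths taken from the list above; which ones exist depends on the kernel version.
CANDIDATE_DOCS = (
    "Documentation/edac.txt",
    "Documentation/admin-guide/ras.rst",
    "Documentation/ABI/testing/sysfs-devices-edac",
)


def find_edac_docs(src_root="/usr/src/linux"):
    """Return the candidate EDAC documentation files present under src_root."""
    root = Path(src_root)
    return [str(root / rel) for rel in CANDIDATE_DOCS if (root / rel).is_file()]


if __name__ == "__main__":
    for path in find_edac_docs():
        print(path)
```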
However, all those documents are useless for my test, because they describe how to inject errors only on Intel systems. AMD systems have completely different sysfs entries, which are not mentioned anywhere in the Linux documentation.
I hate it when developers modify the source code without also updating the documentation, but at least we have the source of the AMD driver, so I should be able to work out from it which values I must write into sysfs, and where.
I might not have time to do the test today, but I will do it by tomorrow.