By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), January 7, 2021 10:26 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on January 7, 2021 8:05 am wrote:
>
> Single-bit flips are not a problem for attacker, in theory they are monitored, but not quickly
> enough to react.
Why do you think that is the case?
All the realistic (as opposed to "look, I can do this") cases for attackers are based on a huge amount of luck. For an attacker, it's not enough to flip a bit - you need to flip just the right bit to get anywhere else. The realistic situations depend on massive parallel attacks.
You'd not attack a single machine, for example, because it would just take too long (and if you care about one particular machine that much, and have the resources, you'd be doing other things anyway). You might write an otherwise useful mobile app, try to get hundreds of thousands of people to install it, and then do the rowhammer attack in the background.
It might take millions of CPU hours to actually get something useful as an attack, and if the phone vendor had ECC and was gathering reliability data - including single-bit ECC events - there is no reason to believe that it wouldn't be noticed fairly quickly.
And once you see "oh, something is causing ECC errors", finding the bad actor by just correlation with installed and running apps shouldn't be a huge deal.
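To make the correlation step concrete, here is a minimal sketch, assuming the vendor already collects one report per device with the set of installed apps and a flag for whether that device logged any ECC events. The report format, the function name `rank_apps_by_ecc_rate` and the `min_devices` cutoff are all made up for illustration; a real pipeline would need proper statistics and a lot more data.

```python
from collections import defaultdict


def rank_apps_by_ecc_rate(reports, min_devices=2):
    """Rank apps by how much more often devices that have them see ECC events.

    reports: iterable of (installed_apps, had_ecc_events), one per device.
             This shape is a hypothetical telemetry format, not a real API.
    Returns a list of (app, rate_with_app, rate_without_app), with the most
    suspicious app (largest rate difference) first.
    """
    reports = [(frozenset(apps), bool(hit)) for apps, hit in reports]
    total = len(reports)
    total_hits = sum(hit for _, hit in reports)

    # app -> [devices with the app, of those how many saw ECC events]
    with_app = defaultdict(lambda: [0, 0])
    for apps, hit in reports:
        for app in apps:
            with_app[app][0] += 1
            with_app[app][1] += hit

    ranked = []
    for app, (n, hits) in with_app.items():
        rest = total - n
        if n < min_devices or rest == 0:
            continue  # too few devices to compare against the rest
        rate_with = hits / n
        rate_without = (total_hits - hits) / rest
        ranked.append((app, rate_with, rate_without))

    # Largest gap between "with app" and "without app" first, name as tiebreak.
    ranked.sort(key=lambda r: (-(r[1] - r[2]), r[0]))
    return ranked


if __name__ == "__main__":
    sample = [
        ({"maps", "flashlight"}, True),
        ({"flashlight", "chat"}, True),
        ({"maps", "chat"}, False),
        ({"chat"}, False),
    ]
    for app, with_rate, without_rate in rank_apps_by_ecc_rate(sample):
        print(f"{app}: {with_rate:.2f} with, {without_rate:.2f} without")
```

The point is only that the comparison is cheap once the ECC events are being reported at all: an app whose devices see corrected errors far more often than everybody else's stands out quickly.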
But the fact is, just a phone having ECC in the first place would have made the attacker probably go "ok, this isn't even worth it".
Seriously, all the "ECC didn't stop rowhammer" stories were the usual security people crying wolf.
The real problem being that ECC is so rare in the mass market that you find that not only does it generally not exist at all (ie the whole "phone with ECC" is just a fever dream of mine right now), but even when people do have ECC, all the infrastructure to take advantage of it is very very weak.
I bet a lot of ECC setups are set to just silently fix ECC errors and not even report them, because there just aren't great tools for it. If any reporting exists, it's probably just a message somewhere else.
And yes, that includes things like the Linux kernel. Because almost nobody has ECC, guess how common and well-supported (and tested) reporting tools and the EDAC drivers are? Even if you have some system monitor that shows you temperature and CPU frequencies etc, do you think it warns about ECC errors? Probably not, because the person who maintains the GUI tools probably doesn't have ECC.
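For reference, when an EDAC driver is loaded the kernel exposes per-memory-controller error counters in sysfs under `/sys/devices/system/edac/mc/mc*/`, as `ce_count` (corrected) and `ue_count` (uncorrected). A minimal sketch of the kind of check a system monitor could do, assuming those files are present and readable; the function names here are made up, and on a machine without ECC or without a matching EDAC driver the directory simply won't be there:

```python
from pathlib import Path

# Standard EDAC sysfs location; only populated when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} for each mc* directory."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC support visible on this system
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters can't be read
        counts[mc.name] = (ce, ue)
    return counts


def ecc_warnings(counts):
    """Build one warning line per controller that has logged any error."""
    return [
        f"{name}: {ce} corrected, {ue} uncorrected ECC errors"
        for name, (ce, ue) in sorted(counts.items())
        if ce or ue
    ]


if __name__ == "__main__":
    for line in ecc_warnings(read_edac_counts()):
        print("WARNING", line)
```

The counters are cumulative since the driver was loaded, so a real tool would want to remember the previous values and warn on changes rather than on any non-zero count.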
This is the whole problem with a weak ECC market. Yes, it makes ECC more expensive and harder to find, but it also makes ECC not work as well as it should. You've seen all the same postings I have about how people do extra work just to make sure ECC works - because out of the box, ECC support is just not great (starting with the motherboards, but also the BIOS, the kernel, the user space admin tools etc etc etc).
So AMD having ECC helps. But let's face it - even if everybody got religion tomorrow, and the DDR standards body got their heads out of their nether regions, it would take years to undo the effects of decades of lacking ECC support.
And yes, this has been a pet peeve of mine for decades. Back when, Intel used to have these "Intel Technology Days" events every year for Linux kernel people, where they discussed upcoming technologies etc. ECC was my #1 ask for years. To the point where I would start my "What does the kernel want from Intel" with "I mentioned ECC last year, I'll mention it again".
Linus