By: Jason Snyder (jmcsnyder.delete@this.hotmail.com), January 14, 2021 7:32 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on January 5, 2021 12:29 pm wrote:
> I've seen some ridiculous garbage in this thread, like "gamers don't want ECC". That's total drivel. Gamers
> that build high-end gaming workstations for overclocking should be some of the main target of ECC, because
> without ECC you don't really know if - and how high - you can safely overclock your RAM.
>
> ECC is safer under normal circumstances, but it also allows you to do more and live
> on the edge more, in other words. You can actually see when you're getting too close
> to the edge when the machine starts reporting a lot of correctable errors!
>
> Don't fall for the bullshit. ECC is not for servers. ECC is for everybody, and wanting
> to pay a bit extra for RAM shouldn't mean that you are then limited in other ways.
>
> Linus
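Quick aside on watching for those correctable errors: on Linux they show up under the EDAC sysfs tree, so you can literally watch the counters climb while stress-testing a memory overclock. A minimal sketch, assuming the kernel EDAC driver for your memory controller is loaded and the controllers show up as mc0, mc1, and so on:

#!/usr/bin/env python3
# Watch EDAC correctable/uncorrectable memory error counters while a
# memory stress test runs. Assumes the EDAC driver is loaded and exposes
# /sys/devices/system/edac/mc/mc*/ce_count and ue_count.
import glob
import os
import time

def read_counts():
    counts = {}
    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        if not os.path.isdir(mc):
            continue
        with open(os.path.join(mc, "ce_count")) as f:
            ce = int(f.read())
        with open(os.path.join(mc, "ue_count")) as f:
            ue = int(f.read())
        counts[os.path.basename(mc)] = (ce, ue)
    return counts

baseline = read_counts()
while True:
    time.sleep(60)
    for mc, (ce, ue) in read_counts().items():
        d_ce = ce - baseline.get(mc, (0, 0))[0]
        d_ue = ue - baseline.get(mc, (0, 0))[1]
        if d_ce or d_ue:
            # A climbing ce_count means ECC is quietly correcting bit flips:
            # the memory overclock is too close to the edge.
            print(f"{mc}: +{d_ce} correctable, +{d_ue} uncorrectable")

rasdaemon will log the same events in more detail, but even the raw counters are enough to tell whether a RAM overclock that passes a quick memtest is actually marginal.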
Granted, it would be nice to have ECC in every system, but I think overclocking is overrated these days. One problem is that games, especially on high-resolution displays, are usually GPU bound or bound by the refresh rate of the display. Games look smoothest when you generate frames at the display's refresh rate, not more or less. (And if you have a variable refresh rate display, having a few frames blow their timing window is no big deal, because the goal post just moves a little here and there and all is good.)

The second problem, and I have a custom-loop liquid-cooled gaming rig, is that gaming CPUs tend to have a very high temperature delta between the die and the IHS. I found my top-of-the-line Nvidia GPU would maybe reach 45C at sustained full load on a warm day, while the CPU, sitting first in the loop, would hit the 80s C at the same time. I got that down to the low 60s C by de-lidding the CPU, applying a liquid metal thermal compound, and fitting a custom IHS to maximize heat dissipation, and that is with the CPU at stock speeds. These CPUs are already so close to the limit, with a non-linear increase in power consumption as you overclock plus stability issues, that you really can't go much beyond stock. So the choice is: risk destroying your computer for a very small bump you will never notice, or just keep it at stock? Where I really saw a speed bump was moving the GPU to liquid cooling: with the stock cooler it would reach ~1,600 MHz and thermally throttle to ~180W to hold 85C, while with liquid cooling I could push it to 2,050 MHz and 300+W and still keep temps down to 45C (or less on a cooler day).
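To put rough numbers on that non-linear power point: dynamic power scales roughly with C*V^2*f, and higher clocks generally need more voltage to stay stable, so a small frequency bump costs a disproportionate number of watts. A back-of-the-envelope sketch; the voltage steps below are illustrative guesses, not measurements from any particular CPU:

# Rough illustration of why overclocking power grows much faster than clock.
# Dynamic power ~ C * V^2 * f. The voltage/frequency points are made up for
# illustration; the real V/f curve is specific to the individual chip.
stock_f, stock_v = 4.7, 1.25  # GHz, volts
points = [
    (4.7, 1.25),  # stock
    (5.0, 1.32),  # mild overclock, needs a voltage bump for stability
    (5.2, 1.40),  # aggressive overclock
]
for f, v in points:
    rel_clock = f / stock_f - 1
    rel_power = (v / stock_v) ** 2 * (f / stock_f) - 1
    print(f"{f:.1f} GHz @ {v:.2f} V: {rel_clock:+.0%} clock, {rel_power:+.0%} dynamic power")

Roughly +10% clock for +40% power with those numbers, and all of that extra heat has to get through the already-poor die-to-IHS interface, which is why I don't think the very small bump is worth it.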
Considering the above, if my gaming system does mess up, I just replace the failed part (say, bad RAM), re-install any corrupted games, and I am back up and running, since all game state is saved in the cloud. It doesn't really matter if the system became unstable at some point, so there is not much point in pushing the CPU into unstable territory. When the 3-year-old Intel SSD failed about a month ago at a P/E cycle count of around 7 (which is kind of sad to think about, given that I only had to replace the SSD array in my main Linux box after those drives got close to their 1,500 P/E cycle design limit), I replaced it with a Samsung SSD, did a Clonezilla restore (a little tricky, as the Samsung SSD turned out to be a few GB smaller), let the system do its automatic updates, did a couple of re-installs, and was good to go.
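If you want to keep an eye on drive wear before it gets to that point, smartctl reports it; the attribute names vary by vendor (Media_Wearout_Indicator on Intel SATA drives, Wear_Leveling_Count on Samsung, "Percentage Used" in the NVMe health log), so this sketch just filters smartctl -A output for the usual suspects:

#!/usr/bin/env python3
# Quick SSD wear check: run smartctl -A and print the wear-related lines.
# Attribute names differ by vendor, so an empty result just means "read the
# full smartctl output by hand". Usually needs root.
import subprocess
import sys

WEAR_KEYWORDS = (
    "Media_Wearout_Indicator",  # Intel SATA SSDs
    "Wear_Leveling_Count",      # Samsung SATA SSDs
    "Percentage Used",          # NVMe health log
    "Total_LBAs_Written",
)

device = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
out = subprocess.run(["smartctl", "-A", device],
                     capture_output=True, text=True, check=False).stdout
for line in out.splitlines():
    if any(key in line for key in WEAR_KEYWORDS):
        print(line.strip())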
Where the lack of ECC has really hit me is at work. Work usually provides a non-ECC desktop or laptop, I start doing my software development on it, and it screws up and corrupts things. Usually this takes the form of a partial system crash or flashing artifacts on the screen, but sometimes it is funny characters showing up in random spots: say I get a weird error on a line that looks fine, and the corrupted character(s) only show up after closing and re-opening the source file. Even when it is not my system, I pull updates from the repo and find my co-worker's changes have funny characters in them, because their non-ECC system screwed up and corrupted the code base.
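The committed side of the repo is at least content-hashed, so git fsck --full will flag objects that got flipped on disk; the nastier case is corruption in the working tree before you commit. A rough sketch of one way to catch that, hashing the tree between sessions (the manifest file name is just my own convention, and it can't tell an intentional edit from a bit flip, so it is most useful on files you haven't touched):

#!/usr/bin/env python3
# Detect silent changes in a working tree by comparing SHA-256 hashes
# against a manifest saved at the end of the previous session.
# Committed objects are already covered by `git fsck --full`.
import hashlib
import json
import os
import sys

MANIFEST = ".tree-hashes.json"  # arbitrary name, not a git convention

def hash_tree(root):
    hashes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if name == MANIFEST:
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                hashes[os.path.relpath(path, root)] = hashlib.sha256(f.read()).hexdigest()
    return hashes

root = sys.argv[1] if len(sys.argv) > 1 else "."
current = hash_tree(root)
manifest_path = os.path.join(root, MANIFEST)
if os.path.exists(manifest_path):
    with open(manifest_path) as f:
        previous = json.load(f)
    for path, digest in previous.items():
        if path in current and current[path] != digest:
            # Could be your own edit or a flipped bit; flag it either way.
            print(f"differs from last snapshot: {path}")
with open(manifest_path, "w") as f:
    json.dump(current, f, indent=2)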