By: wumpus (wumpus.delete@this.lost.in.a.hole), March 26, 2021 9:45 am
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on March 26, 2021 8:21 am wrote:
> Moritz (better.delete@this.not.tell) on March 20, 2021 5:21 am wrote:
> > What if you could completely rethink the general processor concept?
>
> As already noted, this is a huge topic; it deserves a response on
> the scale of Donald Knuth's The Art of Computer Programming.
>
I think you've hit the elephant in the room. A modern core is a means of exploiting the locality of ~8-64 kB of cache at ~1 ns latency. With sufficiently predictable code it might even exploit the locality of a 256-512 kB L2, but it would seem wise not to design around that (floating-point code might be an exception).
Most of the exotic schemes I've seen for decoupling serial execution don't seem to make any concessions to the size of the caches used, or to how far from the registers and execution units they will have to sit. Make the cache too big and the latency goes up. Pack too many execution units (for high IPC) around the L1 caches and the latency goes up. Increase the number of L1 caches and you now have a much bigger networking problem (and creating an effective kilocore processor might be a good first step toward such a computer; one huge problem at a time).
As you mentioned, one way to avoid this would be prefetching good enough that latency and/or size stop being critical (you are effectively pulling from a higher cache level at the latency of a lower one). I've been gobsmacked at the size of Zen's branch prediction unit: a massive block processing huge amounts of data about the CPU's current state. I strongly suspect that anyone who needs a good 'ask the oracle' answer (such as prefetching, and possibly which microinstruction to send next to which pipeline) would use the branch prediction unit as that oracle.
Another way to circumvent this issue would be a 3D chip (Veedrac suggested a next-generation machine would need to work in 3D; I suggested that many such designs might not work without it). With the L1 underneath the logic, and presumably with short, low-inductance wires, it could have reasonable latency along with large size and large logic pipelines.