By: wumpus (wumpus.delete.delete.delete@this.this.this.lost.in.a.hole), March 21, 2021 12:50 pm
Room: Moderated Discussions
Veedrac (ignore.delete@this.this.com) on March 20, 2021 11:27 am wrote:
> Moritz (better.delete@this.not.tell) on March 20, 2021 5:21 am wrote:
> > What if you could completely rethink the general processor concept?
>
> There are a thousand things a CPU does wrong. Memory is broken and wasteful. Protection mechanisms
> are archaic. The x86 encoding is garbage and distributing arch-specific binaries is sacrilege.
> Reorder buffers are irritatingly inefficient. SIMD instruction sets don't even try.
>
> But all of those are second order. You need to fix speculation. You cannot hope
> to have 30+ IPC unless you can speculate unboundedly, in a way that's immune to
> branch prediction, false memory hazards, loops, function calls, and so on.
>
> As far as I know, this means microthreads. So if I was to rewrite the world, I'd start by figuring
> out to build a core to handle a hundred plus microthreads with zero overhead, and work from
> there. I don't think there's anything fundamentally in the way of a core like this.
>
> (An additional thing I'd keep in mind is to make sure it's efficiently extensible to monolithic
> 3D silicon, since I'd like it to last, and monolithic 3D is a physical inevitability.)
Monolithic 3D might be required for such an architecture (ignoring the whole idea of rebuilding a trillion-dollar infrastructure). One of the obvious physical difficulties with most of these "huge number of pipelines executing a single process" designs is getting the pipelines near the L1 caches. Too much latency to the cache and your serial execution suffers (and Amdahl wallops you); try to get away from that and you wind up with a modern CPU design.
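To put a number on how hard Amdahl wallops you, here's a minimal sketch of the speedup formula with hypothetical, illustrative figures (the 95%/100-pipeline split and the 2x latency penalty are my assumptions, not anything from the designs under discussion):

```python
# Amdahl's law: the serial fraction caps the speedup no matter how
# many pipelines you add. serial_slowdown models extra L1 latency
# inflating the cost of the serial portion of the work.
def amdahl_speedup(parallel_fraction, n_units, serial_slowdown=1.0):
    serial = (1.0 - parallel_fraction) * serial_slowdown
    parallel = parallel_fraction / n_units
    return 1.0 / (serial + parallel)

# 95% parallelizable work spread over 100 pipelines:
print(amdahl_speedup(0.95, 100))        # ~16.8x, nowhere near 100x
# Same workload, but added cache latency doubles the serial cost:
print(amdahl_speedup(0.95, 100, 2.0))   # ~9.1x
```

The point being that even a modest latency hit to the serial path roughly halves the overall gain, which is why keeping the pipelines close to the L1 matters so much.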
Having all the cache directly underneath the execution areas could allow L1s that are far larger, with low latency and vast bandwidth, without many of the issues of snooping/cache coherency (although maintaining such coherency among more than one such processor will of course be interesting). I'd assume the "core" would be on top near the heatsink, then some network to route loads and stores by address, and finally the cache. Even with low-leakage transistors, the short distances should give you the speed you need.
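For a sense of what "route loads and stores by address" could mean, here's a hypothetical sketch of address-interleaved bank selection (the bank count and line size are made-up parameters, and a real design would pick the bits more carefully to avoid pathological strides):

```python
# Hypothetical address-to-bank routing: drop the byte offset within a
# cache line, then interleave consecutive lines across the banks so
# that streaming accesses spread load evenly over the stacked cache.
def bank_for_address(addr, n_banks=16, line_bytes=64):
    line_number = addr // line_bytes   # which cache line this byte is in
    return line_number % n_banks       # which bank holds that line

# Consecutive cache lines land in consecutive banks:
print(bank_for_address(0))      # bank 0
print(bank_for_address(64))     # bank 1
print(bank_for_address(1024))   # wraps back to bank 0 with 16 banks
```

The routing network between the execution layer and the cache layer would steer each access by a few address bits like this, rather than by anything resembling a coherence lookup.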
Without the 3D caches, expect to rely on either tiny pipelines (and the control logic for same) or magic prefetchers. This seems to be a hot research topic, but I've yet to hear of somebody shipping a product (likely because of the infrastructure issue, and because it can't beat a GPU at machine learning).