By: noone (blah.delete@this.blah.com), March 21, 2021 9:15 am
Room: Moderated Discussions
Veedrac (ignore.delete@this.this.com) on March 21, 2021 1:33 am wrote:
> Chester (lamchester.delete@this.gmail.com) on March 20, 2021 7:54 pm wrote:
> > anon (anon.delete@this.gmail.com) on March 20, 2021 6:04 pm wrote:
> > > Veedrac (ignore.delete@this.this.com) on March 20, 2021 11:27 am wrote:
> > > > But all of those are second order. You need to fix speculation. You cannot hope
> > > > to have 30+ IPC unless you can speculate unboundedly, in a way that's immune to
> > > > branch prediction, false memory hazards, loops, function calls, and so on.
> > > >
> > > > As far as I know, this means microthreads. So if I was to rewrite the world, I'd start by figuring
> > > > out how to build a core to handle a hundred plus microthreads with zero overhead, and work from
> > > > there. I don't think there's anything fundamentally in the way of a core like this.
> > >
> > > https://en.wikipedia.org/wiki/Cray_MTA
> > >
> > > for a historical example of a many threaded core.
> > >
> > > Similarly, IBM Power chips often have 8-way SMT (for the same reason of trying to always have
> > > work that you can make forward progress on). How would what you are suggesting differ from these?
> >
> > I think Veedrac wants to build a GPU. There you definitely have "a hundred plus microthreads" per
> > core (where core is an Nvidia SM, AMD CU, or Intel subslice), with zero overhead. Each thread is
> > generally independent of other threads, which means you can reorder execution unboundedly.
>
> You're both thinking ‘solve the execution problem’. I don't care about the execution problem
> right now. The 128 core Ampere Altra Max is months away, and while GPUs presumably have room
> to improve, they're already (as you say) way faster than anything I'm proposing.
>
> What I'm trying to solve is the *speculation* problem; the issue that two independently executable
> instructions in your program might not be a mere 600 instructions apart, such that Apple's
> ROB might be able to handle it, but 10,000 instructions, or 10,000,000, or maybe just a couple
> of thousand but with multiple loops of unpredictable length between them.
>
> This is distinct from multicore and GPU approaches because you might still just have a single pool of execution
> units that each cycle some subset of the microthreads are dispatched to, though you have more flexibility
> to not do that since sufficiently distant microthreads need not be able to talk to each other with a zero-cycle
> latency. And it's distinct from SMT because thread-to-thread communication is just register wakeup, albeit
> shaped a tad differently, and you can spawn a thread on practically every branch.
>
The fundamental challenge with this idea is the memory dependencies between these microthreads and how to deal with them.
You should take a look at MIT's Swarm architecture, because I think it's exactly what you're talking about. The whole architecture builds an enormous, distributed reorder buffer out of small hardware ROBs, with task spawning and coordination controlled by software, and, most importantly, it has a way of dealing with the memory dependencies between the spawned hardware threads. It's definitely not simple, though.
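
To give a flavour of the model: below is a rough C++ sketch of the task-and-timestamp idea as I understand it from the papers. The swarm_enqueue primitive, the Node type, and the insert task are mine for illustration, not Swarm's actual API; the stub just runs each task immediately (i.e. in timestamp order), where the real hardware would run thousands of such tasks speculatively out of order across many contexts, track the memory each one touches, abort and re-run any task that conflicts with a logically earlier one, and commit everything in timestamp order.

#include <cstddef>
#include <cstdint>

struct Node {
    uint64_t key;
    uint64_t count;
    Node* left;
    Node* right;
};

// Hypothetical primitive standing in for a hardware task enqueue: spawn
// fn(args...) as a tiny ordered task with the given timestamp. This stub runs
// the task immediately, which matches the committed (timestamp-order)
// semantics; real Swarm hardware would execute such tasks speculatively out
// of order and squash conflicting ones.
template <typename Fn, typename... Args>
void swarm_enqueue(uint64_t timestamp, Fn fn, Args... args) {
    (void)timestamp;
    fn(args...);
}

// Insert a key into a counting BST. The walk down the tree has unpredictable
// length, so instead of looping, each step re-enqueues the rest of the walk
// as a new task at the same timestamp - the "spawn a thread on practically
// every branch" idea, without needing one giant ROB window to hold it all.
void insert(uint64_t ts, Node* node, uint64_t key) {
    if (node->key == key) {
        node->count++;                              // potential conflict with another task
        return;
    }
    Node*& child = (key < node->key) ? node->left : node->right;
    if (child == nullptr) {
        child = new Node{key, 1, nullptr, nullptr}; // another potential conflict
        return;
    }
    swarm_enqueue(ts, insert, ts, child, key);
}

int main() {
    Node root{1000, 1, nullptr, nullptr};
    const uint64_t keys[] = {42, 7, 42, 1000, 999999, 7};
    for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
        swarm_enqueue(i + 1, insert, i + 1, &root, keys[i]); // timestamp = program order
    return 0;
}

The MICRO 2015 paper ("A Scalable Architecture for Ordered Parallelism") goes through how the conflict detection and ordered commit are actually made to scale across tiles.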