By: Veedrac (ignore.delete@this.this.com), March 21, 2021 1:33 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on March 20, 2021 7:54 pm wrote:
> anon (anon.delete@this.gmail.com) on March 20, 2021 6:04 pm wrote:
> > Veedrac (ignore.delete@this.this.com) on March 20, 2021 11:27 am wrote:
> > > But all of those are second order. You need to fix speculation. You cannot hope
> > > to have 30+ IPC unless you can speculate unboundedly, in a way that's immune to
> > > branch prediction, false memory hazards, loops, function calls, and so on.
> > >
> > > As far as I know, this means microthreads. So if I was to rewrite the world, I'd start by figuring
> > > out how to build a core to handle a hundred plus microthreads with zero overhead, and work from
> > > there. I don't think there's anything fundamentally in the way of a core like this.
> >
> > https://en.wikipedia.org/wiki/Cray_MTA
> >
> > for a historical example of a many threaded core.
> >
> > Similarly, IBM Power chips often have 8-way SMT (for the same reason of trying to always have
> > work that you can make forward progress on). How would what you are suggesting differ from these?
>
> I think Veedrac wants to build a GPU. There you definitely have "a hundred plus microthreads" per
> core (where core is an Nvidia SM, AMD CU, or Intel subslice), with zero overhead. Each thread is
> generally independent of other threads, which means you can reorder execution unboundedly.
You're both thinking ‘solve the execution problem’. I don't care about the execution problem right now. The 128-core Ampere Altra Max is months away, and while GPUs presumably have room to improve, they're already (as you say) way faster than anything I'm proposing.
What I'm trying to solve is the *speculation* problem: the issue that two independently executable instructions in your program might not be a mere 600 instructions apart, a gap Apple's ROB might be able to cover, but 10,000 instructions apart, or 10,000,000, or maybe just a couple of thousand but with multiple loops of unpredictable length between them.
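To make that gap concrete, here's a toy C sketch (the function names are mine, purely for illustration): step_a() and step_b() are completely independent, but a pointer-chasing loop of statically unknown length sits between them, so the distance between the two independent instruction streams in the dynamic trace is unbounded and no realistic ROB can hold both in flight at once.

    /* Toy illustration (hypothetical workload): step_b() does not depend on
     * step_a(), but a data-dependent walk of unpredictable length sits
     * between them, so the two independent chunks can be thousands or
     * millions of dynamic instructions apart. */
    #include <stddef.h>

    long step_a(const long *x, size_t n) {        /* first independent chunk */
        long acc = 0;
        for (size_t i = 0; i < n; i++) acc += x[i];
        return acc;
    }

    size_t chase(const size_t *next, size_t i) {  /* trip count unknown statically */
        while (next[i] != i) i = next[i];
        return i;
    }

    long step_b(const long *y, size_t n) {        /* independent of step_a() */
        long acc = 0;
        for (size_t i = 0; i < n; i++) acc += y[i] * 3;
        return acc;
    }

    long run(const long *x, const long *y, const size_t *next, size_t n) {
        long a = step_a(x, n);
        size_t j = chase(next, 0);                /* unbounded gap between a and b */
        long b = step_b(y, n);
        return a + b + (long)j;
    }

A microthread model would let the core make progress on step_b()'s thread while step_a() and the chase() walk are still in flight, without having to predict its way through the loop.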
This is distinct from multicore and GPU approaches because you might still have just a single pool of execution units to which some subset of the microthreads is dispatched each cycle, though you have more flexibility not to do that, since sufficiently distant microthreads need not be able to talk to each other with zero-cycle latency. And it's distinct from SMT because thread-to-thread communication is just register wakeup, albeit shaped a tad differently, and you can spawn a thread on practically every branch.
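For what "thread-to-thread communication is just register wakeup" might look like, here's a minimal scheduling-only sketch in C (all the names, uthread, write_reg, waiting_on, are mine and not a claim about any real design): a microthread spawned at a branch parks on a source register it hasn't received yet, and the producing microthread waking it is just the write of that register, much like an OoO scheduler broadcasting a result tag to waiting uops.

    /* Minimal sketch (all names hypothetical) of the register-wakeup idea.
     * Only the scheduling primitive is modelled; no speculation, recovery,
     * or spawn-on-branch mechanics. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NREGS 4

    typedef struct uthread {
        int  id;
        long regs[NREGS];
        int  waiting_on;    /* -1 = runnable, else index of the register it's parked on */
    } uthread;

    /* Producer writes a register into a consumer; if the consumer was parked
     * on that register, this write is the wakeup. */
    static void write_reg(uthread *consumer, int reg, long value) {
        consumer->regs[reg] = value;
        if (consumer->waiting_on == reg)
            consumer->waiting_on = -1;           /* consumer becomes dispatchable */
    }

    static bool runnable(const uthread *t) { return t->waiting_on < 0; }

    int main(void) {
        /* "Spawn on a branch": t1 is the continuation past the branch and
         * only needs r0 from t0 to make progress. */
        uthread t0 = { .id = 0, .waiting_on = -1 };
        uthread t1 = { .id = 1, .waiting_on = 0 };   /* parked on r0 */

        printf("t1 runnable before wakeup? %d\n", runnable(&t1));  /* 0 */
        write_reg(&t1, 0, 42);                       /* t0 produces r0 */
        printf("t1 runnable after wakeup?  %d\n", runnable(&t1));  /* 1 */
        printf("t1 r0 = %ld\n", t1.regs[0]);
        (void)t0;
        return 0;
    }

The point of the sketch is only that the synchronization primitive is the same wakeup an OoO core already performs for dependent uops, not a heavyweight inter-thread message; the hard parts (recovery, spawn cost, register file plumbing) are exactly what's not modelled here.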