Cray MTA avoided caches

By: Rayla, May 20, 2021 10:28 am
Room: Moderated Discussions
dmcq on May 20, 2021 10:09 am wrote:
> Paul A. Clayton on May 20, 2021 6:36 am wrote:
> > Romain Dolbeau on May 19, 2021 4:05 am wrote:
> > > Little Horn on May 17, 2021 5:03 pm wrote:
> > > > Thoughts?
> > >
> > > Long before I reached the end of the paper (my bad, I know), the Cray MTA (formerly
> > > Tera) architecture came back to my mind... Massive multithreading didn't work
> > > then, didn't immediately see a reason why it would work today...
> >
> > Cray MTA avoided caches (which also assumes word-granular memory interfaces, implying a greater command
> > overhead and narrow memory channels (to support dense high-bandwidth DRAM using long bursts)). I
> > have not attentively read the paper, but it does assume caches and seems to assume a thread switch
> > latency of up to tens of cycles (compared to MTA's any thread immediately executable).
> >
> > (There was also no mention of MIPS MT Application Specific Extension, which did slightly
> > distinguish between a thread context and a virtual processing element.)
> >
> > I had composed a partial response to the original post, but now I think I will try to actually read
> > the paper (and the responses here) and compose a more considered response. From what I have read, the
> > authors seem to lack familiarity with hardware designs (particularly the GPU description and no mention
> > of 3D register files). I like the general idea of hardware having a larger role in thread scheduling,
> > but it seemed (from cursory reading) that the specific proposal was not well-thought-out.
> I don't know about the Cray design but it seems to me from what's described that it
> was based on the principle of a GPU, slower but much wider allowing lots of data to
> get around to where it is needed. A good choice for large computational problems.

A crucial difference is that GPUs map "threads" onto vector lanes, so control flow is potentially painful and expensive: when threads in the same warp diverge, both sides of the branch are executed. On the XMT, you have fine-grained multithreading (FGMT) across 128 hardware threads, each of which is fully independent. The workloads the XMT was built for were explicitly those with highly irregular memory access patterns and control flow - which are not typical candidates for a GPU.
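To make the difference concrete, here's a toy sketch (my own illustration, not anything from the paper - warp width and cost model are simplified assumptions) of why divergent control flow costs extra passes on a SIMT machine but not on an MTA-style FGMT machine:

```python
# Toy cost model: on a SIMT (GPU-style) machine, a warp whose lanes
# disagree on a branch must execute each distinct path in turn, with
# the non-participating lanes masked off. On an FGMT machine with
# fully independent threads (MTA/XMT-style), each thread simply
# issues its own path; no thread waits for its neighbours' branches.

WARP_WIDTH = 32  # assumed warp size for illustration

def simt_passes(branch_taken):
    """Number of execution passes a warp needs for one branch:
    one pass per distinct path taken by the lanes."""
    return len(set(branch_taken))

def fgmt_passes(branch_taken):
    """Independent threads each issue their own path: one pass,
    regardless of how the threads diverge."""
    return 1

uniform   = [True] * WARP_WIDTH                  # all lanes agree
divergent = [i % 2 == 0 for i in range(WARP_WIDTH)]  # lanes split

print(simt_passes(uniform), simt_passes(divergent))  # 1 2
print(fgmt_passes(uniform), fgmt_passes(divergent))  # 1 1
```

With irregular, data-dependent branching, the SIMT penalty compounds at every divergent branch, while the FGMT cost stays flat - which is exactly the regime the XMT targeted.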