Cray MTA avoided caches

By: Rayla, May 20, 2021 10:28 am
Room: Moderated Discussions
dmcq on May 20, 2021 10:09 am wrote:
> Paul A. Clayton on May 20, 2021 6:36 am wrote:
> > Romain Dolbeau on May 19, 2021 4:05 am wrote:
> > > Little Horn on May 17, 2021 5:03 pm wrote:
> > > > Thoughts?
> > >
> > > Long before I reached the end of the paper (my bad, I know), the Cray MTA (formerly
> > > Tera) architecture came back to my mind... Massive multithreading didn't work
> > > then, didn't immediately see a reason why it would work today...
> >
> > Cray MTA avoided caches (which also assumes word-granular memory interfaces, implying a greater command
> > overhead and narrow memory channels (to support dense high-bandwidth DRAM using long bursts)). I
> > have not attentively read the paper, but it does assume caches and seems to assume a thread switch
> > latency of up to tens of cycles (compared to MTA's any thread immediately executable).
> >
> > (There was also no mention of MIPS MT Application Specific Extension, which did slightly
> > distinguish between a thread context and a virtual processing element.)
> >
> > I had composed a partial response to the original post, but now I think I will try to actually read
> > the paper (and the responses here) and compose a more considered response. From what I have read, the
> > authors seem to lack familiarity with hardware designs (particularly the GPU description and no mention
> > of 3D register files). I like the general idea of hardware having a larger role in thread scheduling,
> > but it seemed (from cursory reading) that the specific proposal was not well-thought-out.
> I don't know about the Cray design but it seems to me from what's described that it
> was based on the principle of a GPU, slower but much wider allowing lots of data to
> get around to where it is needed. A good choice for large computational problems.

A crucial difference is that GPUs map "threads" onto vector lanes, so control flow is potentially painful and expensive: when threads in the same warp diverge, both sides of the branch are executed. On the XMT, you have fine-grained multithreading (FGMT) across 128 hardware threads, each of which is fully independent. The workloads the XMT was built for were explicitly those with highly irregular memory access patterns and control flow - which are not typical candidates for a GPU.
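To make the difference concrete, here's a toy sketch (my own illustration, not anything from the paper - warp width and cost model are simplified assumptions) of why divergent control flow costs extra passes on a SIMT machine but not on an MTA-style FGMT machine:

```python
# Toy cost model: on a SIMT (GPU-style) machine, a warp whose lanes
# disagree on a branch must execute each distinct path in turn, with
# the non-participating lanes masked off. On an FGMT machine with
# fully independent threads (MTA/XMT-style), each thread simply
# issues its own path; no thread waits for its neighbours' branches.

WARP_WIDTH = 32  # assumed warp size for illustration

def simt_passes(branch_taken):
    """Number of execution passes a warp needs for one branch:
    one pass per distinct path taken by the lanes."""
    return len(set(branch_taken))

def fgmt_passes(branch_taken):
    """Independent threads each issue their own path: one pass,
    regardless of how the threads diverge."""
    return 1

uniform   = [True] * WARP_WIDTH                  # all lanes agree
divergent = [i % 2 == 0 for i in range(WARP_WIDTH)]  # lanes split

print(simt_passes(uniform), simt_passes(divergent))  # 1 2
print(fgmt_passes(uniform), fgmt_passes(divergent))  # 1 1
```

With irregular, data-dependent branching, the SIMT penalty compounds at every divergent branch, while the FGMT cost stays flat - which is exactly the regime the XMT targeted.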