By: Maynard Handley (name99.delete@this.name99.org), January 3, 2014 5:01 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on January 3, 2014 2:12 am wrote:
> Maynard Handley (name99.delete@this.name99.org) on January 3, 2014 12:06 am wrote:
> > Maynard Handley (name99.delete@this.redheron9.com) on December 31, 2013 5:26 pm wrote:
> >
> > > Summary: I think it's true (and mostly agreed) that SW prefetching is dead. What's not widely known is
> > > the extent of interesting HW replacements, or the extent to which any of these are yet implemented.
> >
> > In the context of what I wrote earlier, there seems to be an emerging consensus around future
> > architectures that takes the form of slipping long-latency instructions (and their chains of
> > dependent instructions, possibly hundreds to thousands of instructions long) aside to get at
> > the independent instructions that can run. The details are now only in exactly how this is done
> > so as to replace the ROB with as low power, low area, and low complexity as possible.
> > Along these lines we have FlowForward, CPR (Checkpoint Processing
> > and Recovery) and its successor/amplification
> > CFP (Continual Flow Processing) and now I see DOE, Disjoint Out-of-Order Execution
> > http://j92a21b.ee.ncku.edu.tw/broad/report100/2012-12-24/Disjoint%20OOO%20Execution%20Proc%2012.pdf
> >
> > Looked at from a thousand miles up, this last one, in particular,
> > sounds somewhat like Apple's infamous MacroScalar
> > stuff I mentioned in my last post... The details and the concentration differs, sure, but the abstract view
> > seems, in all these cases, to be to create, on the fly, long long LONG chains of instructions such that all
> > instructions in a chain are dependent, but the various chains
> > are independent of each other (except to the extent
> > that they fork from the occasional starting point, and join
> > again at join point). Once you have these chains,
> > it's a somewhat orthogonal question whether you run them
> > on the same "CPU" (CFP, FlowForward), on kinda sorta
> > but not quite the same CPU (MacroScalar), or different CPUs (kinda sorta the DOE stuff).
> >
> > Does anyone have an opinion on how real this stuff is? The CPR/CFP/DOE chain of ideas is
> > based on people at Intel, but that obviously doesn't mean Intel are ready to bet on it.
> > Part of me thinks it would take working silicon from a university (kinda like
> > SPARC/MIPS in the 80s) to really validate the idea and make it worth swapping
> > in for the tried and trusted OoO ROB engines that everyone uses today.
> > And part of me hopes that Apple, as the one player in this space that has not been burned by over-ambition
> > on the CPU front, might just be audacious enough to look
> > at these numbers ("hmm, we can get a CPU that's about
> > 50% faster than our existing A7, in smaller area and lower
> > power, and that will scale well to higher frequencies.
> > Hell, let's take on Intel and ARM head-on and go into the business of selling CPUs to everyone")
> > Though I'd be just as happy if nVidia or Qualcomm or AMD were desperate enough to make
> > a splash that they took these ideas and ran with them. I've been looking at a bunch of
> > ideas for how to handle memory latency, and this collection seem the most promising.
> >
> > What worries me is that the gap between our optimized OoO engines today and the retooling you'd need for
> > these alternative ideas is so large that it's not easy to get from here to there. I THINK you could do it
> > in stages by starting with a simple (hah!) OoO core like an ARM9 or ARM15 and initially just replacing the
> > ROB with checkpoints. With that working and giving you, say, 20%, you could then, next generation, replace
> > the instruction window/scheduler with the CFP data buffers,
> > giving you another 20% or so, and another sellable
> > product, then finally add in the DOE weirdness to add a few more percent and speed up multi-threaded apps.
> > But even with this slicing up, each stage is a fairly ambitious engineering project...
>
> The investment in OOO is largely why I like speculative multi-threading - it can
> be used in conjunction with OOO and creates an orthogonal level of parallelism.
>
> David
SpMT is orthogonal, yes, but it seems to lead in the direction I suggest.
In the late 80s a cluster of apparently independent ideas (superscalar issue, out-of-order execution, branch prediction) came together in a single model. You could have used any one of them without the others (e.g. the Pentium used superscalar issue and branch prediction but not OoO), but there was a synergistic effect, and they kinda fitted together.
DOE is basically SpMT, but done assuming a CFP core rather than an OoO core. As I suggested in my path to the future, you can slide these things in one at a time, but once you accept the basic concept --- on the fly, I'm going to split my instruction stream into mutually independent substreams and run each of those as best I can --- you're led down a path where you start to ask "why am I still paying all the OoO cost, when what I really want is a rather different set of underlying data structures keeping this whole mess consistent?"
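To make that "split into mutually independent substreams" idea concrete, here is a toy sketch (my own illustration, not taken from any of the cited papers or any real ISA): it groups a linear instruction trace into dependence chains by joining instructions that share a true (read-after-write) register dependence, directly or transitively. WAR/WAW hazards are ignored on the assumption that register renaming removes them. Chains that end up disjoint are exactly the ones a CFP/DOE-style machine could run past each other.

```python
# Toy dependence-chain partitioner. An instruction is (dest_reg, [src_regs]),
# listed in program order. Union-find joins an instruction with the most
# recent producer of each register it reads; disjoint sets = independent chains.

def find(parent, x):
    # Union-find root lookup with path compression.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def partition_chains(trace):
    """Return lists of instruction indices, one list per independent chain."""
    parent = list(range(len(trace)))
    last_writer = {}  # register name -> index of its most recent producer
    for i, (dest, srcs) in enumerate(trace):
        for r in srcs:
            if r in last_writer:  # RAW dependence: merge the two chains
                ra, rb = find(parent, i), find(parent, last_writer[r])
                parent[ra] = rb
        last_writer[dest] = i
    chains = {}
    for i in range(len(trace)):
        chains.setdefault(find(parent, i), []).append(i)
    return list(chains.values())

trace = [
    ("r1", []),        # 0: starts chain A
    ("r2", ["r1"]),    # 1: depends on 0 (chain A)
    ("r5", []),        # 2: independent, starts chain B
    ("r6", ["r5"]),    # 3: depends on 2 (chain B)
    ("r3", ["r2"]),    # 4: depends on 1 (chain A)
]
print(partition_chains(trace))  # two chains: [0, 1, 4] and [2, 3]
```

The real problem, of course, is doing this speculatively and incrementally in hardware, with memory dependences and control flow in the mix, rather than offline over a known trace; this just shows the shape of the partition the hardware is trying to discover.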
But maybe you are right, in the sense that the first move we will see (the safest, in that it builds up some of the necessary infrastructure while letting you switch it off if you can't get it working in time for shipping) is just to add SpMT on top of an existing OoO core to goose it by about 15% or so. Not as much as you'd get from "doing things right", but a safe learning path.