By: Maynard Handley (name99.delete@this.name99.org), November 18, 2020 2:12 pm
Room: Moderated Discussions
David Hess (davidwhess.delete@this.gmail.com) on November 18, 2020 12:12 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on November 18, 2020 8:37 am wrote:
> >
> > SMT is a decision to swap something that is cheap and plentiful (space for an *independent*
> > core on the die) with something that is expensive and in extremely short supply (the SRAM
> > that feeds the predictors and caches that give you all that IPC for a particular core).
> >
> > Explain to me why that is a sensible tradeoff...
>
> SMT is a decision to swap something that is expensive and in extremely short supply (power
> hungry logic) with something that is cheap and plentiful (low power SRAM and state).
>
Logic is only power hungry if you're doing it incorrectly.
If state (appropriately configured...) is so cheap then (to give an obvious example) why don't AMD and Intel copy Apple's monster-sized caches? That really is the essence of it. You can't simultaneously argue that Apple is getting some sort of "unfair advantage" by having very large caches AND that it would be a good design decision for Apple to run its cores in a way that halves the effective size of those caches.
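To put rough numbers on the cache-sharing point: under 2-way SMT, two unrelated threads contend for one core's caches, so each effectively sees about half the capacity. A back-of-the-envelope sketch, using the widely reported M1 Firestorm L1 sizes; the 50/50 split is the argument's idealization, not a measurement:

```python
# Illustrative arithmetic only: real SMT sharing depends on the workload
# mix, but an even split is the natural first-order model.
l1d_kb = 128            # reported M1 performance-core L1 data cache
l1i_kb = 192            # reported M1 performance-core L1 instruction cache

smt_ways = 2            # hypothetical 2-way SMT on such a core
eff_l1d_per_thread = l1d_kb / smt_ways
eff_l1i_per_thread = l1i_kb / smt_ways

print(eff_l1d_per_thread, eff_l1i_per_thread)  # 64.0 96.0
```

Each co-running thread would see roughly the L1 budget of a much smaller core, which is exactly the "halving the effective size" objection.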
As I have said repeatedly, try to consider the problem from the point of view of TODAY's technology, not the past.
*ALL* SMT gives you is optionality, the option to convert your latency machine into a throughput machine. If that optionality is worthless (perhaps because most of your throughput tasks are done by dedicated silicon, perhaps because you provide lots of small cores for throughput tasks) then why bother? All you are doing is adding a whole bunch of complexity for something you don't need or want!
Apple built the M1 as a 4+4 design as a BUSINESS decision, not as a technical decision. For this year, the low end is positioned as 4 performance cores + 4 efficiency cores, filling the role SMT would otherwise play.
They will sell you 8 performance cores soon enough if that's what you want. They don't need SMT to slightly boost their throughput!
There are a bunch of technical ways in which something *like* SMT (but differing in ways that are clear to me but not to people who have not thought about the issue) COULD be important at some point. But what I have in mind is not Intel-style SMT: it shares an address space, and it's managed by the user code, not by the OS.
The point is to look at where SMT is horribly flawed (security, resource contention) and eliminate those cases while retaining the cases where there are no security concerns AND the resources are mostly shared rather than contended; basically, think fibers. I am not at all opposed to a design implementing fibers.
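A minimal sketch of what "fibers" means here: user-scheduled cooperative tasks sharing one address space, switching only at points the code chooses, so unrelated threads never contend and there is no cross-protection-domain leakage. This models the scheduling discipline in Python generators (all names illustrative); it is not a hardware proposal:

```python
# Cooperative fibers: each generator is a fiber, and each `yield` is a
# voluntary switch point chosen by the user code, not by the OS.
from collections import deque

def fiber(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"      # voluntary switch point

def run(fibers):
    ready = deque(fibers)        # user-level run queue
    trace = []
    while ready:
        f = ready.popleft()
        try:
            trace.append(next(f))
            ready.append(f)      # round-robin: requeue after each step
        except StopIteration:
            pass                 # fiber finished; drop it
    return trace

# Two fibers interleave deterministically under user-level control.
print(run([fiber("a", 2), fiber("b", 2)]))  # ['a:0', 'b:0', 'a:1', 'b:1']
```

The hardware analogue would interleave instruction streams from such cooperating fibers rather than from random OS-scheduled threads, which is precisely the shared-not-contended case described above.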
But I also don't much care because it's so trivial it's just not interesting technologically. You add it at some point when you've figured out a few details (the launch model, the sync model, how to get it into your language and compiler) and you move on.
What I DO care about is not polluting the design with the flaws that result from trying to do x86 (and POWER) style SMT of RANDOM co-running threads.