By: Stubabe (Stubabe.delete@this.nospam.com), May 14, 2013 12:09 pm
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on May 11, 2013 4:39 pm wrote:
> Ricardo B (ricardo.b.delete@this.xxxxx.xx) on May 11, 2013 8:07 am wrote:
> > SMT is not really a compromise between client vs server, but on application types.
> >
> > Modern OoO CPU cores have massive execution resources to squeeze
> > out every last inch of single thread performance.
>
> Up to a point. But if you take out the extra logic and registers
> needed to support SMT, you'd be able to clock the core a little faster.
> Maybe not *much* faster, but a little. SMT can't possibly be free.
> And some workloads don't benefit from it.
>
And some do, so what? Why not microcode all but the top 5% of instructions and see how far that gets you...
Also, I don't see a convincing case for a sans-SMT chip being any faster at all. Most SMT resources are really just shared common OoO resources, which can be redirected to a single-threaded workload when only one thread is running. The few replicated resources can be clock/power gated when not in use, so they add little if anything to the single-thread power budget.
In fact, the situation is even better today, since both SMT resource reuse and clock/power gating have steadily improved over recent generations of Intel CPUs. For example, the micro-op queue in Ivy Bridge is no longer statically partitioned between threads, and its full capacity can be made available to a sole active logical CPU. So I really don't see SMT adding more than a rounding error to the power budget when a single logical CPU is active. That leaves the suggestion that SMT-related gate delays are limiting clock speeds, which I view as very unlikely since current chips are clearly power limited, not logic-depth limited.
As for die area: when Intel introduced SMT in the P4, they claimed it added less than 5% to the die, and it is likely far less now. Is throwing that (very slight) die area at adding another core to the chip really going to help in situations where SMT wouldn't?
Multi-core has diminishing returns too: hardware engineers like to complain about serialisation delays hurting the scaling of their designs, hence we need multi-core to use all those transistors. Unfortunately, they seem to conveniently forget that serialisation hurts software scalability too. Low-IPC threads on the same physical core see much lower synchronisation overheads than threads running on different cores, which for sufficiently fine-grained threading (fine grain is often needed to fully extract parallelism in non-trivial cases) may outweigh the cost of sharing a physical core; a way to see this for yourself is sketched below. Certainly, there are threading strategies, such as prefetch threads, that are only conceivable in an SMT-like execution environment.
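To make the synchronisation point concrete, here's a minimal sketch (my own illustration, nothing measured from any particular design) of a ping-pong microbenchmark in C11 on Linux: two threads bounce a flag back and forth, and you compare the round-trip time with the threads pinned to two SMT siblings of one core versus two separate cores. The CPU numbers below are assumptions; sibling numbering varies by machine, so check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list first. Build with gcc -O2 -pthread.

/* Ping-pong latency between two pinned threads. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static atomic_int flag = 0;

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg) {
    pin_to_cpu(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;                                   /* wait for ping */
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    /* Assumed layout: CPUs 0 and 4 are SMT siblings of one core on this
     * machine; try 0 and 1 (separate cores) for the comparison run. */
    int pong_cpu = 4;
    pthread_t t;
    pthread_create(&t, NULL, pong, &pong_cpu);
    pin_to_cpu(0);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;                                   /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip: %.1f ns\n", ns / ROUNDS);
    return 0;
}

On typical SMT parts the sibling-pinned run should show a noticeably shorter round trip, since the flag never has to leave the core's L1 - exactly the effect that can favour fine-grained threading on shared cores.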
OK, so what about using that 5% to make a single fatter core?
Single-thread performance depends on both clock speed and IPC. Increasing the former carries a significant power overhead, as it requires more pipeline stages or more voltage. It also increases relative memory latency, which in turn requires larger OoO windows, load/store queues, etc. to hide it. To increase IPC you need wider decode and execute bandwidth AND larger OoO windows, load/store queues, etc. With Haswell, Intel now has an 8-port execution engine, but they claim only a typical IPC increase of up to 10% despite growing the ALU issue width by a third (three ports to four) and significantly improving the memory subsystem. Single-thread performance is stagnating, and not for want of trying.
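As an illustration of why width alone runs out of steam (a minimal sketch under my own assumptions, not anything Intel published): the first loop below is a single floating-point dependency chain, so its IPC is capped by add latency no matter how many ALUs the machine has; the second performs the same number of adds split across four independent accumulators and can actually use extra issue width. Build with gcc -O2 but without -ffast-math, so the compiler cannot reassociate the FP adds.

#include <stdio.h>
#include <time.h>

#define N 100000000L

static double chain(void) {
    double s = 0.0;
    for (long i = 0; i < N; i++)
        s += 1e-9;                  /* every add depends on the last */
    return s;
}

static double split(void) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i < N; i += 4) {   /* four independent chains */
        s0 += 1e-9; s1 += 1e-9; s2 += 1e-9; s3 += 1e-9;
    }
    return s0 + s1 + s2 + s3;
}

static double secs(double (*f)(void)) {
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    volatile double r = f();        /* keep the call from being elided */
    (void)r;
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main(void) {
    printf("serial chain: %.2fs\n", secs(chain));
    printf("4 chains:     %.2fs\n", secs(split));
    return 0;
}

The serial chain typically runs several times slower despite doing the same work; that's the kind of code where an extra ALU port buys nothing, which is why wider issue needs the bigger windows and better memory subsystem just to average ~10%.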
So we can keep throwing more ALUs, memory ports, or even more cores on a chip while trying to eke out slightly higher top turbo speed bins, all for diminishing gains in a diminishing percentage of software. But keeping/adding a feature that, if not free, is as close to free as makes no odds is a problem because it doesn't help all software? The whole point of SMT is that it borrows the very mechanisms we need for OoO execution in the first place (renaming, superscalar issue, dependency tracking), so why is it so hard to believe it comes at near-zero cost (beyond the extra architectural register state) on a complex OoO chip? That's why I believe SMT was a bad fit in the original Atom: it added complexity that wasn't already there in that in-order design. But then, SMT in the Atom was just a crutch to prop up a poor design.
The assumption here is that without SMT Intel might have invested more in single-threaded performance. I would argue the opposite: without throughput workloads exploiting wide chips via SMT, I doubt Intel's architects could have justified wide, power-efficient designs like Sandy Bridge onwards, since the gains would have been too small in too many cases. But with SMT they can continue to throw resources at fast, fat cores (as that benefits more than just one set of corner cases) rather than giving us a die full of gutless cores that amounts to just another GPU.