By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), May 11, 2013 6:51 pm
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on May 11, 2013 4:39 pm wrote:
> Ricardo B (ricardo.b.delete@this.xxxxx.xx) on May 11, 2013 8:07 am wrote:
>> SMT is not really a compromise between client vs server, but on application types.
>>
>> Modern OoO CPU cores have massive execution resources to squeeze
>> out every last inch of single thread performance.
>
> Up to a point. But if you take out the extra logic and registers
> needed to support SMT, you'd be able to clock the core a little faster.
> Maybe not *much* faster, but a little. SMT can't possibly be free.
> And some workloads don't benefit from it.
If the extra logic is not in a critical path, the performance cost would seem to be limited to lost energy-efficiency benefits (e.g., that portion of the logic might otherwise have been slowed further to save power). I suspect that in many respects this cost is substantially smaller than the variation caused by manufacturing variability.
Obviously some of the extra resources can be used in single-threaded mode. E.g., it might be practical to use a second thread's RAT as a checkpoint for low-confidence branches in single-threaded mode (a relatively expensive checkpoint compared to a single-ported RAM, even if Itanium-like port-sharing tricks were used). For coarse-grained SoEMT, it might be practical to use special low-leakage memory to store an inactive thread (adding a modest amount of otherwise unnecessary logic to support swapping, and requiring one thread to be evicted when entering low-power mode if in multithreaded mode). Another option is reducing total speculative depth when in multithreading mode (with the extra ILP derived from TLP, the reduction in speculation might be a worthwhile tradeoff). (Register file design tricks like that used by the multithreaded Itaniums, where access ports are shared by a pair of registers, could be used for storing checkpointed state [potentially to support transactional memory].)
[snip]
>> But on clients, applications like compilers, game physics and AI, etc, also have similar issues.
>
> 99% of desktop/laptop users are not running compilers at all. As for
> gaming, that's also rather a niche, and in any case I'm skeptical about
> whether current game engines show much benefit from running on 4C/8T
> rather than 4C/4T. Latency matters a lot for gaming, and I'm not at
> all sure that 8 slow threads are better than 4 fast ones.
>
> Most desktops/laptops are mostly running web browsers and office apps
> (word processing, spreadsheet etc). Which don't exploit many
> threads very effectively, if at all.
This argument can be criticized from two directions. First, there is some potential for "ordinary" applications to increase the use of threading. This could be made more probable with better tools and better software and hardware interfaces (which are at least being considered). Being able to spawn a thread with a cost similar to a function call would seem likely to significantly increase the potential utility of multithreading.
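To put rough numbers on the spawn-cost point, here is a hand-wavy sketch of a toy model (my own simplification, not a measurement): the work splits evenly across threads, and each extra thread adds a fixed, serial spawn cost. The function names and the example costs (20,000 ns for an OS-style thread spawn, 50 ns for something closer to a function call) are assumptions chosen only for illustration.

```python
def spawn_speedup(work_ns: float, spawn_cost_ns: float, threads: int = 2) -> float:
    """Speedup over single-threaded execution in the toy model.

    Assumes perfectly divisible work and a fixed spawn cost paid
    serially once for each thread beyond the first.
    """
    parallel_ns = work_ns / threads + spawn_cost_ns * (threads - 1)
    return work_ns / parallel_ns


def break_even_work_ns(spawn_cost_ns: float, threads: int = 2) -> float:
    """Smallest amount of work for which spawning does not lose time.

    Setting speedup to 1 in the model above gives work = cost * threads.
    """
    return spawn_cost_ns * threads


if __name__ == "__main__":
    task_ns = 10_000  # a 10 microsecond task (assumed size)
    for label, cost in (("OS-style spawn", 20_000), ("call-like spawn", 50)):
        print(label, spawn_speedup(task_ns, cost), break_even_work_ns(cost))
```

Under these assumed costs, a 10 microsecond task is slowed down by the expensive spawn and nearly doubles in speed with the cheap one; the break-even task size shrinks in direct proportion to the spawn cost, which is the sense in which cheaper spawning would widen the set of code worth threading.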
Second, support for SMT in particular would facilitate implementation of hardware speculative multithreading, so that in some cases even single-threaded code could have a performance boost from using the hardware infrastructure provided for software-managed multithreading.
The flexibility of resource use and lower-overhead communication make SMT and fine-grained MT relatively attractive for short-duration thread spawning. (The tradeoffs for the degree of coupling among threads/cores are likely complex and workload dependent. Dark silicon principles would tend to encourage providing weaker cores for highly threaded workloads or workloads with low performance demands [like ARM's big.LITTLE], but finer-grained workload variability and communication costs can make such a design less attractive, as might high fixed costs at moderate volume [where a moderate increase in fixed costs that can substantially increase volume might be attractive].)
[snip]
> I'm not saying the resulting chips are bad; I'm just saying that it
> would be really interesting to see what Intel's architects could
> deliver if they made a 4C/4T desktop chip without worrying about
> server workloads.
There are also desktop/laptop compromises (or did you mean "client computer CPU" when you wrote "desktop chip"?) and budget-focused/power-user compromises.
While the complexity budget currently used for multithreading could be spent to improve single-threaded performance, I suspect that Intel is so far into the region of diminishing returns that doing so would not provide much of a performance benefit.
Similarly, decreasing register count by (e.g.) 10% and support for memory-level parallelism by (e.g.) 5% might not improve energy efficiency/performance dramatically. The pressure of energy-efficiency optimizations (even when used for improving performance) may move design choices toward a somewhat common optimum; if so, the benefit of dropping SMT might be unexpectedly low.
For budget-focused CPUs, dropping multithreading would seem to have more substantial benefits since the relative overhead in complexity and hardware resources would be greater. (However, I get the impression that you are asking about high-performance, lowish-thread-count CPUs for performance-sensitive workloads that do not substantially exploit multithreading.) If these budget-focused CPUs had to cover all the fixed costs of developing tools and experience for implementing multithreading, multithreading would be especially unattractive for them. (A similar argument could be made for OoO execution. If in-order is good enough and all the learning curve costs are paid by the low end, then implementing OoO execution may be unattractive even if it could provide a 10% improvement in one's "high-end" two-wide 8-stage "hyperpipelined" cores. As general familiarity with advanced implementation techniques for OoO increases and opportunities arise to expand one's product diversity, such techniques might tend to leak into lower-end products.)
My standard disclaimer: I am not a computer architect, an academic in this field, or even a programmer; so all of the above comments should be taken as hand-wavy (even if, as I think they are, reasonable) arguments, especially since I am an admitted fan of multithreading.