By: rwessel (robertwessel.delete@this.yahoo.com), May 31, 2013 9:02 pm
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 31, 2013 6:59 pm wrote:
> rwessel (robertwessel.delete@this.yahoo.com) on May 31, 2013 3:20 pm wrote:
> > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 31, 2013 2:22 pm wrote:
> > >
> > > - Ah, so it simply gives full priority to a second thread to utilize unused execution units? If so; can
> > > you please explain as to why some applications that ARE indeed multi-threaded, but still don't benefit
> > > from multi-threading? It's weird how I find some game benchmarks to either take very little/none at all/a
> > > performance HIT with hyperthreading enabled, when it's for sure that they use multiple threads...
> > >
> >
> > There remains contention for resources. For example the decoders, branch prediction units, reservation
> > and reorder stations, caches, memory, etc. Consider a simple case where the first thread is creating
> > nearly enough memory references (after caching) that the available memory bandwidth is largely consumed
> > by the first thread. Adding a second thread won't make any more memory bandwidth available. Likewise,
> > a single thread might cache well, but with two threads running, and both threads competing for a
> > fixed size cache, they may both start generating many more cache misses.
>
> Oh-- perfect! I understand all you have said now! Thank you very much!
>
> Just one last little thing; why is this only done for two threads and not a variable amount
> of threads? Was it done for some sort of balancing; that having more than two threads would
> create way too much stress on the resources of one core? So they chose two as a middle ground?
It depends on the expected workloads, and the rest of the CPU's design.
All of Intel's x86 SMT implementations have been two-thread, and I think all of the IPF ones have been two-thread as well, although I could be misremembering Poulson's specs. OTOH, both Sun/Oracle (UltraSPARC T1 with four threads per core, T2 with eight) and IBM (POWER7, with four) have implemented larger numbers.
Supporting higher numbers is not free: each active context must (approximately) duplicate the entire architected state of the processor. And as already discussed, more active threads contend for the (now shared) resources, not least the OOO resources (like the rename registers). But if you have a simple in-order processor that's going to be running a lot of branchy code, and thus spending a ton of time waiting for the memory subsystem, more threads will increase the number of parallel memory accesses possible. So that route (UltraSPARC T1/T2, Intel Silverthorne Atom) is an easy way to increase *throughput*, but leaves you with individually slow processors (so such a CPU will run a web server well, but probably not a compiler). IBM's POWER7 is certainly neither simple nor in-order, but it does spend a lot of time executing very branchy commercial code, and is usually equipped with a very hefty memory subsystem (including relatively massive caches), so SMT makes sense there as well (and POWER7 cores can be configured to run one-, two- or four-way multithreading, depending on the expected workload).
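A toy model can put rough numbers on that throughput argument. This is only a sketch under stated assumptions, not a description of any real core: each thread is assumed to alternate between a fixed burst of compute cycles and a fixed memory stall, the core is assumed to switch to another ready thread at no cost, and contention for caches and memory bandwidth is ignored entirely (which flatters the higher thread counts). The function names and the cycle counts are made up for illustration.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles the pipeline does useful work in the toy model.

    Assumption: every thread repeats `compute_cycles` of work followed by
    `stall_cycles` waiting on memory, and stalls of different threads
    overlap perfectly. Cache and bandwidth contention are NOT modelled.
    """
    if threads < 1:
        raise ValueError("need at least one hardware thread")
    period = compute_cycles + stall_cycles
    # The pipeline cannot be more than 100% busy, hence the cap at 1.0.
    return min(1.0, threads * compute_cycles / period)


def per_thread_rate(threads, compute_cycles, stall_cycles):
    """Share of the core's cycles that each individual thread gets."""
    return core_utilization(threads, compute_cycles, stall_cycles) / threads


if __name__ == "__main__":
    # Hypothetical branchy, memory-bound workload: 20 cycles of work
    # followed by an 80-cycle wait on the memory subsystem.
    for n in (1, 2, 4, 8):
        util = core_utilization(n, 20, 80)
        rate = per_thread_rate(n, 20, 80)
        print(f"{n} thread(s): utilization {util:.2f}, per-thread rate {rate:.3f}")
```

With those made-up numbers, utilization goes 0.2, 0.4, 0.8, 1.0 for one, two, four and eight threads, so total throughput keeps climbing until the pipeline saturates. At eight threads each thread only gets 0.125 of the core, against 0.2 when running alone, which is the throughput-versus-single-thread-speed trade described above. A workload that stalls much less (say 80 compute cycles and 20 stall cycles) saturates the pipeline at two threads in this model, so extra contexts beyond that would buy nothing.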
There is no doubt Intel could produce an x86 core with four-way SMT, but I expect that there would be relatively few workloads where it would have a positive impact. To make it generally useful, they'd probably need to add a POWER7-scale cache and memory subsystem to the device, and that's not going to be within Intel's usual power and cost goals for chips. IBM, OTOH, can put four 300W+ CPU dies with 1K+ I/Os on an MCM, largely because they're building the whole system.