By: rwessel (robertwessel.delete@this.yahoo.com), June 2, 2013 11:54 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on June 1, 2013 10:11 am wrote:
> rwessel (robertwessel.delete@this.yahoo.com) on May 31, 2013 9:02 pm wrote:
> > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 31, 2013 6:59 pm wrote:
> [snip]
> >> Just one last little thing; why is this only done for two threads and not a variable amount
> >> of threads? Was it done for some sort of balancing; that having more than two threads would
> >> create way too much stress on the resources of one core? So they chose two as a middle ground?
> >
> >
> > It depends on the expected workloads, and the rest of the CPU's design.
> >
> > All of Intel's x86 SMT implementations have been two thread, and I think all of the IPF ones have
> > been two thread as well, although I could be misremembering Poulson's specs. OTOH, both Sun/Oracle
> > (UltraSPARC T1 with four, T2 with eight) and IBM (POWER7 - four) have implemented larger numbers.
>
> Just to quibble a bit, Itanium uses Switch-on-Event-MultiThreading and UltraSPARC T1 used Fine-Grained
> MultiThreading (IIRC). Both T2 and POWER7 (in SMT4 mode) "cheat" a bit in exploiting clustering/partitioning;
> the full resources of the core are not available to a thread.
>
> > Supporting higher numbers is not free, each active context must (approximately) duplicate the entire
> > architected state of the processor.
>
> This is part of what makes POWER7's SMT4 mode kind of neat (Boasting: I thought of this technique independently
> [just over 3 years ago, though presumably several years later than IBM did].); it exploits the register
> file duplication used to reduce the number of read ports in order to support a doubling of the number of threads
> without having to double the number of register file entries. Halving the potential (but not actual) instruction-level
> parallelism to accomplish this can be an acceptable penalty for low ILP workloads.
Physically splitting the register file is not that new an idea (some of the later Alphas did that, although not in an MT context, and different types of registers have often been physically segregated), but it does play well in POWER's SMT, particularly in the 1/2/4 thread configurations allowed: one thread gets to use only one register file's worth of registers, but has full access to the remainder of the chip's resources; two threads each get a complete copy of the register file, but have to share the rest of the chip; and four have to share the RF too.
In many ways, it's an easy tradeoff if you can figure out some way to partition the register file (and by-thread is particularly simple). RF size is roughly proportional to the square of the number of ports (and obviously to the number of registers). Assuming you thought N registers and P ports was a good match for a single thread, you could implement a single 2N/2P register file for a 2T machine, or two 2N/1P register files, which would still give a single thread access to 2N physical registers, would probably be faster, and yet would only take half the area of the single 2N/2P design.
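To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes nothing beyond the proportionality above (area ~ registers * ports^2) and ignores everything else that matters in a real design (wiring, bypass networks, keeping the two copies coherent). The function name and the values of N and P are made up for illustration and are not taken from any actual core.

```python
def rf_area(registers: int, ports: int) -> int:
    """Relative register file area under the rough model
    area ~ registers * ports**2 (an assumption, not a measured figure)."""
    return registers * ports ** 2


# Hypothetical single-thread sweet spot: N registers, P ports.
N, P = 64, 8

# Option 1: one monolithic file with 2N registers and 2P ports.
monolithic_area = rf_area(2 * N, 2 * P)

# Option 2: two files, each with 2N registers but only P ports.
split_area = 2 * rf_area(2 * N, P)

print("monolithic 2N/2P:", monolithic_area)
print("two 2N/1P files: ", split_area)
print("split / monolithic:", split_area / monolithic_area)
```

Under this model the ratio is one half whatever N and P you pick, since N * P^2 cancels: 2 * (2N * P^2) = 4NP^2 against 2N * (2P)^2 = 8NP^2.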