By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), June 3, 2013 3:53 am
Room: Moderated Discussions
rwessel (robertwessel.delete@this.yahoo.com) on June 2, 2013 11:54 pm wrote:
[snip]
> Physically splitting the register file is not that new an idea (some of the later Alphas did
> that, although not in an MT context, and different types of registers have often been physically
> segregated), but does play well in POWER's SMT, particularly in the 1/2/4 thread configurations
> allowed (one thread gets to use only one register file's worth of registers, but has full access
> to the remainder of the chip's resources, two threads each get a complete copy of the register
> file, but have to share the rest of the chip, 4 have to share the RF too).
Yes, the comp.arch post had the subject "SMT exploiting 21264-like clustering?". I did not at all claim that register file clustering was an independent invention of my own, but using such to reduce the resource requirements for SMT was an independent invention. (I even mentioned in the comp.arch post that such "might not be advisable under two simultaneous threads usually" but with four threads it "might be a net gain in many cases". [I also mentioned the possibility of supporting SIMD operations; such might be useful in facilitating register file sharing with an ISA where SIMD register size is twice that of GPRs.])
(I am somewhat proud of my significant ability to think of alternate uses for [and incremental improvements of] established ideas; but I do not seem to have the kind of creativity that generates truly novel ideas. This is probably part of my micro-optimizing mindset.)
> In many ways, it's an easy tradeoff if you can figure some way to partition the register file (and
> by-thread is particularly simple). RF size is roughly proportional to square of the number of ports
> (and obviously the number of registers). Assuming you thought N registers and P ports was a good
> match for a single thread, you could implement a 2N and 2P register file for a 2T machine, or two
> 2N/1P register files, which would still give a single thread access to 2N physical registers, would
> probably be faster, and yet would only take half the area of the single 2N/2P design.
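(As a rough check on the quoted arithmetic, here is a minimal sketch in Python. It assumes only the rule of thumb stated above, area proportional to registers times ports squared. The function name `rf_area` and the values chosen for N and P are hypothetical, and the model ignores the extra write ports that replicated copies would need to stay coherent, as in the 21264.)

```python
def rf_area(registers: int, ports: int) -> int:
    """Relative area under the quoted rule of thumb: registers * ports**2."""
    return registers * ports ** 2


# Hypothetical single-thread design point (not taken from any real chip).
N, P = 64, 8

baseline = rf_area(N, P)            # one thread: N registers, P ports
monolithic = rf_area(2 * N, 2 * P)  # one array: 2N registers, 2P ports
clustered = 2 * rf_area(2 * N, P)   # two arrays: 2N registers, P ports each

print("monolithic / baseline:", monolithic / baseline)
print("clustered / baseline:", clustered / baseline)
print("clustered / monolithic:", clustered / monolithic)
```

(Under these assumptions the two-array design comes out at half the area of the single 2N/2P array, which matches the figure in the quote; the ratio does not depend on the particular N and P.)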
Sadly, partitioning within a single thread seems to have significant issues (at least without ISA support). (Using such for checkpointing [deallocating speculatively dead values from one register file cluster] might be more practical than general use. One would only need to deal with partition constraints on a failure of speculation.) Even banking (which avoids routing restrictions present in partitioning) has some issues in utilization, and increasing the number of banks to avoid conflicts increases the routing complexity. (It seems plausible that the same division into separate physical arrays used in caches [for layout flexibility, latency, energy-efficiency, or pipelining support?] might become attractive for register files. Once such physical separation exists, it might be very tempting to exploit it by using banking.)
(Even more challenging--and more interesting--is cache partitioning.)
The exploiting of Sun's 3D register file idea for SoEMT (as in Itanium) is also kind of neat. (Sun proposed the sharing of ports by multiple entries where, at a given time, only one entry in the group could use the ports. Such was intended for SPARC's register windows where blocks of registers change visibility, but it has obvious application to SoEMT/FGMT [and to the use of shadow register sets, usually for fast interrupts, which is nearly a form of SoEMT--incidentally, MIPS MT ASE maps Thread Contexts onto Shadow Register Sets].) This might be considered temporal partitioning or banking.
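(To make the port-sharing idea concrete, here is a toy behavioral sketch in Python. It is only an illustration of the constraint described above, that one entry per group is reachable through the shared ports at a given time; it is not a description of Sun's or Intel's actual circuits. The class name `PlanedRegisterFile` and its methods are hypothetical.)

```python
class PlanedRegisterFile:
    """Toy model: several planes of storage behind one shared set of ports.

    Each architectural register index has one cell per plane, but only the
    cell in the currently active plane can be read or written.
    """

    def __init__(self, planes: int, regs_per_plane: int) -> None:
        self._cells = [[0] * regs_per_plane for _ in range(planes)]
        self.active = 0  # plane currently connected to the ports

    def switch(self, plane: int) -> None:
        """Select another plane, e.g. on a thread switch or window change."""
        if not 0 <= plane < len(self._cells):
            raise ValueError("no such plane")
        self.active = plane

    def read(self, reg: int) -> int:
        return self._cells[self.active][reg]

    def write(self, reg: int, value: int) -> None:
        self._cells[self.active][reg] = value


# Two thread contexts sharing one set of ports, SoEMT-style.
rf = PlanedRegisterFile(planes=2, regs_per_plane=32)
rf.write(3, 111)        # thread 0 writes r3
rf.switch(1)            # switch on event to thread 1
rf.write(3, 222)        # thread 1 writes its own r3
thread1_r3 = rf.read(3)
rf.switch(0)            # switch back
thread0_r3 = rf.read(3)
print(thread0_r3, thread1_r3)
```

(The point of the sketch is that the number of ports does not grow with the number of planes; the cost of adding a context is storage cells plus a plane select, at the price of only one context being accessible at a time.)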