By: Patrick Chase (patrickjchase.delete@this.gmail.com), May 19, 2013 4:19 pm
Room: Moderated Discussions
Exophase (exophase.delete@this.gmail.com) on May 18, 2013 10:33 am wrote:
> On TI's C6x DSPs it's hard to write high performance code that doesn't use registers
> under the shadows of their execution due to having such heavily pipelining and wide
> execution units (and fully single cycle throughput with no interlocks).
IMO the issue with C6x is that it's register-poor as VLIWs go. TI committed the same basic architectural blunder as ARM: they spent instruction encoding bits on predication that might have been better used to enable a larger set of architectural registers.
For those who aren't familiar with the architecture, C6x consists of a pair of 4-wide VLIW "clusters", each with its own register file (RF). If memory serves, one functional unit in each cluster can access the other cluster's RF, while the other 3 can only access the local RF. The pipeline is up to 17 stages deep, with a load->use delay from L1 of 4-5 clocks (depending on which specific iteration we're talking about). Non-multiply integer instructions can be issued back-to-back, while FP ops have ~4 clocks of latency, again if memory serves. The first iterations of the uarch had 16 GPRs per cluster, 32 total. TI subsequently increased that to 32 GPRs per cluster, 64 total. The architecture is heavily predicated.
If you look at it on a per-cluster basis, 16 or 32 GPRs shared by 4 units works out to 4-8 GPRs per pipeline. You are therefore absolutely correct: if you didn't use registers "under" the load/FP/branch delay shadows, then you would very quickly run out of GPRs and be forced to stall.
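To put rough numbers on that, here is a back-of-envelope sketch in Python. The register counts and latencies are the ones quoted above; the per-cycle instruction mix (`MIX`) and the function names are purely hypothetical, chosen only to illustrate the arithmetic. It assumes each unit issues one op per clock, so the number of results in flight on a unit equals that op's latency.

```python
# Back-of-envelope register budget for one C6x cluster, using the
# figures quoted in this post (recalled from memory, not from a datasheet).
UNITS_PER_CLUSTER = 4


def gprs_per_unit(gprs_per_cluster, units=UNITS_PER_CLUSTER):
    """Architectural registers available per functional unit (pipeline)."""
    return gprs_per_cluster / units


def held_destinations(latencies):
    """Registers tied up if every unit issues one op per clock and each
    destination register is left idle for that op's full latency
    (i.e. nothing is kept "under the shadow")."""
    return sum(latencies)


# ASSUMPTION: a hypothetical steady-state mix per cluster of one load
# (5 clocks), one FP op (4 clocks) and two simple integer ops (1 clock each).
MIX = [5, 4, 1, 1]

for gprs in (16, 32):
    held = held_destinations(MIX)
    print(f"{gprs} GPRs/cluster: {gprs_per_unit(gprs):.0f} per unit, "
          f"{held} held as idle destinations, {gprs - held} left for live values")
```

With that assumed mix, 11 of the 16 registers in an early-generation cluster would be reserved for results that haven't landed yet, which is why reusing them under the delay shadow is hard to avoid.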
Architects other than TI have addressed the same issue by providing more GPRs at the expense of predication. A couple examples:
- ST-2xx (architected by Josh Fisher himself) has 64 GPRs for 4 functional units. Predication/speculation is limited to dismissible loads and conditional moves.
- Qualcomm Hexagon (at least the versions I'm familiar with) has 32 GPRs for 4 units, but with a fairly short pipeline (load->use delay of 2-3 clocks if memory serves). It saves instruction encoding bits by using partial predication, i.e. some instructions have predicated forms while others don't.
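For a side-by-side view, the figures above can be tabulated with a few lines of Python. This is only a sketch: the numbers are the ones recalled in this post, the upper end of each load->use range is used, and "GPRs per unit per load clock" is a crude metric made up here for illustration, not something any of these vendors publish.

```python
# (name, GPRs shared by a group of units, units in the group, load->use clocks)
# None means the post does not give a load->use delay for that design.
DESIGNS = [
    ("C6x cluster (early)", 16, 4, 5),
    ("C6x cluster (later)", 32, 4, 5),
    ("ST-2xx", 64, 4, None),
    ("Hexagon", 32, 4, 3),
]


def per_unit(gprs, units):
    """Architectural registers per functional unit."""
    return gprs / units


def per_unit_per_load_clock(gprs, units, load_use):
    """Registers per unit, scaled by load->use delay; None if delay is unknown."""
    if load_use is None:
        return None
    return gprs / units / load_use


for name, gprs, units, load_use in DESIGNS:
    ratio = per_unit_per_load_clock(gprs, units, load_use)
    ratio_text = "n/a" if ratio is None else f"{ratio:.2f}"
    print(f"{name:20s} {per_unit(gprs, units):5.1f} GPRs/unit   "
          f"{ratio_text} GPRs/unit per load clock")
```

The point of the second column is simply that Hexagon's 8 GPRs per unit go further than the later C6x's 8 per unit, because fewer of them are tied up waiting on loads.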