By: Brett (ggtgp.delete@this.yahoo.com), August 15, 2021 11:38 pm
Room: Moderated Discussions
See the Texas Instruments C64, which has two banks, enabling wider execution without excessive porting requirements.
Because all the banks are separate, each with its own rename and porting, each bank can run 6 wide, giving a huge leap in top performance.
My approach has 5 banks: primary, secondary, and three more that are mapped from the old vector register file.
Three or four bits in the instruction pick which collection of banks the instruction applies to: primary, secondary, both, all but primary, etc.
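Five banks have 2^5 = 32 possible subsets, so a 3 or 4 bit field can only name 8 or 16 of them. One way to read the selector is as an index into a table of predefined bank sets. The sketch below assumes that reading; the specific codes, the 1-based bank numbering, and the names `BANK_SETS` and `decode_banks` are all made up for illustration and are not part of the proposal.

```python
# Hypothetical 3-bit bank-select table. The post does not define an encoding;
# these codes are assumptions chosen only to illustrate the idea.
BANK_SETS = {
    0b000: (1,),             # primary
    0b001: (2,),             # secondary
    0b010: (1, 2),           # both primary and secondary
    0b011: (2, 3, 4, 5),     # all but primary
    0b100: (1, 2, 3, 4, 5),  # all banks
}


def decode_banks(select_bits):
    """Return the 1-based bank numbers that an instruction applies to."""
    if select_bits not in BANK_SETS:
        raise ValueError(f"unassigned bank-select code: {select_bits:#05b}")
    return BANK_SETS[select_bits]
```

With a table like this, the remaining codes are free for whatever other bank combinations turn out to be useful.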
Each instruction encodes which banks it targets and whether a splat is involved; most cross-bank communication goes through splats. Many loads will be load-pair splats, so all addressing can take place in the first two banks.
Add.12 r2,r17,r22 ; first two banks run the same instruction.
Add.1s2 r2,r17,r22 ; add in bank 1 but also splat the result to bank 2.
Add.1sA r2,r17,r22 ; add bank 1 but also splat result to all banks.
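To make the three examples concrete, here is a minimal Python model of the semantics as I read them: each bank has its own copy of the registers, a plain multi-bank add runs independently in every selected bank, and a splat copies the result from the executing bank into the same destination register in the target banks. The register count, the function names, and the assumption that a splat writes the same register number in the other banks are mine, not something the proposal spells out.

```python
NUM_BANKS = 5   # bank 1 = primary, bank 2 = secondary, banks 3-5 = former vector file
NUM_REGS = 32   # assumed number of registers per bank


def make_banks():
    """Create five independent register banks, all zeroed."""
    return [[0] * NUM_REGS for _ in range(NUM_BANKS)]


def add(banks, exec_banks, rd, ra, rb, splat_to=()):
    """Model rd = ra + rb in each bank listed in exec_banks (1-based).

    If splat_to is non-empty, the result computed in the first executing
    bank is also written to rd in each bank listed in splat_to.
    """
    # Each bank reads only its own source registers.
    results = {b: banks[b - 1][ra] + banks[b - 1][rb] for b in exec_banks}
    for b, value in results.items():
        banks[b - 1][rd] = value
    if splat_to:
        source_value = results[exec_banks[0]]
        for b in splat_to:
            banks[b - 1][rd] = source_value


banks = make_banks()
banks[0][17], banks[0][22] = 3, 4      # bank 1 inputs
banks[1][17], banks[1][22] = 10, 20    # bank 2 inputs

add(banks, [1, 2], 2, 17, 22)                        # Add.12
add(banks, [1], 2, 17, 22, splat_to=[2])             # Add.1s2
add(banks, [1], 2, 17, 22, splat_to=[2, 3, 4, 5])    # Add.1sA
```

After `Add.12`, r2 holds a different sum in each of the two banks; after either splat form, the target banks all hold bank 1's sum.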
This instruction set enables easy software unrolling of loops across the banks using just the bank-tagging bits in the instructions. Typically the primary bank just does load/stores and the loop count, while the four other banks do a four-way unroll on code that is otherwise resistant to vectorization. With 5 banks at 6 wide each, that is 30-wide execution instead of 6.
For backward compatibility with x86 or RISC, just add a decoder for that architecture.
Thoughts and criticism?