By: Heikki Kultala (heikki.kultal.a.delete@this.gmail.com), March 22, 2021 6:52 pm
Room: Moderated Discussions
The lowest level of the hierarchy is the single instruction, which reads data from an internal bus/pipeline register plus one register or immediate parameter, and produces its value into the same internal bus/pipeline register it read from.
The second level of the hierarchy is a serial bypass-bundle of something like 0-4 instructions. The bundle header contains one register read, the number of instructions in the bundle, and the target register index where the value from the (last instruction of the) bundle is written. A 0-instruction bundle is just a move. No interrupts/exceptions or side effects may happen inside a bundle, only after it/between bundles, so the whole bundle effectively executes atomically.
OoOE can be done freely between the bundles; theoretically, bundles could also be split dynamically by the HW.
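A minimal sketch of the bundle semantics described above, in Python (the encoding and the `run_bundle` helper are my own illustration, not part of the proposal): one register read at the head, an internal latch value threaded through the instructions, and a single architectural write at the end.

```python
# Hypothetical model of a serial bypass-bundle: a bundle reads one
# source register, threads an internal latch value through 0-4
# operations, and writes the final latch value to one target register.
# A 0-operation bundle degenerates into a plain move.

def run_bundle(regs, src, ops, dst):
    """Execute one bundle atomically: no architectural state is
    visible until the single register write at the end."""
    latch = regs[src]              # the bundle's single register read
    for op in ops:                 # each op reads/writes the internal latch
        latch = op(latch)
    regs[dst] = latch              # the bundle's single register write
    return regs

regs = {0: 5, 1: 0}
# bundle of two instructions: read r0, add 3, double, write r1
run_bundle(regs, src=0, ops=[lambda v: v + 3, lambda v: v * 2], dst=1)
assert regs[1] == 16
# 0-instruction bundle: just a move r0 -> r1
run_bundle(regs, src=0, ops=[], dst=1)
assert regs[1] == 5
```

This simplifies the per-instruction operand (each real instruction may also take its own register or immediate, per the level-1 description); only the one-read/one-write bundle boundary is what matters here.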
The next level of the hierarchy is the basic block. The jump does not have to be the last instruction of the basic block: for example, the header of a loop body's basic block may say that the block contains 4 bypass bundles, and the branch can already be in the first one, but because the size of the basic block is 4 bundles, all the instructions in all 4 of those bundles still get executed. The basic block header may also contain a loop count (either an immediate or the index of a register containing it). In that case no branch instructions are needed inside the loop, and the basic block is still executed N times.
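The loop-count feature can be sketched as follows (the header format and `run_basic_block` helper are hypothetical): the hardware replays the whole block the stated number of times, with no branch instruction anywhere in the loop body.

```python
# Hypothetical model of a basic-block header with a trip count:
# the block is always executed whole, trip_count times.

def run_basic_block(bundles, state, trip_count=1):
    """bundles: the block's bypass-bundles, each modeled as a function
    from state to state; the HW replays the whole block N times."""
    for _ in range(trip_count):
        for bundle in bundles:
            state = bundle(state)
    return state

# loop body of two bundles, acc = (acc + 1) * 2, replayed 3 times
result = run_basic_block([lambda acc: acc + 1, lambda acc: acc * 2],
                         state=0, trip_count=3)
assert result == 14   # ((0+1)*2 + 1)*2 + 1)*2 steps: 2, 6, 14
```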
The next level of the hierarchy is microthreading, which is done fully in hardware, with no SW overhead for starting threads beyond a single instruction for fork and a single instruction for join. All the microthreads share the same virtual memory mapping. There is just one instruction which starts another microthread, returning an odd value to one thread and an even value to the other, and a couple of different instructions for joining threads. Some of these join instructions also perform a reduce operation, for example min, max, or add, on the single-register return values from both microthreads.
The implementation is always free to execute the microthreads sequentially (the common case when all the hardware microthreads are already in use, for example ones started by an outer-level function); the programmer can write, and the compiler can compile, the code as if an infinite number of microthreads were available. Because the bundles execute atomically, different microthreads can still do things like increment the same counter in memory, but since they are allowed to execute sequentially, they are not allowed to wait for data from another microthread, because that might cause a deadlock.
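A sketch of the fork/join model under the sequential-execution rule (the names `ufork_join` and the odd/even tags are my illustration of the semantics, not a proposed mnemonic): a fork runs both continuations with identical register state, one seeing an odd value and the other an even one, and a reducing join combines their single-register results. Running the two sides one after the other is always a legal schedule, which is exactly why neither side may wait on the other.

```python
# Hypothetical model of a fork with a reducing join, executed
# sequentially (the always-legal fallback schedule).

def ufork_join(body, reduce_op):
    """Run the two microthreads one after the other and reduce their
    single-register return values."""
    a = body(1)          # one thread sees an odd value...
    b = body(2)          # ...the other an even value
    return reduce_op(a, b)

# each microthread squares its half of the data; join adds the results
data = [3, 4]
result = ufork_join(lambda tag: data[tag % 2] ** 2,
                    reduce_op=lambda x, y: x + y)
assert result == 25   # 4**2 + 3**2
```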
This kind of microthreading should be VERY fast for creating new threads and joining existing ones, as it can be done fully in hardware: if there is a free HW microthread available, the ufork instruction just sets its PC and RAT (both threads get the same register contents) and activates it; if there is none, the start address and GPR contents of the microthread are put into a (HW) queue and no new thread is started yet. When some thread exits, a new one is taken from the queue and executed.
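That scheduling policy can be sketched as follows (the class and its fields are hypothetical data structures standing in for hardware state): ufork activates a free hardware context if one exists, otherwise it queues the (PC, GPR snapshot) pair; when a thread exits, the next queued microthread takes over its context.

```python
from collections import deque

# Hypothetical model of the HW microthread scheduler described above.
class MicrothreadSched:
    def __init__(self, hw_contexts):
        self.free = hw_contexts          # free HW microthread contexts
        self.pending = deque()           # queued (pc, gprs) waiting to start
        self.running = []

    def ufork(self, pc, gprs):
        if self.free > 0:                # free context: activate immediately
            self.free -= 1
            self.running.append((pc, dict(gprs)))
        else:                            # none free: queue, start nothing yet
            self.pending.append((pc, dict(gprs)))

    def on_exit(self, thread):
        self.running.remove(thread)
        if self.pending:                 # hand the freed context to a waiter
            self.running.append(self.pending.popleft())
        else:
            self.free += 1

sched = MicrothreadSched(hw_contexts=1)
sched.ufork(pc=0x100, gprs={"r0": 1})   # starts on the only free context
sched.ufork(pc=0x200, gprs={"r0": 2})   # no context free: goes to the queue
assert len(sched.running) == 1 and len(sched.pending) == 1
sched.on_exit(sched.running[0])         # exit wakes the queued microthread
assert sched.running[0][0] == 0x200 and not sched.pending
```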
The HW is even free to start executing the microthreads in a SIMT fashion.
As there is no limit on how many waiting microthreads there may be, there is a hardware-based stack which can spill the PC and GPRs of a waiting microthread into a special protected memory location allocated for it by the OS. The only thing the OS does is allocate space for the stack. The same protected stack might also hold function return addresses so that those cannot be overwritten by normal load/store operations.
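The overflow behaviour might look like this (all names hypothetical): the waiting-thread queue is backed by a small on-chip buffer, and when it fills, further entries go to the OS-allocated protected region, which refills the on-chip buffer as entries are consumed.

```python
# Hypothetical model of the HW queue with spill to protected memory.
class SpillingQueue:
    def __init__(self, hw_capacity):
        self.hw_capacity = hw_capacity
        self.hw = []          # fast on-chip slots
        self.spill = []       # protected region, allocated by the OS

    def push(self, entry):
        if len(self.hw) < self.hw_capacity:
            self.hw.append(entry)
        else:                 # on-chip buffer full: spill to memory
            self.spill.append(entry)

    def pop(self):
        entry = self.hw.pop(0)
        if self.spill:        # refill the on-chip buffer from the spill area
            self.hw.append(self.spill.pop(0))
        return entry

q = SpillingQueue(hw_capacity=2)
for pc in (0x10, 0x20, 0x30):
    q.push((pc, {}))          # third entry overflows into the spill area
assert len(q.hw) == 2 and len(q.spill) == 1
assert q.pop()[0] == 0x10 and not q.spill
```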