By: Anon (no.delete@this.spam.com), March 23, 2021 8:39 am
Room: Moderated Discussions
Heikki Kultala (heikki.kultal.a.delete@this.gmail.com) on March 22, 2021 6:52 pm wrote:
> The lowest level of the hierarchy is a single instruction, which reads data from
> an internal bus/pipeline register plus one register or immediate parameter, and produces
> a value in the same internal bus/pipeline register as the one it read.
>
> The second level of the hierarchy is a serial bypass-bundle of something like 0-4 instructions. The bundle
> header contains one register read, the number of instructions in the bundle, and the target register
> index where the value from the (last instruction of the) bundle is written. A 0-instruction
> bundle is just a move. No interrupts/exceptions or side effects may happen inside a bundle,
> only between bundles. The whole bundle practically executes atomically.
>
> OoOE can be done freely between the bundles; theoretically,
> bundles could also be split by the HW dynamically.
>
> The next level of the hierarchy is a basic block. The jump does not have to be the last instruction of the basic
> block; for example, the header for the basic block of a loop body may say that the basic block contains
> 4 bypass bundles, and the branch can then already be in the first one, but as the size of the basic block
> is 4 bundles, all the instructions in all of those 4 bundles get executed. The basic block header may
> also contain a loop count (either an immediate or the index of a register containing it). In that case no branch
> instructions are needed inside the loop, and the basic block is still executed N times.
My mental model of an ideal ISA is very similar to what you describe.
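To make the loop-count idea concrete, here is a small C model of it (bb_header, exec_basic_block, and the field names are my invention): the trip count sits in the header, every bundle of the block runs on every trip, and the block body needs no branch instruction at all:

```c
/* Hypothetical basic-block header carrying an immediate loop count. */
typedef struct {
    int n_bundles;    /* block length: all bundles always execute  */
    int loop_count;   /* trip count from the header; 1 = plain block */
} bb_header;

/* Run the whole block loop_count times with no branch in the body. */
static void exec_basic_block(bb_header h, void (*bundle[])(void))
{
    for (int trip = 0; trip < h.loop_count; trip++)
        for (int i = 0; i < h.n_bundles; i++)
            bundle[i]();   /* every bundle of the block runs each trip */
}
```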
> The next level of the hierarchy is microthreading, which is done fully in hardware, with no SW overhead
> for starting threads beyond a single instruction for fork and a single instruction for join. All the microthreads
> share the same virtual memory mapping, and there is just one instruction which starts another microthread,
> returning an odd value to one thread and an even value to the other, and a couple of different instructions
> to join threads. Some of these thread join instructions also perform a reduce operation on the single-register
> return value data from both microthreads, for example min, max, or add.
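Since the scheme explicitly permits sequential execution (next paragraph of the quote), the fork/join-with-reduce semantics can be modeled in plain C by simply calling the body twice, once with an odd fork result and once with an even one, and adding the two returns. ufork_join_add, sum_half, and the range layout are all hypothetical names for illustration:

```c
#include <stdint.h>

typedef int64_t (*uthread_fn)(int64_t fork_result, void *arg);

/* Sequential model of ufork + a join that reduces with add: both
   "threads" see the same state; one observes an odd fork result,
   the other an even one. */
static int64_t ufork_join_add(uthread_fn body, void *arg)
{
    int64_t a = body(1, arg);   /* odd  -> one microthread's path */
    int64_t b = body(2, arg);   /* even -> the other's path       */
    return a + b;               /* the join's reduce operation    */
}

struct range { const int64_t *v; int lo, hi; };

/* Each microthread sums half the range, chosen by the fork result. */
static int64_t sum_half(int64_t fork_result, void *p)
{
    struct range *r = p;
    int mid = (r->lo + r->hi) / 2;
    int lo = (fork_result & 1) ? r->lo : mid;
    int hi = (fork_result & 1) ? mid   : r->hi;
    int64_t s = 0;
    for (int i = lo; i < hi; i++) s += r->v[i];
    return s;
}
```

A real implementation could run the two body calls on separate HW microthreads; the sequential version above is one of the executions the ISA allows, which is exactly why waiting on a sibling microthread has to be forbidden.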
>
> The implementation is always free to execute the microthreads sequentially (the common case if all our
> hardware microthreads are already in use, for example started by an outer-level function); the programmer
> can write his code, and the compiler can compile it, as if an infinite number of microthreads were
> available. As the bundles execute atomically, different microthreads can still do things like increment
> the same counter in memory, but because they are allowed to execute sequentially, they are not allowed
> to wait for data from another microthread, as that might cause a deadlock.
>
> This kind of microthreading should be VERY fast for creating new threads and joining existing ones, as
> it can be done fully in hardware: if there is a free HW microthread available, the ufork instruction just
> sets the PC and RAT for it (both threads get the same register contents) and activates it; if there is
> none, the starting instruction address and the GPR contents of the microthread are put into a (HW) queue and
> no new thread is started yet. When some thread exits, a new one is taken from the queue and executed.
>
> The HW is even free to start executing the microthreads in a SIMT way.
>
> As there is no limit on how many waiting microthreads there may be, there is a hardware-based stack which can spill
> the PC and GPRs of a waiting microthread into a special protected memory location allocated for it
> by the OS. The only thing the OS does is allocate space for the stack. The same protected stack might also
> handle function return addresses so that those cannot be overwritten by normal load/store operations.
Here, instead of a full hardware stack/queue, I was imagining that the fork instruction would specify a memory location for the new microthread's context data, including a field for a pointer to the next microthread in the queue (or the previous one, if microthreads are FILO). The hardware itself then only has to provide two registers to control this: the first and last items of the queue.
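A C sketch of that scheme, under my own assumptions about the context layout: each waiting microthread's context lives in memory and is linked through a next field, so the "hardware" queue state reduces to two registers, modeled here as q_head and q_tail:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory microthread context; the link field is part
   of the context itself, as proposed above. */
typedef struct uctx {
    uint64_t pc;         /* starting instruction address    */
    uint64_t gpr[32];    /* GPR contents at fork time       */
    struct uctx *next;   /* link to the next queued context */
} uctx;

static uctx *q_head, *q_tail;   /* the two hardware queue registers */

/* Fork with no free HW microthread: append the context to the queue. */
static void enqueue(uctx *c)
{
    c->next = NULL;
    if (q_tail) q_tail->next = c; else q_head = c;
    q_tail = c;
}

/* A HW microthread freed up: take the next waiting context, if any. */
static uctx *dequeue(void)
{
    uctx *c = q_head;
    if (c) { q_head = c->next; if (!q_head) q_tail = NULL; }
    return c;
}
```

This version is FIFO; the FILO variant you mention would instead push and pop at q_head through a previous-style link, needing only one register.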