By: Victor Alander (Victor.alander.delete@this.gmail.com), March 21, 2021 5:14 pm
Room: Moderated Discussions
Mark Roulo (nothanks.delete@this.xxx.com) on March 21, 2021 9:47 am wrote:
> I wasn't clear.
>
> I don't want the ability to run threads/tasks in parallel added to the ISA/IR. I know how to do that :-)
>
> I want the ability to run zero-overhead independent threads in parallel added to the ISA. Think
> of something more like CUDA/GPU blocks. Each block is supposed to be independent and the underlying
> hardware is permitted to run as many blocks in parallel as it has resources for. In the limit (though
> I don't think this has ever happened) the blocks would run sequentially. More expensive GPUs can
> run over 100 blocks in parallel. The overhead to launch these is small (zero?).
>
> The problem with OS threads/tasks is that they are so heavyweight that many
> cases where you COULD logically parallelize code (or, really, just NOT serialize
> it) aren't worth doing because the overhead kills all the gains.
>
> There is a chicken-and-egg p problem here, of course.
> *) The languages need to support this, and
> *) The hardware needs to support this too (or no one will use the feature)
>
> For lots of code I see this would be quite nice. But I suspect my code may be pretty
> niche and I don't know how often this would be useful in more general purpose code.
I have been toying with the idea of a maximally parallel instruction set. The idea is that each instruction has a buffer for its parameters and a count of how many bytes remain to be written before the instruction has all its inputs and can run. When an instruction finishes, it has a list of relative addresses to copy each of its output parameters to, and each time it copies a byte it decrements the receiving instruction's parameters_remaining count by one. An instruction is queued to run on its relevant function unit when its parameters_remaining count reaches zero.
(size), parameters_remaining, instruction, [parameters], [[copy output 1 to +100, +29],[copy output 2 to nowhere],...,[copy output n to +256]]
size could be inferred from the instruction, but it might be good to have it in the asm for fast lookahead.
parameters_remaining is an atomic count of how many bytes remain to be written to the buffer. It can be lower than the number of inputs an instruction has if some inputs are constants.
instruction has the same size as a pointer and may be the memory-mapped location of a function unit or the entry point of a function block implementing the same functionality in code. The instruction may be replaced with an implementation-specific pointer when the program is loaded (the ISA is like Unicode, where local representations may map the code points however they wish).
[parameters] is a fixed-size array for an instruction's input and output parameters. It may only be written twice: once when parameters are set by preceding instructions, and a second time when the array is overwritten by a copy from the returning function unit.
[[outputs]] is a fixed-size, compile-time-constant 2D array specifying relative copies for each byte of the output parameters; each byte may be sent to multiple destinations. It may not be modified during execution. That each move targets a valid location in an instruction's parameter list in the same function unit can be verified at load. (parameters_remaining can be verified at the same time, so that there are always exactly as many writes to [parameters] as parameters_remaining specifies.)
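That load-time check can be made concrete. Here is a hypothetical Python sketch, with instructions modeled as plain dicts; the field names follow the layout above, and everything else (function names, the tuple shape of the copy entries) is an assumption for illustration:

```python
# Hypothetical load-time verifier for the invariants described above:
# every output copy must land inside some instruction's [parameters]
# buffer in the same function block, and each instruction's declared
# parameters_remaining must equal the number of writes it will receive.

def verify_block(instrs):
    # instrs: list of dicts with keys "params_len", "params_remaining",
    # and "outputs" (a list of (dest_index, dest_offset) relative copies).
    incoming = [0] * len(instrs)
    for i, ins in enumerate(instrs):
        for dest, off in ins["outputs"]:
            target = i + dest  # relative addressing, as in the layout
            if not (0 <= target < len(instrs)):
                return False  # copy escapes the function block
            if not (0 <= off < instrs[target]["params_len"]):
                return False  # copy lands outside the parameter buffer
            incoming[target] += 1
    # The writes promised by the graph must match the declared counts.
    return all(ins["params_remaining"] == n
               for ins, n in zip(instrs, incoming))

good = [
    {"params_len": 2, "params_remaining": 0, "outputs": [(1, 0)]},
    {"params_len": 2, "params_remaining": 1, "outputs": []},
]
print(verify_block(good))  # -> True
```

Since every copy is relative and checked against the block's own bounds, the whole table can be verified once at load with no runtime cost.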
When a function block is entered, a memory block is allocated for it and its instructions are copied over, either in full or only their constant parameters (with a reference back to the code). The first case is preferred for code locality, while the second offers better security by separating instructions and data. The function block is deallocated (or reused if it is recursive) when its return statement's parameters_remaining reaches zero.
Function calls/instructions and returns/jumps are the only parts that ever handle absolute pointers to asm. Code may handle pointers to data.
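The firing rule described above (decrement a counter per delivered value, enqueue at zero) can be sketched as a small software simulation. This is a hypothetical Python sketch: for brevity the counter tracks whole parameters rather than individual bytes, and all class and field names are invented:

```python
from collections import deque

# Hypothetical sketch of the dataflow firing rule described above.
# Each "instruction" holds a parameter buffer, a count of inputs still
# missing, and a list of (destination, slot) copies for its output.

class Instr:
    def __init__(self, name, op, params_remaining, params):
        self.name = name
        self.op = op                       # callable standing in for a function unit
        self.params_remaining = params_remaining
        self.params = params               # input parameter buffer
        self.outputs = []                  # list of (dest_instr, dest_slot)

def run(program):
    # Instructions whose inputs are all constants are ready immediately.
    ready = deque(i for i in program if i.params_remaining == 0)
    results = {}
    while ready:
        instr = ready.popleft()
        out = instr.op(*instr.params)
        results[instr.name] = out
        # Copy the result to each receiver and decrement its counter;
        # a receiver whose counter hits zero becomes ready to run.
        for dest, slot in instr.outputs:
            dest.params[slot] = out
            dest.params_remaining -= 1
            if dest.params_remaining == 0:
                ready.append(dest)
    return results

# Tiny dataflow graph computing (2 + 3) * 4.
add = Instr("add", lambda a, b: a + b, 0, [2, 3])     # all constants: ready
mul = Instr("mul", lambda a, b: a * b, 1, [None, 4])  # waits for one input
add.outputs = [(mul, 0)]
print(run([add, mul])["mul"])  # -> 20
```

Note that the ready queue here is sequential only for simplicity; nothing in the rule itself orders two instructions whose counters reach zero independently, which is where the parallelism would come from.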
My attempt at writing a ToUpper function in a hypothetical ISA; the format is:
instruction(inputs)(outputs)
Outputs can take over variable names or be discarded into the void.
@readLine func , , , ;
readStream [readLine:3]; //returns ,
append [readLine:4], [readLine+1:1]; //returns ,
sub [readLine:1], [readLine+1:1]; //returns ,
ifElseZero [readLine], [readLine:0], [readLine+3:1], // selects the first or second function to jump to depending on whether the third parameter is zero
[readLine:1], [readLine+1:0], [readLine+2:0]; // the return is identical to the function declaration
func readLine( delimiter, cin, buffer){
readStream(cin, void)(cin, char);
append(buffer, char)(buffer, void);
sub(delimiter, char)(void, comp);
returnOnZero(comp, delimiter, cin, buffer);//or returnOnZero(comp);
}
func writeBuffer( cout, <iterator<> pos, <iterator<> end){
<iterator<> pos(buffer.begin), end(buffer.end)
readIterator(pos, void)(void, char);
incrementIterator(pos)(pos);
writeStream(cout, char)(cout, void);
sub(pos, end)(void, comp);
returnOnZero(comp, cout, pos, end);
}
func toUpper( cin, cout){
buffer();
delimiter(10), diff(32), lower('a'), upper('z');
readLine(delimiter, cin, buffer)(delimiter, cin, buffer);
apply(buffer, x >= lower && x <= upper ? x - diff : x)(buffer);//how to implement SIMD?
<iterator<> pos(buffer.begin), end(buffer.end);
writeBuffer(cout, pos, end)(cout, pos, end);
return(cin, cout);
}
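For comparison, here is what the toUpper pipeline above computes, written as ordinary sequential code. A Python sketch under the same assumptions as the listing: delimiter 10 is the newline byte, diff 32 is the ASCII case offset, and the 'a'..'z' range guards the subtraction:

```python
# Sequential equivalent of the dataflow toUpper above: read up to the
# newline delimiter (ASCII 10), shift 'a'..'z' down by 32 to upper-case
# each byte, and leave everything else untouched.

def to_upper_line(data: bytes) -> bytes:
    line, _, _ = data.partition(b"\n")  # readLine up to delimiter(10)
    out = bytearray()
    for x in line:
        if ord("a") <= x <= ord("z"):   # lower('a') <= x <= upper('z')
            x -= 32                     # diff(32) from the listing
        out.append(x)                   # apply(...) over the buffer
    return bytes(out)                   # writeBuffer

print(to_upper_line(b"hello World\nrest"))  # -> b'HELLO WORLD'
```

The per-byte transform in apply() has no cross-byte dependency, which is exactly why the SIMD question in the listing's comment arises: every byte's counter could reach zero at once.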
The file format would include a table with the locations of all function unit entry points (and also instructions), allowing for embarrassingly parallel program load.
Self-modifying code is not allowed, except for the modification of jump locations, restricted to a predefined list of allowed destinations (function unit entry points). The program is never given actual jump addresses, only indexes/keys.
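The indexes/keys idea can be sketched as follows (hypothetical Python; the class, its methods, and the table contents are all invented for illustration):

```python
# Hypothetical sketch: the program only ever holds small integer keys;
# the loader owns the table that maps keys to real entry points, so a
# corrupted key can at worst select a different *valid* entry point,
# never an arbitrary or unaligned address.

class JumpTable:
    def __init__(self, entry_points):
        self._entries = list(entry_points)  # real targets, loader-private

    def jump(self, key):
        if not (0 <= key < len(self._entries)):
            raise ValueError("invalid jump key")  # the die() case below
        return self._entries[key]

table = JumpTable([lambda: "readLine", lambda: "writeBuffer"])
print(table.jump(1)())  # -> writeBuffer
```

Because the table is populated at load and never exposed, the check reduces to a bounds test on the key rather than a validity test on an address.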
Possible embarrassingly parallel OOO security measures may include:
func jump_check(dest, parameters, definition){
if(!set_of_all_entry_points.in_set(dest) || parameters != definition){die();}
}
func instruction_check(instruction){
if(!set_of_all_loaded_instructions.in_set(&instruction << sizeof(instruction) | instruction)){die();}
}
That will hopefully stop all code injection and unaligned jump attacks.