By: Etienne Lorrain (etienne_lorrain.delete@this.yahoo.fr), March 24, 2021 1:13 am
Room: Moderated Discussions
Veedrac (ignore.delete@this.this.com) on March 23, 2021 9:31 am wrote:
> Heikki Kultala (heikk.i.kultal.a.delete@this.gmail.com) on March 23, 2021 6:46 am wrote:
> > Veedrac (ignore.delete@this.this.com) on March 23, 2021 3:49 am wrote:
> > > Heikki Kultala (heikki.kultal.a.delete@this.gmail.com) on March 22, 2021 6:52 pm wrote:
> > > >
> > > > The implementation is always free to execute the microthreads sequentially (the common case if all the
> > > > hardware microthreads are already in use, for example started by an outer-level function); the programmer
> > > > can write his code, or the compiler can compile it, as if an infinite number of microthreads were
> > > > available. As the bundles execute atomically, different microthreads can still do things like incrementing
> > > > the same counter in memory, but since they may execute sequentially, they are not allowed
> > > > to wait for data from another microthread, because that might cause a deadlock.
> > >
> > > Personally I expect this to be very limiting, because it means you must spawn microthreads at the
> > > top level in an order that completely respects the dependencies of all the sub-threads.
> >
> > These are not meant to be spawned at the top level at all. At top level,
> > you spawn normal threads and execute those on totally different cores.
> >
> > These are meant for a limited thing. They are only meant to be a slight bonus
> > on top of a core that has excellent per-thread-performance by other means.
> >
> > These microthreads are for only fully parallel very small
> > things, meant to be used in very small granularity.
> > These are not meant to be replacing normal threads on existing code but to be used for things where threads
> > currently cannot be used because of the overheads. And typically inserted automatically by the compiler.
> >
> > For example, if you have a (small) fully data-parallel
> > for loop, then in addition to vectorizing it, you may also
> > split it into 2-4 parts and launch a microthread for each
> > part. Compared to normal threads, the benefit of these
> > microthreads is much smaller overhead, so there is no
> > need to analyze whether the loop has a big enough iteration
> > count for the threading to be beneficial, and it accelerates those cases where the iteration count is quite small.
> >
> > Or if you call the same pure function (which takes around 50 clock cycles) a couple of times with
> > different parameters, you can spawn a separate microthread for each function call.
>
> Yes, we're on the same page here, just suffering from the lack of standard terminology.
>
> What I mean is that you might well want to compile g(f()) to spawn both g and f simultaneously, if their
> contents contain a parallelizable subset, but you can't do this if you can't pass values horizontally. This
> problem occurs even if g(x) actually just passes x to some g'(x); here f and g are ‘top level’.
>
> Similarly, imagine you had two passes over an array, such that they have to be run in
> order for any given element, but could happily desync arbitrarily far otherwise. This
> is very hard to express or compile for without some sort of horizontal communication.
I have a small problem if you are trying to start microthreads of less than a hundred or so cycles: if a processor with 200 instructions in flight stalls with a single thread (probably waiting for inputs), it is likely that both microthreads will also stall for the same reason. Then the best/simplest fix is probably to add more hardware so that the single thread does not stall.
If the microthreads do something completely different (like one of them zeroing the next malloc() block; there are probably a lot of allocations in today's software), the two microthreads have less chance of requiring the same processor hardware and being stalled at the same time.