By: rwessel (rwessel.delete@this.yahoo.com), August 26, 2022 10:37 am
Room: Moderated Discussions
Kevin G (kevin.getting.delete@this.cerner.com) on August 26, 2022 9:52 am wrote:
> rwessel (rwessel.delete@this.yahoo.com) on August 25, 2022 11:10 am wrote:
> > Kevin G (kevin.delete@this.cubitdesigns.com) on August 25, 2022 8:47 am wrote:
> > > Mark Roulo (nothanks.delete@this.xxx.com) on August 24, 2022 4:25 pm wrote:
> > > > Kara (karaardalan.delete@this.gmail.com) on August 24, 2022 2:07 pm wrote:
> > > > >
> > > > > Since like my second year programming I stopped writing branched code totally,
> > > > > few tens of thousands lines of code in, I haven't encounter anything that
> > > > > intrinsically requires speculative branch prediction, like logically.
> > > > >
> > > > > But I write only numes.
> > > > >
> > > > > Is there any program that can't be written without a branching that can't
> > > > > be rephrased (literally, all that is is rephrasing) into branchless?
> > > > >
> > > > > Mind risc-v don't offer BPU, there's extensions but not in the requirements.
> > > >
> > > > How do you write code to do one thing if a condition holds and another thing if the condition doesn't hold?
> > > >
> > > > Example:
> > >
> > > For something like this, multiplication of the raw boolean and the pointer to the strings to print would
> > > work. Both "Hello" and "Goodbye" multiplications need to
> > > be processes and then summed to create a new pointer
> > > for the resulting value. This assumes that true is a simple bit of 1 and false is 0. There are three main
> > > operations using this style vs. two that are actually performed in the if-else statement.
> > >
> > > This technique was leveraged heavily for arcane platforms like IBM's Cell whose SPE units had utterly
> > > poor branching performance but could run through those calculations. Due to the awkward architecture,
> > > doing more but faster operations was faster in real time than simpler but costly branching.
> >
> >
> > Now why does that sound familiar...? ;-)
>
> True, I didn't see you earlier post. :) Though you only need to perform one logical test for b and store it as
> a boolean in your example. Potentially two casting operations maybe included for the multiplications depending
> what the complier would do. I imagine the basic principle can be done in assembler rather compactly. Our common
> point being that a branch instruction isn't leveraged but a basic logical test still is performed.
>
> > I'll add that the technique goes back a really long way. It's basically predication as supported by
> > any vector machine, and was described, I think, by Von Neumann, perhaps in the context of EDVAC.
>
> Yeah, I remember hearing about this decades ago back in my college days.
>
> One key thing for throughput here is that the logical test of b has to be completed and ready to use
> prior to the multiplications being executed to prevent a small stall. The advantage here over a typical
> branching path is that it is a register dependency stall(?) rather than requiring the entire execution
> pipeline to be flushed when a branch is execute. The processor is still able to do meaningful work
> outside that bit of code with the only excess execution being the extra multiplication.
True, but the fundamental issue of predication is that it's almost always slower than a correctly predicted branch. The correctly predicted branch occupies near zero time, while there's always working being done that's going to be thrown away with predication. On the other hand, branch mispredictions are quite expensive. Predication can make sense where a branch cannot be predicted with any accuracy, and the amount of work needing to be thrown away is not too high (that's mainly meant in the context of scalar code, but the justification for predication with vector operations is essentially the same).
> rwessel (rwessel.delete@this.yahoo.com) on August 25, 2022 11:10 am wrote:
> > Kevin G (kevin.delete@this.cubitdesigns.com) on August 25, 2022 8:47 am wrote:
> > > Mark Roulo (nothanks.delete@this.xxx.com) on August 24, 2022 4:25 pm wrote:
> > > > Kara (karaardalan.delete@this.gmail.com) on August 24, 2022 2:07 pm wrote:
> > > > >
> > > > > Since like my second year programming I stopped writing branched code totally,
> > > > > few tens of thousands lines of code in, I haven't encounter anything that
> > > > > intrinsically requires speculative branch prediction, like logically.
> > > > >
> > > > > But I write only numes.
> > > > >
> > > > > Is there any program that can't be written without a branching that can't
> > > > > be rephrased (literally, all that is is rephrasing) into branchless?
> > > > >
> > > > > Mind risc-v don't offer BPU, there's extensions but not in the requirements.
> > > >
> > > > How do you write code to do one thing if a condition holds and another thing if the condition doesn't hold?
> > > >
> > > > Example:
I'll note that the conditional operator (?) is a branch.
> > > > if (b)
> > > > printf("Hello");
> > > > else
> > > > printf("Goodbye");
> > > >
> > >
> > > For something like this, multiplication of the raw boolean and the pointer to the strings to print would
> > > work. Both "Hello" and "Goodbye" multiplications need to
> > > be processes and then summed to create a new pointer
> > > for the resulting value. This assumes that true is a simple bit of 1 and false is 0. There are three main
> > > operations using this style vs. two that are actually performed in the if-else statement.
> > >
> > > This technique was leveraged heavily for arcane platforms like IBM's Cell whose SPE units had utterly
> > > poor branching performance but could run through those calculations. Due to the awkward architecture,
> > > doing more but faster operations was faster in real time than simpler but costly branching.
> >
> >
> > Now why does that sound familiar...? ;-)
>
> True, I didn't see you earlier post. :) Though you only need to perform one logical test for b and store it as
> a boolean in your example. Potentially two casting operations maybe included for the multiplications depending
> what the complier would do. I imagine the basic principle can be done in assembler rather compactly. Our common
> point being that a branch instruction isn't leveraged but a basic logical test still is performed.
>
> > I'll add that the technique goes back a really long way. It's basically predication as supported by
> > any vector machine, and was described, I think, by Von Neumann, perhaps in the context of EDVAC.
>
> Yeah, I remember hearing about this decades ago back in my college days.
>
> One key thing for throughput here is that the logical test of b has to be completed and ready to use
> prior to the multiplications being executed to prevent a small stall. The advantage here over a typical
> branching path is that it is a register dependency stall(?) rather than requiring the entire execution
> pipeline to be flushed when a branch is execute. The processor is still able to do meaningful work
> outside that bit of code with the only excess execution being the extra multiplication.
True, but the fundamental issue of predication is that it's almost always slower than a correctly predicted branch. The correctly predicted branch occupies near zero time, while there's always working being done that's going to be thrown away with predication. On the other hand, branch mispredictions are quite expensive. Predication can make sense where a branch cannot be predicted with any accuracy, and the amount of work needing to be thrown away is not too high (that's mainly meant in the context of scalar code, but the justification for predication with vector operations is essentially the same).