By: Andrey (andrey.semashev.delete@this.gmail.com), April 2, 2021 3:00 pm
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on March 31, 2021 8:41 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on March 31, 2021 5:08 pm wrote:
> > anon2 (anon.delete@this.anon.com) on March 31, 2021 3:46 pm wrote:
> > > >
> > > > For example, I see no sign that the ARM 'tstart' instruction has a success predictor
> > > > behind it. And once again - without a hardware predictor, you can make up benchmarks
> > > > that show how well it works, but real life will bite you in the arse.
> > >
> > > Wouldn't that be purely microarchitectural? What kind of sign would you expect to see
> > > if they intended to implement such a thing (which I agree seems like a good idea).
> >
> > I agree that it could be seen as purely a microarchitectural detail, and not visible to users.
> >
> > However, even in that case, I'd expect there to be signs of it in the architecture definition.
> > For example, the 'tstart' instruction should look a lot more like a branch, so that the predictor
> > logic could act on it exactly that way, and just go to the fallback case.
> >
> > Another sign that ARM is not designing it with a transaction predictor in mind
> > is that the result register doesn't have a "predicted not successful" case.
> >
> > That said, both could be added later: the first by simply just treating the 'tstart/cbnz'
> > sequence as one fused instruction, and the second by adding a new error code. But since
> > it's not there architecturally in the initial version, I'd expect that software then
> > has to do the prediction for it, and then you're kind of stuck with that garbage.
> >
> > In fact, looking at the definition of 'tstart', I see all the same old
> > signs that "yup, software is supposed to guess whether to try again".
> >
> > And there is zero question that you absolutely need prediction.
> > Particularly with big transactions, you simply
> > cannot afford to do a lot of work, only to then cause a
> > failure just because of transaction size (and we know
> > that some transactions will be fundamentally too large, if you try to use 'tstart' for locking).
> >
> > If the hardware doesn't do it, then the software has to do it, and that involves having software
> > try to keep track of "this lock taker in this context has failed before due to transaction size
> > issues, so let's not do the HW TM now because we know it's likely going to fail again".
> >
> > That kind of thing is expensive to do in software. You need to have counters for
> > the success/failure cases, and you need to somehow associate those counters
> > with a particular code flow. Exactly like branch prediction hardware does.
> >
> > Honestly, anybody who tells me that software could do branch prediction is somebody who I
> > wouldn't let near a new architecture. So why the h*ll do people think that software should
> > do transaction success prediction? It's the exact same thing, with the exact same issues.
> >
> > Go look at the ARM papers, and tell me that there is any sign that they actually thought
> > about this all. Because I don't see it. I see them barreling down the exact same mistakes
> > that we've already seen with x86 and ppc, both of which have been abject failures.
> >
> > Anybody remember what the definition of insanity is, again?
>
> As someone who spent a ton of time working on HTM and speculative multi-threading, I'd like
> to echo Linus' view that handling prediction and transaction scope is a real problem.
>
> The system we designed at Strandera used dynamically sized transactions and did a lot of analysis
> to scope correctly. Without that, your life will get very unpleasant in a hurry.
>
> One of the key problems with early implementations (e.g., Sun's Rock) was that too many
> things could cause aborts. You really need to ensure that TX abort is really rare.
Given that aborts are inherently expensive, and that developers are supposed to avoid HTM in contexts where frequent aborts are a concern, wouldn't a predictor just waste silicon and power? Transactions are unlike branches, which may be taken or not taken at different sites and depending on input data.
Sure, a predictor would help when the probability of transaction success changes with operating conditions. However, as others have commented, it isn't clear how to implement one efficiently in hardware, and its behavior would be largely driven by software specifics, so perhaps a software predictor is the better fit?
As for minimizing the probability of aborts, I think this is where hardware could actually help, possibly in a more straightforward way. For example, make the transaction state less volatile, so that when a page fault or interrupt happens the kernel is able to save the transaction state and restore it before resuming. The hardware would have to verify that the restored state is still valid (i.e. that the affected cache lines are still unmodified in the cache) and abort the transaction only if it is not.
Another improvement would be to reduce the interference from the sibling hyperthread - perhaps by locking the affected cache lines in the cache, or by simply suspending the sibling hyperthread for the duration of the transaction.
There are probably other microarchitectural causes of transaction aborts, and I don't know how difficult they would be to tackle. But the end result could reduce the number of spurious aborts enough that, with the help of software developers, a hardware predictor would not be needed as much.