By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), April 3, 2021 10:41 am
Room: Moderated Discussions
Carson (carson.delete@this.example.edu) on April 1, 2021 11:13 pm wrote:
> I agree that a hardware predictor is probably required for a quality implementation,
> but there are two things that make it unlike other predictors:
>
> - It only gets unidirectional error information; if it predicts "don't try HTM", there's no indication that HTM would have worked. (So occasional re-tests of the HTM path are necessary.)
> - The costs are very unbalanced and variable. It's not like a branch where you want to pick the >50% prediction. Even more expensive things like prefetchers have cost ratios close enough to fixed to bake into the hardware. (You don't need a cost ratio more exact than your fuzzy prediction probability.)
>
> The only way to make a useful prediction is to combine the predicted probability of
> HTM succeeding with the (hopefully easily measurable) costs of the various paths.
I agree that both of the issues you mention are real, but I think your "solution" to them is a classic case of "perfect is the enemy of good".
The thing is, absolutely nobody has the kind of perfect information that your solution requires. Not the hardware, but most certainly not software either.
And hardware at least has a better notion of what the costs are than software does. In particular, hardware can estimate the abort costs fairly well, in ways software has no idea.
And the other cost you mention is not actually all that hard to at least give a fairly good approximation for. That factor of "C(fallback) - C(success)" is not some kind of very complex cost - in fact it should be pretty much a constant for a particular microarchitecture. It's basically "how much does transactional memory win", and it's basically the cost of locking.
Sure, that cost of locking will depend on the amount of contention on the lock that would have been elided, but considering that you had a transaction failure, you already should have a reasonable idea of the level of contention. And again, even if you don't have a great estimate, there's no question that the hardware has a better estimate than software would have.
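To put rough numbers on it (and these are purely illustrative placeholders - nothing here is a value real hardware exposes), the tradeoff the predictor is implicitly making is just:

    /* Sketch of the tradeoff: try the transaction when the expected win
     * beats the expected abort cost,
     *
     *     p_success * (C_fallback - C_success) > (1 - p_success) * C_abort
     *
     * where (C_fallback - C_success) is roughly "the cost of the lock you
     * would have taken", close to a per-microarchitecture constant.
     * The cycle counts below are made-up placeholders.
     */
    static int worth_trying_htm(double p_success)
    {
        const double win_per_success = 200.0;  /* ~cost of the elided lock, cycles */
        const double cost_per_abort  = 150.0;  /* rollback + wasted work, cycles   */

        return p_success * win_per_success > (1.0 - p_success) * cost_per_abort;
    }

The only input that really varies at run time is the success probability, and that is exactly the thing a failure counter approximates.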
And the unidirectional error handling really isn't some fundamental problem either.
Look here: if you know your abort cost, and you have some rough idea of how much you expect to win from doing the operation as a transaction, you now have a rough idea of how hard you should try to avoid aborts. Maybe you don't stop doing transactions after the first abort, but after you get a second abort (with some aging logic), you stop doing that transaction as a transaction for a longish while (again, some aging logic).
You know after two failures (again: "some aging logic", so "two failures" means basically "at least somewhat common") that you already lost a lot more than you would have won, so it's time to cut your losses, and stop doing that transaction entirely. At least for a good while - you can try again later to see if maybe things have changed. That "later" may be something really simple like "clear all transaction failure counters every million cycles" or whatever.
And you can obviously tune this - if you have a capacity failure, stop that transaction immediately, there's no point in failing multiple times at all.
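Spelled out as a sketch (written in C purely for readability - in reality this would be a few bits of per-entry state, and the table size, the threshold of two, and the abort-cause handling are all invented for illustration):

    #include <stdint.h>

    #define SLOTS 256              /* invented table size */
    #define GIVE_UP_THRESHOLD 2    /* "after two failures, cut your losses" */

    static uint8_t failures[SLOTS];

    /* Associate the counter with the code flow the same way a branch
     * predictor would: hash the transaction's start PC into the table. */
    static unsigned slot(uintptr_t xbegin_pc)
    {
        return (unsigned)(xbegin_pc >> 2) % SLOTS;
    }

    /* Predict: keep trying the transactional path until it has failed
     * often enough (and recently enough) to have eaten its own winnings. */
    int predict_try_transaction(uintptr_t xbegin_pc)
    {
        return failures[slot(xbegin_pc)] < GIVE_UP_THRESHOLD;
    }

    /* Capacity aborts will never succeed on retry, so they saturate the
     * counter immediately; other abort causes just bump it. */
    void record_abort(uintptr_t xbegin_pc, int was_capacity_abort)
    {
        unsigned i = slot(xbegin_pc);

        if (was_capacity_abort)
            failures[i] = GIVE_UP_THRESHOLD;
        else if (failures[i] < GIVE_UP_THRESHOLD)
            failures[i]++;
    }

    /* The "aging logic" / re-test: e.g. wipe the counters every million
     * cycles so the predictor eventually tries the HTM path again. */
    void age_all_counters(void)
    {
        for (unsigned i = 0; i < SLOTS; i++)
            failures[i] = 0;
    }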
But fundamentally, you don't need to be "perfect". You need to be better than what you are without the predictor. And honestly, that's not a very high bar to meet.
In the extreme case, you can approximate that cost/win thing as almost a constant (probably per-microarchitecture), and argue that the bigger the sequence you aborted, potentially the bigger the win was. It's pretty much what a software solution would have had to do anyway, but a software solution would have been prohibitively expensive, because updating some per-transaction counters for every transaction would pretty much entirely eat up any advantage you got from doing it as a transaction in the first place.
So the logic could be as simple as something like "age the failure counter by some factor every time you encounter a transaction start, and then try it again when the failure has aged away".
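In other words, something roughly like this, with the decay rate and threshold picked out of thin air:

    #include <stdint.h>

    /* The decay-on-start variant: one tiny record per transaction site,
     * with the failure count aged a little on every transaction start so
     * a blacklisted site eventually gets re-tried.  Constants are invented. */
    struct tx_site {
        uint8_t failures;   /* saturating failure count */
        uint8_t starts;     /* cheap aging tick */
    };

    int try_transaction(struct tx_site *s)
    {
        /* age: roughly "forget one failure every 64 starts" */
        if (++s->starts % 64 == 0 && s->failures > 0)
            s->failures--;

        return s->failures < 2;
    }

    void transaction_aborted(struct tx_site *s)
    {
        if (s->failures < 2)
            s->failures++;
    }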
End result: sure, your counting logic is different from a conditional branch, but everything else (ie how to actually associate the code flow that has the transaction with the counting logic) is still largely the same.
And yes, I'm handwaving wildly. You obviously need to actually build a very good microarchitectural simulator and do the modeling with real-life problems (which implies that your simulator is fast enough to run real-life problems) to really tune the thing. But that's no different from what any competent CPU architecture group already has to do anyway.
If you tell me that your CPU architecture group doesn't have a good high-performance simulator for your architecture that can run real loads, then you're telling me that your architecture group is incompetent and shouldn't have implemented a new feature like TSX in the first place.
And no, none of this is needed if you can statically prove that your transactions will succeed sufficiently often to make up for the failure cases. That's how all those "look how nice transactional memory" benchmarks work.
Linus