By: Michael S (already5chosen.delete@this.yahoo.com), August 29, 2014 8:34 am
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on August 29, 2014 7:35 am wrote:
>
> This also carries over into atomic. LL/SC from a hardware perspective is easy/simple. But overall its worse
> than doing the full set of real atomics. In theory LL/SC is better and can do everything, but because its better/easier,
> it generally ends up worse. And as we get into the future of a large number of cores per chip and start getting
> software that is actually coherent on those cores, the advantages of the hard atomic and the capability to leverage
> them to the fabric will likely win out.
I think you are mixing together two related, but not identical, issues: the fineness/coarseness of local primitives, and the optimism/pessimism of the atomic solution at system scale.
Yes, coarse-grained primitives tend to be implemented in a pessimistic style, i.e. at the beginning of the primitive the core gains M-or-equivalent ownership of the cache line and refuses any attempt to change that ownership until the primitive completes. But that's not the only way to do it; a [microcoded] implementation can still use a loopy, optimistic approach.
The same is true in the opposite direction: although it looks natural to implement fine-grained primitives in an optimistic style, that isn't the only option either. Local hardware is free, for example, to always fetch and decode the 2, 3 or 4 instructions that follow an LL before it even tries to execute it. If it finds that an SC follows in close proximity, with no memory accesses in between, it can decide to treat the whole sequence as a coarse-grained primitive and proceed pessimistically.
According to some comments in this thread, that's exactly how some Power processors did it.
> There is no reason why a FETCH_ADD has do be done at the core. In
> fact, in many cases it makes sense to do the FETCH_ADD where the line is located. AKA, ship the atomic, not
> the line. And this doesn't just apply at the coherence level, it makes just as much sense at the message passing
> level. And in both cases, it should be a performance and power win.
>
You assume a particular use case, specifically one where the next access to the memory location in question is likely to come either from another core or a long time later.
However, if the next access is more likely to come from the same core, and soon, then your performance and power win suddenly becomes a performance and power loss.