By: RichardC (tich.delete@this.pobox.com), May 14, 2013 12:23 pm
Room: Moderated Discussions
Jukka Larja (roskakori2006.delete@this.gmail.com) on May 14, 2013 11:34 am wrote:
> I think part of the problem is that the common PC nowadays has a 2C/4T processor (and not so long ago we were at
> 2C/2T). In the best case, you can have about 2.5 times the performance using 4 threads. Considering turbo boost,
> even less than that. Realistically, with some sensible effort for a not totally embarrassingly parallel case,
> you can probably expect less than two times the performance. I'm not sure that's worth it in most cases.
So the pitch to software developers goes like this:
- the CPU has 4 threads, not just 2, so why don't you use them all?
- oh, by the way, you only get about 3.0x throughput because
those extra threads are sharing core execution resources
- oh, by the way, if all 4 are running flat-out we're going
to throttle the clock speed, so it's really only 2.7x
- oh, by the way, they'll be sharing cache so your hit rate
will not be so good, so it's really only 2.5x
- oh, by the way, splitting your data 4 ways instead of 2 will
introduce more locking and scheduling overhead, so it may be
only 2.2x
- and it might not be faster at all, because things are just
complicated and there's no way of knowing until you do a bunch
of work and try it.
It's not really a surprise that developers for most applications have
not spent a lot of time chasing the elusive speedups from SMT.
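
To put rough numbers on that pitch, here is a small sketch that walks the chain of discounts. The speedups are the ballpark figures from the list above, not measurements. The 4.0x "ideal" starting point and the 2.0x baseline for simply running 2 threads on 2 cores are my assumptions for illustration, and the step labels and function name are made up.

```python
# Rough speedups relative to one thread (1.0x), taken from the list above.
# The 4.0x starting point is an assumed ideal, not a figure from the post.
steps = [
    ("ideal 4 threads (assumed)", 4.0),
    ("shared execution resources", 3.0),
    ("clock throttling", 2.7),
    ("shared cache", 2.5),
    ("locking/scheduling overhead", 2.2),
]


def step_losses(steps):
    """Return (label, speedup, fraction lost versus the previous step)."""
    out = []
    prev = steps[0][1]
    for label, speedup in steps:
        out.append((label, speedup, 1.0 - speedup / prev))
        prev = speedup
    return out


for label, speedup, loss in step_losses(steps):
    print(f"{label:30s} {speedup:.1f}x  (lost {loss:.0%} at this step)")

# Assumption: 2 threads on 2 cores scale to about 2.0x with no SMT involved.
two_thread_baseline = 2.0
gain_over_two_threads = steps[-1][1] / two_thread_baseline
print(f"gain over plain 2 threads: {gain_over_two_threads:.2f}x")
```

On those assumptions, the payoff for going from 2 threads to 4 is the last number printed, and that is before the "might not be faster at all" case.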