By: Laurent Birtz (seerdecker.delete@this.yahoo.com.au), November 23, 2012 1:45 pm
Room: Moderated Discussions
Do you guys know if the 4-uop frontend/retirement limit is the probable bottleneck in Haswell for a typical integer workload, e.g. video encoding?
The way I see it, up to four simple arithmetic operations can be done concurrently (assuming 1 cycle per uop), and there can be concurrent loads/stores, for a total of 6 uops/cycle processed in the execution units. For Nehalem, I used to consider the "mov" operations as mostly free since they could be done on another port while the two SSE ALUs were busy. Am I correct in believing that this assumption no longer holds true?
My reasoning is that although Haswell processes the "mov" uops in the renamer, those uops still need to be issued from the frontend (up to 4 per cycle), so every "mov" instruction has the potential to starve an execution unit of uops. If that analysis is correct, then "mov" instructions should generally be avoided in favor of 3-operand instructions and memory operands.
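To make the concern concrete, here is a rough back-of-the-envelope model of it. This is only a sketch: the function name, the parameters and the default widths (4 uops issued per cycle, 4 ALU ports, 2 memory uops per cycle) are assumptions taken from the numbers above, not measured Haswell behavior, and it ignores latency, dependencies, fusion and decode effects.

```python
def cycles_per_iteration(alu_uops, mem_uops, mov_uops,
                         issue_width=4, alu_ports=4, mem_ports=2,
                         mov_eliminated=True):
    """Estimate the steady-state cycles per loop iteration (hypothetical model).

    Every uop, including an eliminated "mov", takes one frontend issue slot.
    An eliminated "mov" takes no execution port; otherwise it is counted
    as one more ALU uop.
    """
    issued = alu_uops + mem_uops + mov_uops
    executed_alu = alu_uops if mov_eliminated else alu_uops + mov_uops

    frontend_bound = issued / issue_width
    alu_bound = executed_alu / alu_ports
    mem_bound = mem_uops / mem_ports

    # The slowest resource sets the throughput.
    return max(frontend_bound, alu_bound, mem_bound)


# Four ALU uops per iteration, with and without one extra "mov".
print(cycles_per_iteration(alu_uops=4, mem_uops=0, mov_uops=0))  # 1.0
print(cycles_per_iteration(alu_uops=4, mem_uops=0, mov_uops=1))  # 1.25
```

Under these assumptions, a loop that already keeps four ALUs busy becomes frontend-bound as soon as one "mov" is added, even though the "mov" never reaches an execution port.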
Thanks for any insight.



