By: Exophase (exophase.delete@this.gmail.com), May 7, 2013 9:00 am
Room: Moderated Discussions
Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on May 7, 2013 8:17 am wrote:
> It is similar indeed. A load-op may have its load dispatched to the memory reservation station
> in one cycle (possibly with a previous instruction), and in the next cycle it dispatches the
> ALU operation (possibly with a next instruction). That is simpler and more power efficient
> than cracking into uops much earlier (which would require larger buffers throughout).
>
If Silvermont is anything like Pentium M onwards, then a load + op will use one ROB/RS entry but have its two parts dispatched in two independent cycles, each whenever its operands are ready, and not necessarily two cycles apart.
> Whether you crack early or late, a load+op effectively uses 2 cycles, just like you wrote a separate
> load and alu instruction. So it is silly to claim macro instructions improve performance or make
> your CPU appear wider like Anand did. Unlike the old Atom where load+op could actually improve performance,
> you now want to avoid them like all other x86 cores unless they are single-use.
Unless there are unknown restrictions, and it's confirmed that the two parts can't be scheduled independently, I don't see why you'd want to avoid it. It still reduces pressure everywhere in the pipeline except the execution units, particularly at decode/retire, which can be one of the bigger bottlenecks. There are easily imaginable scenarios where you can get higher instruction throughput using load + op and load + op + store.
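
As a minimal, hypothetical sketch (the function and the exact instruction a compiler would emit are my assumptions, not anything Intel has documented): an array-sum loop is the sort of place where an x86 compiler can fold the load into the add, so the loop body is one load + op instruction through decode/rename/retire instead of a separate mov + add pair, while still issuing a load uop and an ALU uop to the execution units.

    int sum_array(const int *a, int n)
    {
        /* Each iteration can compile to something like "add eax, [rsi]"
           plus loop overhead: one instruction at decode/rename/retire,
           two micro-ops at the execution units. The split form would be
           "mov ecx, [rsi]" then "add eax, ecx": same execution work, but
           two instructions through the front end and retirement. */
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }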
> Yes, it would be wasteful to go for 4 inputs and 2 outputs just to model the temporary, so it is likely
> handled specially. Eg. for load+op it is always written once, then read once, and dead after that.
I wonder how AMD does it. Jaguar, for instance (and I presume its predecessors), decodes load + op + store instructions into macro-ops containing an op micro-op and a load + store micro-op. I don't see why you can't use the same allocation for the load destination and the op's source + destination, even though it means it has to be both an input and an output.
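
To make that concrete, here's a toy sketch of the cracking I have in mind, with an entirely made-up micro-op encoding (the struct and field names are mine, not AMD's): the same temporary allocation T appears as the load's destination, as the op's source and destination, and as the store's data source.

    /* Hypothetical micro-op fields; nothing here reflects Jaguar's real encoding. */
    struct uop {
        int src1, src2;   /* physical register sources     */
        int dst;          /* physical register destination */
    };

    /* A load+op+store like "add [rdi], eax" cracked into two micro-ops sharing
       one temporary allocation T: the load+store uop writes T with the loaded
       value and later stores T, while the ALU uop reads T (plus the register
       operand) and writes its result back into T. One allocation, serving as
       both an input and an output. */
    struct cracked {
        struct uop ldst;  /* src1 = address reg, src2 = store data (T), dst = T */
        struct uop alu;   /* src1 = T, src2 = register operand, dst = T         */
    };

    static struct cracked crack_add_mem_reg(int addr_reg, int data_reg, int T)
    {
        struct cracked m;
        m.ldst = (struct uop){ .src1 = addr_reg, .src2 = T, .dst = T };
        m.alu  = (struct uop){ .src1 = T, .src2 = data_reg, .dst = T };
        return m;
    }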