By: Dresdenboy (mw212.delete@this.informatik.uni-rostock.de), August 21, 2004 5:01 am
Room: Moderated Discussions
Anonymous (nospam@nospam.com) on 8/20/04 wrote:
---------------------------
>>BTW, the code contains ~30% register copying MOV instructions. I'd like to see
>>x86 MPUs address this wasting of execution units in the future.
>
>Opteron already does this, by eliminating moves as part of the register renaming
>process. I think it was discussed in the Chip Architect article on the K8 core.
>This is also trivial to do in binary translation systems (Transmeta does it extensively).
>As far as I know, the only chip that doesn't is Intel.
I think the only benefit then is that the execution unit has nothing to do and thus saves power. However, the MOVs still take up decode bandwidth and space in the scheduler queues and the ROB. If K9 gets the lookahead and operand collapse units described in many patents, this situation could improve. Then the negative effect of the MOVs, which are mostly necessary because of the 2-operand format (except for LEA), could be eliminated.
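To illustrate what I mean, here is a toy Python sketch of move elimination at the rename stage. It is not the actual K8 design, only a model assuming MOVs are handled by copying a rename-table entry, as the quoted post suggests. The names `rename`, `program` and the `stats` counters are made up for this example, and physical register freeing/reference counting is left out.

```python
from itertools import count

def rename(program, arch_regs):
    """Rename (op, dest, src...) tuples and count the resources each one uses."""
    next_phys = count()
    # Rename table: architectural register -> physical register number.
    table = {reg: next(next_phys) for reg in arch_regs}
    stats = {"decoded": 0, "rob_entries": 0, "executed": 0}
    renamed = []
    for op, dest, *srcs in program:
        stats["decoded"] += 1      # every instruction still needs a decode slot
        stats["rob_entries"] += 1  # and a ROB entry so it can retire in order
        phys_srcs = [table[s] for s in srcs]
        if op == "mov":
            # Eliminated move: dest simply aliases the source's physical register.
            table[dest] = phys_srcs[0]
        else:
            # Real operation: allocate a new physical register and execute.
            table[dest] = next(next_phys)
            stats["executed"] += 1
        renamed.append((op, table[dest], phys_srcs))
    return renamed, stats

# 2-operand x86 style: compute ecx = eax + ebx without destroying eax.
program = [
    ("mov", "ecx", "eax"),         # mov ecx, eax
    ("add", "ecx", "ecx", "ebx"),  # add ecx, ebx
]
renamed, stats = rename(program, ["eax", "ebx", "ecx"])
print(stats)
```

In this model the MOV never reaches an execution unit, but it is still counted under `decoded` and `rob_entries`, which is exactly the cost I'm complaining about.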
That's why I'm interested in creating some custom K8 microcode. AFAIK the RISC86 ops have Src1, Src2 and Dest1 fields, so they have a 3-operand format. And if I knew the microinstructions I need for my optimization, I wouldn't spend more than 2 days on the implementation. I'm talking about a key algorithm/function with just 20-30 uOps. BTW the temporary registers would offer another possibility, since 16 GPRs/SSE2 regs are still not enough :)