By: Jeffrey Bosboom (firstinitiallastname.delete@this.firstnamelastname.com), November 4, 2022 10:37 pm
Room: Moderated Discussions
Mark Roulo (nothanks.delete@this.xxx.com) on November 4, 2022 8:34 pm wrote:
> Is your question: Why would a CPU not allow two independent 256-bit vector instructions
> to execute simultaneously in the top and bottom halves of a 512-bit vector?
Sorry, my question is muddled because I am a bit confused. Let me expand, and you can correct me where I'm wrong or explain what I'm not seeing correctly.
I see (at least) four design points here:
0) One 256-bit unit. Crack 512-bit instructions into two 256-bit uops, executed sequentially. Minimizes execution unit area and register file port count and width while supporting 512-bit ISA for decreased code size or software compatibility.
1) Two 256-bit units. Crack 512-bit instructions into two 256-bit uops, which are scheduled freely alongside native 256-bit uops. Increases execution unit utilization by allowing [256, 512 first half] [512 second half, 256] pairing; requires more but narrower register file ports.
2) One 512-bit unit that can execute one 512-bit instruction or two 256-bit instructions. Allows a single 256-bit instruction to block a 512-bit instruction, leaving half the unit idle (or the scheduler stalls the 256-bit instruction until it can pair, increasing latency), but not cracking 512-bit means fewer uops through the pipeline and in the uop cache. Requires the same number of register file ports as 1) when executing 256-bit ops, but also needs wide ports for 512-bit ops.
3) One 512-bit unit that can execute one instruction regardless of width. Wastes half of the unit when executing 256-bit instructions. Requires fewer but wider register file ports.
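To make the utilization differences concrete, here is a rough toy model of the four design points. It is only a sketch under assumptions I'm making up for illustration: every instruction is independent, every uop takes one cycle, issue is in order, and register file ports are never the bottleneck. The function names are hypothetical, and design 2 is modeled with the "leave half the unit idle" policy rather than the stalling one.

```python
from math import ceil

def cycles_design0(widths):
    # One 256-bit unit: a 512-bit instruction becomes two sequential uops.
    return sum(w // 256 for w in widths)

def cycles_design1(widths):
    # Two 256-bit units: everything is a 256-bit uop, packed two per cycle.
    uops = sum(w // 256 for w in widths)
    return ceil(uops / 2)

def cycles_design2(widths):
    # One 512-bit unit: per cycle, either one 512-bit instruction or two
    # adjacent 256-bit instructions. A lone 256-bit instruction wastes
    # half the unit (assumed policy: issue it alone rather than stall).
    cycles = 0
    i = 0
    while i < len(widths):
        if widths[i] == 256 and i + 1 < len(widths) and widths[i + 1] == 256:
            i += 2  # pair two 256-bit instructions in one cycle
        else:
            i += 1  # a 512-bit instruction, or an unpaired 256-bit one
        cycles += 1
    return cycles

def cycles_design3(widths):
    # One 512-bit unit: one instruction per cycle regardless of width.
    return len(widths)

if __name__ == "__main__":
    stream = [256, 512, 256]  # the mixed case from design point 1
    for name, fn in [("0", cycles_design0), ("1", cycles_design1),
                     ("2", cycles_design2), ("3", cycles_design3)]:
        print(f"design {name}: {fn(stream)} cycles")
```

For the mixed stream [256, 512, 256], this model gives 4, 2, 3, and 3 cycles for designs 0 through 3. It ignores exactly the things that might favor 2 or 3 (uop count through the front end, scheduler complexity, port width), which is why I'm asking.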
I can see why a designer would choose 0 for a low-perf, low-area design. Of the other three, 1 seems clearly better than 2 or 3. So my questions are:
- Why would a designer choose 2 over 1?
- Why would a designer choose 3 over 1 or 2?