By: dmcq (dmcq.delete@this.fano.co.uk), August 18, 2021 2:56 pm
Room: Moderated Discussions
Hugo Décharnes (hdecharn.delete@this.outlook.fr) on August 18, 2021 10:34 am wrote:
> Following a previous thread on register banking, I would like to share
> a micro-architecture I have been thinking about for quite some time.
>
> Under Section 107 of the Copyright Act of 1976, allowance is made for "fair use" for purposes
> such as criticism, comment, news reporting, teaching, scholarship, and research.
>
> When widening the execution bandwidth of an out-of-order processor, some structures become contended:
> - Register renaming logic: for each renamed instruction, its source operand physical register tag
> can come from another renamed (older in program order) instruction. This is even worse with move elimination,
> as each instruction's destination physical register tag can come from its source operand and is then
> driven as an input to the source-operand collection of further instructions, creating a dependence chain.
> (This is why the move elimination depth is often limited.)
> - Instruction dispatch logic, which has to steer instructions to a larger number of issue queues
> while attempting to balance them.
> - Instruction wake-up circuitry and operand forwarding paths, whose high fan-out threatens frequency
> and place & route.
> - Register file read and write port logic, which increases the access time.
> The decode stage and load data paths, of course, also become contention
> points; however, the present micro-architecture does not address them.
>
> Here is the >>> link <<< to the image depicting the micro-architecture.
>
> It relies on a banked register file that is architecturally invisible. Unlike
> most clustered designs, where the wake-up latency is longer when an instruction
> depends on another executed on the opposite cluster, the latency here is uniform.
> The physical register file banks are named "O" for odd and "E" for even. In the depicted design,
> there are 6 pairs of rename lanes. Each rename lane can handle a micro-op with up to two source
> operands and one destination operand. Each cycle, six free physical registers are selected from
> the odd bank and six from the even one, and one of each is given to each pair of rename lanes. (This
> rule is important as it further eases dispatch.) Which lane of a pair gets the odd physical register
> and which gets the even one is chosen randomly, to avoid bandwidth degradation due to specific patterns.
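To make the allocation rule concrete, here is a minimal behavioural sketch in Python. The free-list representation, the function name and the usage example are my own assumptions; only the six-pair width and the random odd/even coupling come from the description above.

import random

PAIRS = 6  # six pairs of rename lanes, as in the depicted design

def allocate_destinations(free_odd, free_even):
    # Pop six free physical registers from each bank and hand one
    # odd/even couple to every pair of rename lanes.  Which lane of a
    # pair gets the odd register and which the even one is randomised,
    # as described above.
    assert len(free_odd) >= PAIRS and len(free_even) >= PAIRS
    per_pair = []
    for _ in range(PAIRS):
        couple = [('O', free_odd.pop()), ('E', free_even.pop())]
        random.shuffle(couple)
        per_pair.append(tuple(couple))  # (first lane's dest, second lane's dest)
    return per_pair

# Hypothetical usage: odd-numbered and even-numbered free lists.
# allocate_destinations(list(range(1, 64, 2)), list(range(0, 64, 2)))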
> Since each pair of rename lanes can output at most two instructions, each exclusively bound to a different
> bank through its destination physical register, we know that at most 6 instructions will write the odd
> register file and 6 the even one. Instructions are thus selectively presented to one of the two 6-wide dispatch
> lanes. The dispatch lanes feed 4 of the 8 dispatch buffers, each of which handles a unique combination
> of banks for the two source and the destination operands of the instructions it can receive.
> Dispatch buffers act as FIFOs. Their goal is, as in many designs, to absorb and smooth the
> high and irregular dispatch bandwidth (up to 6 instructions targeting a dispatch buffer) before
> the issue queues. They also provide (relatively) low-cost storage to maximize the issue queues'
> fill ratio. Instructions are finally dispatched, in order, to one of two issue queues.
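The "one buffer per bank combination" rule suggests a very simple steering function. A sketch in Python, where the bank encoding and the buffer numbering are my assumptions:

# Banks are 'O' or 'E'; 2 x 2 x 2 = 8 combinations -> 8 dispatch buffers.
def dispatch_buffer(src1_bank, src2_bank, dst_bank):
    bit = lambda bank: 1 if bank == 'O' else 0
    return (bit(src1_bank) << 2) | (bit(src2_bank) << 1) | bit(dst_bank)

# With this numbering, the dispatch lane for odd destinations only ever
# reaches the four buffers whose low bit is 1; the even-destination lane
# reaches the other four.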
> Each issue queue tracks up to two source operands and one destination operand, again each bound to
> a physical register file bank. Each queue has two issue ports: one targeting an execution unit
> for instructions with two source and one destination operands, and one for instructions with either
> one source and one destination operand (such as loads) or two source operands
> (such as stores). The pattern is such that two neighboring issue queues share an execution unit.
> This permits good balancing between the two issue queues fed by the same dispatch buffer.
> Finally, the instruction to be executed is granted from one of the
> two issue queues that have selected an instruction for that unit.
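A toy model of that grant step might look like the following Python. The oldest-first tie-break is purely my assumption; the description only says that one of the two selected instructions is granted the shared unit.

def grant_shared_unit(candidate_a, candidate_b):
    # Each of the two neighbouring issue queues may have selected an
    # instruction for the shared execution unit; grant at most one.
    # Candidates are (sequence_number, uop) tuples, or None.
    if candidate_a is None:
        return candidate_b
    if candidate_b is None:
        return candidate_a
    # Assumed policy: prefer the older instruction (lower sequence number).
    return candidate_a if candidate_a[0] <= candidate_b[0] else candidate_b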
>
> This micro-architecture provides some benefits:
> - A high execution bandwidth can be reached with half the number of read and write ports on each physical
> register file bank, and half the number of wake-up and forwarding paths, compared to what would be needed
> without the banking.
> - It ensures ease of dispatch and relatively high occupancy in the issue queues.
> It has, however, many shortcomings:
> - Though the rename bandwidth can be changed (while still being a multiple of 2), the number of dispatch buffers
> and issue queues, and the number and types of execution units, cannot.
> - It does not work for instructions that have three or more source operands, or two or more destination operands.
> - Instructions have no freedom in which dispatch buffer they go to, and nearly none in which issue queue. This
> can lead to imbalance if one architectural register is read many times before being overwritten (such as the
> stack pointer), as it will tend to steer instructions to only half of the dispatch buffers.
> - Which port should execute branches and stores is not well defined. Loads can surely go to the ports with one
> source and one destination operand, and that will roughly match the ratio for most workloads. On the other hand,
> if branches and stores are both issued through the two-source-operand port, they will become a bottleneck. And
> if branches are issued alongside ALU instructions (through the two-source, one-destination ports), they will
> become the bottleneck.
> In conclusion, this design is not perfect. But it enables high execution bandwidth without the usual cost, and
> with few constraints. One downside of traditional clustered architectures is the non-uniform wake-up and
> forwarding delay between physical register file banks. For those designs, the hardware must ideally and
> intelligently split the workload across the two clusters so that little traffic exists between them. Here, there
> is no such delay, though the dispatch constraints can be considered high. The design is also not intended to
> solve the instruction fetch, decode, and load-pipe issues that arise with wide superscalar processors.
>
> Hugo Décharnes.
I think the Alpha 21264 scheme sounds rather similar, except it wasn't random: it depended on whether the original register number was odd or even.
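For contrast with the random coupling described above, the static variant would look something like this (a sketch of the idea only; I am not claiming this is the actual 21264 steering logic):

def bank_from_arch_reg(arch_reg):
    # Static steering: pick the bank from the parity of the architectural
    # register number rather than randomly.
    return 'O' if arch_reg % 2 else 'E'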