Very-large superscalar execution without the cost

By: dmcq (dmcq.delete@this.fano.co.uk), August 18, 2021 2:56 pm
Room: Moderated Discussions
Hugo Décharnes (hdecharn.delete@this.outlook.fr) on August 18, 2021 10:34 am wrote:
> Following a previous thread on register banking, I would like to share
> a micro-architecture I have been thinking about for quite some time.
>
> When widening the execution bandwidth of an out-of-order processor, some structures become contended:
>
> • Register renaming logic: for each renamed instruction, its source operand physical register tag
> can come from another renamed instruction (older in program order). This is even worse with move
> elimination, as each instruction's destination physical register tag can come from its source
> operand, then be driven as input to the source operand collection of further instructions, thus
> creating a dependence chain. (This is the reason why the move elimination depth is often limited.)
> • Instruction dispatch logic, which has to steer to a higher number of issue queues while
> attempting to balance them.
> • Instruction wake-up circuitry and operand forwarding paths, whose high fan-outs threaten
> frequency and place & route.
> • Register file read and write port logic, which increases the access time.
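To illustrate that first contention point, here is a purely illustrative Python sketch of group renaming (the names and structure are mine, not from the post). The nested scan of all older in-group destinations is the comparison network that grows quadratically with rename width:

```python
# Illustrative sketch (my own, not from the post) of why wide renaming is
# expensive: every source must be compared against the destinations of all
# older instructions renamed in the same cycle.

def rename_group(instructions, rat, free_regs):
    """Rename one group of instructions in a single cycle.

    instructions: list of (dest_arch, src_arch_list), in program order.
    rat: dict mapping architectural register -> physical register tag.
    free_regs: free physical registers to hand out, consumed in order.
    """
    renamed = []
    group_dests = []  # (arch_reg, new_phys) of older in-group instructions
    for dest, srcs in instructions:
        src_tags = []
        for s in srcs:
            tag = rat[s]  # default: read the RAT
            # Scan all older in-group destinations; the youngest match wins.
            # This all-to-all comparison is the scaling problem.
            for d_arch, d_phys in group_dests:
                if d_arch == s:
                    tag = d_phys
            src_tags.append(tag)
        new_phys = free_regs.pop(0)
        group_dests.append((dest, new_phys))
        renamed.append((new_phys, src_tags))
    for d_arch, d_phys in group_dests:
        rat[d_arch] = d_phys  # commit RAT updates after the group
    return renamed
```

With move elimination, a destination tag can itself come from a source tag, so the matches chain through the group instead of being independent comparisons.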

> There are, of course, the decode stage and load data paths that become contention
> points. However, the present micro-architecture does not address them.
>
> Here is the >>> link <<< to the image depicting the micro-architecture.
>
> It relies on a banked register file that is architecturally invisible. Unlike
> most clustered designs, where the wake-up latency is longer when an instruction
> depends on another executed on the opposite cluster, here the latency is uniform.
> The physical register file banks are named "O" for odd and "E" for even. In the depicted design,
> there are 6 pairs of rename lanes. Each rename lane can handle a micro-op with up to two source
> operands and one destination operand. Each cycle, six free physical registers are selected from
> the odd bank and six from the even one, and one of each is given to each pair of rename lanes. (This
> rule is important, as it further eases dispatch.) The coupling of an odd or even physical register
> is made randomly, to avoid bandwidth degradation due to specific patterns.
> As each pair of rename lanes can output at most two instructions, each exclusively bound to a
> different bank by its destination physical register, we know 6 instructions will write the odd
> register file and 6 the even one. Instructions are thus selectively presented to one of the two
> 6-wide dispatch lanes. The dispatch lanes feed 4 of the 8 dispatch buffers, each handling a unique
> combination of banks for the two source and the destination operands of the instructions it can receive.
> Dispatch buffers act as FIFOs. Their goal is, as in many designs, to absorb and smooth the
> high and irregular dispatch bandwidth (up to 6 instructions targeting a dispatch buffer) before
> the issue queues. They also provide (relatively) low-cost storage to maximize the issue queues'
> fill ratio. Instructions are finally dispatched, in order, to one of two issue queues.
> Each issue queue tracks up to two source operands and one destination operand, again each bound
> to a physical register file bank. Each one has two issue ports: one targeting an execution unit
> for instructions with two source operands and one destination operand, and one for instructions
> with either one source and one destination operand (such as loads) or two source operands
> (such as stores). The pattern is such that two neighboring issue queues share an execution unit.
> This permits good balancing between the two issue queues fed by the same dispatch buffer.
> Finally, the instruction to be executed is granted from whichever of the
> two issue queues has selected an instruction for that unit.
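The allocation and steering rules above can be sketched in a few lines of Python. This is my own reading of the scheme, and the buffer numbering is an assumption; the post only says each buffer handles a unique (src1 bank, src2 bank, dest bank) combination:

```python
# Illustrative sketch (names and encoding are mine) of the odd/even register
# allocation and dispatch-buffer steering described above.
import random

def bank_of(phys_reg):
    """1 for the odd ("O") bank, 0 for the even ("E") bank."""
    return phys_reg % 2

def allocate_pair(free_odd, free_even):
    """Give each pair of rename lanes one odd and one even free register,
    coupled in random order to avoid pathological allocation patterns."""
    pair = [free_odd.pop(0), free_even.pop(0)]
    random.shuffle(pair)
    return pair  # one destination register for each lane of the pair

def dispatch_buffer(src1, src2, dest):
    """Pick one of the 8 dispatch buffers, one per unique combination of
    (src1 bank, src2 bank, dest bank). The 6-wide odd-dest dispatch lane
    only ever feeds the 4 buffers whose dest bank is odd; likewise for
    the even lane."""
    return bank_of(src1) * 4 + bank_of(src2) * 2 + bank_of(dest)
```

Since each pair of lanes gets exactly one odd and one even destination, exactly 6 instructions per cycle target each bank by construction, which is what makes the split into two 6-wide dispatch lanes trivial.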
>
> This micro-architecture provides some benefits:
>
> • A high execution bandwidth can be reached, with half the number of read and write ports to each
> physical register file bank, and half the number of wake-up and forward paths, compared to what
> would be needed without the banking.
> • It ensures ease of dispatch, and relatively high occupancy in the issue queues.
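To put rough numbers on the halved-port claim: the 12-wide width matches the 6+6 rename lanes above, but the 2-read/1-write per-instruction budget is my assumption, not Hugo's.

```python
# Back-of-envelope check of the halved-port claim. WIDTH follows the 6+6
# rename lanes in the post; the per-instruction read/write budget is assumed.

WIDTH = 12
READS_PER_INSN, WRITES_PER_INSN = 2, 1

# A monolithic register file must serve every instruction in the group.
monolithic_read_ports = WIDTH * READS_PER_INSN
monolithic_write_ports = WIDTH * WRITES_PER_INSN

# Each bank only serves the 6 instructions bound to it per cycle.
per_bank_read_ports = (WIDTH // 2) * READS_PER_INSN
per_bank_write_ports = (WIDTH // 2) * WRITES_PER_INSN
```

Under those assumptions a monolithic file would need 24 read and 12 write ports, while each bank gets by with 12 and 6, and port count hits access time superlinearly, so the saving is more than a factor of two in practice.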

> It has, however, many shortcomings:
>
> • Though the rename bandwidth can be changed (while remaining a multiple of 2), the number of
> dispatch buffers and issue queues, and the number and types of execution units, cannot.
> • It does not work for instructions that have three or more source operands, or two or more
> destination operands.
> • Instructions have no freedom in the choice of dispatch buffer, and nearly none in the choice of
> issue queue. This can lead to imbalance if one architectural register is read many times before
> being overwritten (such as the stack pointer), as it will tend to steer instructions to half of
> the dispatch buffers.
> • Which port should execute branches and stores is not well defined. Loads can surely go to the
> ports with one source and one destination operand, and that will roughly match the ratio for most
> workloads. On the other hand, if branches and stores are both issued through the two-source-operand
> port, they will become a bottleneck. And if branches are issued alongside ALU instructions (through
> the two-source, one-destination ports), then those will become the bottleneck.
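The hot-register imbalance is easy to demonstrate. In this sketch (my own encoding of the 8 buffers as src1/src2/dest bank bits, which is an assumption), once the stack pointer is renamed to an odd physical register, every instruction reading it as src1 can only reach the 4 buffers whose src1 bank is odd:

```python
# Illustrative sketch (my own) of the dispatch-buffer imbalance caused by a
# hot architectural register. Buffer encoding is assumed, not from the post.

def bank_of(phys_reg):
    return phys_reg % 2  # 1 = odd bank, 0 = even bank

def dispatch_buffer(src1, src2, dest):
    """One buffer per (src1 bank, src2 bank, dest bank) combination."""
    return bank_of(src1) * 4 + bank_of(src2) * 2 + bank_of(dest)

def buffers_used(instructions):
    """Set of dispatch buffers touched by (src1, src2, dest) tuples."""
    return {dispatch_buffer(s1, s2, d) for s1, s2, d in instructions}

# Stack pointer renamed once to physical register 9 (odd bank), then read
# as src1 by every instruction before being overwritten.
sp = 9
skewed = [(sp, s2, d) for s2 in (2, 3) for d in (4, 5)]
```

Here `buffers_used(skewed)` is confined to buffers 4 through 7: half the buffers (and the issue queues behind them) sit idle for as long as the pattern lasts, which is exactly the imbalance the post warns about.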

> In conclusion, this design is not perfect. But it enables high execution bandwidth without the
> cost, and with few constraints. One downside of traditional clustered architectures is the
> non-uniform wake-up and forward delay between physical register file banks. For those designs,
> the hardware must ideally and intelligently split the workload across the two clusters, such that
> little traffic exists between them. Here, there is no such delay, though the dispatch constraints
> can be considered high. The design is not intended to solve the instruction fetch, decode, and
> load-pipe issues that arise with wide superscalar processors either.
>
> Hugo Décharnes.

I think the Alpha 21264 scheme sounds rather similar, except it wasn't random: it depended on whether the original register number was odd or even.