Very-large superscalar execution without the cost

By: Hugo Décharnes, August 18, 2021 10:34 am
Room: Moderated Discussions
Following a previous thread on register banking, I would like to share a micro-architecture I have been thinking about for quite some time.

When widening the execution bandwidth of an out-of-order processor, some structures become contended:
  • Register renaming logic: for each renamed instruction, a source operand's physical register tag can come from another instruction being renamed (older in program order). This is even worse with move elimination: an instruction's destination physical register tag can come from its own source operand, and is then driven as input to the source operand collection of younger instructions, thus creating a dependence chain. (This is why the move elimination depth is often limited.)
  • Instruction dispatch logic, which has to steer instructions to a higher number of issue queues while attempting to balance them.
  • Instruction wake-up circuitry and operand forwarding paths, whose high fan-outs threaten frequency and place & route.
  • Register file read and write port logic, which increases the access time.

There are, of course, also the decode stage and load data paths, which become contention points. However, the present micro-architecture does not address them.

    Here find the >>> link <<< to the image depicting the micro-architecture.

    It relies on a banked register file that is architecturally invisible. Unlike most clustered designs, where the wake-up latency is longer when an instruction depends on another one executed on the opposite cluster, here the latency is uniform.
    The physical register file banks are named "O" for odd and "E" for even. In the depicted design, there are 6 pairs of rename lanes. Each rename lane can handle a micro-op with up to two source operands and one destination operand. Each cycle, six free physical registers are selected from the odd bank and six from the even one, and one of each is given to each pair of rename lanes. (This rule is important, as it eases dispatch later on.) Within a pair, which lane receives the odd register and which the even one is chosen randomly, to avoid bandwidth degradation due to specific patterns.
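The allocation rule above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the post: free lists are plain Python lists, a register's bank is taken to be the parity of its tag, and the function name `allocate_destinations` is hypothetical.

```python
import random

NUM_PAIRS = 6  # pairs of rename lanes, as in the depicted design


def allocate_destinations(free_odd, free_even, rng=random):
    """Give each pair of rename lanes one odd and one even free register.

    free_odd / free_even are lists of free physical register tags
    (assumed representation). Returns NUM_PAIRS tuples
    (tag_for_lane_0, tag_for_lane_1); which lane of the pair receives
    the odd tag is chosen at random.
    """
    if len(free_odd) < NUM_PAIRS or len(free_even) < NUM_PAIRS:
        raise ValueError("not enough free registers in one bank")
    pairs = []
    for _ in range(NUM_PAIRS):
        odd, even = free_odd.pop(), free_even.pop()
        # Random coupling: either lane may end up with the odd register.
        pairs.append((odd, even) if rng.random() < 0.5 else (even, odd))
    return pairs
```

Whatever the random choice, every pair always holds exactly one destination per bank, which is the property the dispatch stage relies on.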
    As each pair of rename lanes can output at most two instructions, each bound exclusively to a different bank through its destination physical register, we know that at most 6 instructions will write the odd register file bank and at most 6 the even one. Instructions are thus selectively presented to one of the two 6-wide dispatch lanes. Each dispatch lane feeds 4 of the 8 dispatch buffers, each buffer handling a unique combination of banks for the two source operands and the destination operand of the instructions it can receive.
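To make the "unique combination of banks" concrete, here is a minimal sketch of how a dispatch buffer could be selected. The bit layout is an assumption for illustration only (bank taken as tag parity, destination bank in the top bit); the post does not specify an encoding, and `dispatch_buffer` is a hypothetical name.

```python
def dispatch_buffer(src1, src2, dest):
    """Return the index (0..7) of the dispatch buffer for one micro-op.

    Arguments are physical register tags; the bank is assumed to be the
    low bit of the tag (1 = odd bank, 0 = even bank). Bit 2 of the result
    is the destination bank, bits 1 and 0 are the two source banks, so the
    odd-destination dispatch lane only reaches buffers 4..7 and the
    even-destination lane only reaches buffers 0..3.
    """
    return ((dest & 1) << 2) | ((src1 & 1) << 1) | (src2 & 1)
```

With this layout, the three bank bits give 2 x 2 x 2 = 8 buffers, and fixing the destination bank leaves the 4 buffers a given dispatch lane can feed.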
    Dispatch buffers act as FIFOs. Their goal is, as in many designs, to absorb and smooth the high and irregular dispatch bandwidth (up to 6 instructions targeting one dispatch buffer) ahead of the issue queues. They also provide (relatively) low-cost storage to maximize the issue queues' fill ratio. Instructions are finally dispatched, in order, to one of two issue queues.
    Each issue queue tracks up to two source operands and one destination operand, again each bound to a physical register file bank. Each one has two issue ports: one targeting an execution unit for instructions with two source operands and one destination operand, and one either for instructions with one source and one destination operand (such as loads) or for instructions with two source operands (such as stores). The pattern is such that two neighboring issue queues share an execution unit. This permits great balancing between the two issue queues fed by the same dispatch buffer.
    Finally, the instruction to be executed is granted from one of the two issue queues that have selected an instruction for that unit.
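The post does not say how the grant between the two issue queues is decided. As one possible illustration, the sketch below assumes a simple alternating (round-robin) policy when both queues request the shared unit; the class name `SharedUnitArbiter` and the policy itself are assumptions, not part of the design as described.

```python
class SharedUnitArbiter:
    """Grants a shared execution unit to one of two neighboring issue queues.

    Assumed policy: if only one queue has selected a micro-op, it wins;
    if both have, the queue that did not win the previous conflict wins.
    """

    def __init__(self):
        self.last = 1  # so that queue 0 wins the very first conflict

    def grant(self, req0, req1):
        """req0 / req1: selected micro-op of each queue, or None.

        Returns (queue_index, micro_op), or None if nobody requests.
        """
        if req0 is None and req1 is None:
            return None
        if req0 is not None and req1 is not None:
            winner = 1 - self.last  # conflict: alternate between queues
        else:
            winner = 0 if req0 is not None else 1
        self.last = winner
        return winner, (req0, req1)[winner]
```

An age-based policy (oldest micro-op first) would be an equally plausible choice; the alternating one is used here only because it is the shortest to write down.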

    This micro-architecture provides some benefits:
  • A high execution bandwidth can be reached with half the number of read and write ports on each physical register file bank, and half the number of wake-up and forwarding paths, compared with what would be needed without banking.
  • It ensures ease of dispatch, and relatively high occupancy in issue queues.

    It has, however, many shortcomings:
  • Though the rename bandwidth can be changed (while still being a multiple of 2), the number of dispatch buffers and issue queues, and number and types of execution units cannot.
  • It does not work for instructions that have three or more source operands, or two or more destination operands.
  • Instructions have no freedom in which dispatch buffer they go to, and nearly none in the issue queue. This can lead to imbalance if one architectural register is read many times before being overwritten (such as the stack pointer), as it will tend to move instructions to half of the dispatch buffers.
  • Which port should execute branches and stores is not well defined. Loads can surely go on the ports with one source and one destination operand, and that'll roughly match the ratio for most workloads. On the other hand, if branches and stores are issued together through the two-source-operand ports, then they'll become a bottleneck. And if branches are issued alongside ALU instructions (on the ports with two source operands and one destination operand), then they'll become the bottleneck.
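The imbalance caused by a frequently read register (the stack pointer case in the list above) can be illustrated with a toy experiment. Everything here is assumed for illustration: 128 physical registers, bank given by tag parity, uniformly random operands, and the hypothetical helper names `dispatch_buffer` and `buffer_histogram`.

```python
import random
from collections import Counter

NUM_PHYS_REGS = 128  # assumed size, not taken from the post


def dispatch_buffer(src1, src2, dest):
    """Buffer index from operand banks (bank assumed to be tag parity)."""
    return ((dest & 1) << 2) | ((src1 & 1) << 1) | (src2 & 1)


def buffer_histogram(n, fixed_src1=None, seed=0):
    """Count how many of n random micro-ops land in each dispatch buffer.

    If fixed_src1 is given, every micro-op reads that physical register
    as its first source, mimicking a long-lived stack pointer mapping.
    """
    rng = random.Random(seed)
    hist = Counter()
    for _ in range(n):
        src1 = fixed_src1 if fixed_src1 is not None else rng.randrange(NUM_PHYS_REGS)
        src2 = rng.randrange(NUM_PHYS_REGS)
        dest = rng.randrange(NUM_PHYS_REGS)
        hist[dispatch_buffer(src1, src2, dest)] += 1
    return hist
```

Calling `buffer_histogram(1000, fixed_src1=2)` only ever populates four of the eight buffers, since the first source bank is pinned, whereas `buffer_histogram(1000)` spreads over all of them.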

    In conclusion, this design is not perfect. But it enables high execution bandwidth without the cost, and with few constraints. One downside of traditional clustered architectures is the non-uniform wake-up and forwarding delay between physical register file banks. For those designs, the hardware must ideally split the workload intelligently across the two clusters, such that little traffic exists between them. Here, there is no such delay, though the dispatch constraints can be considered high. The design is also not intended to solve the instruction fetch, decode and load-pipe issues that arise with wide superscalar processors.

    Hugo Décharnes.