By: rwessel (robertwessel.delete@this.yahoo.com), June 3, 2013 12:09 am
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 31, 2013 10:59 pm wrote:
> rwessel (robertwessel.delete@this.yahoo.com) on May 31, 2013 9:53 pm wrote:
> > The operand itself would never go through the AGU, rather the generated address is passed to the load/store
> > unit (often the AGU is part of that), and once the load
> > completes, the operand is forwarded to the execution
> > unit where the dispatched instruction is waiting for it. Exactly how that happens depends a great deal
> > on the microarchitecture. On a simple in-order design the pipeline may simply be stalled waiting for the
> > load unit to present the single outstanding operand. In an OoO design, a rather more complicated network
> > will exist to get the operand to the instruction (micro-op) needing it, wherever it happens to end
> > up waiting. Once the instruction has all of its operands, it can then be executed.
> >
> > Things are a bit different on the store side, as the operand
> > is usually available pretty early (it's a store,
> > after all, you already have the operand), and the store,
> > along with the address, can be pushed quickly into
> > the store buffer, and that can then complete independently of
> > the instruction stream. A critical issue is maintaining
> > a coherent and properly sequential view of memory even if
> > the actual (physical) stores and loads are not happening
> > in the architected order. It's not so bad within a single
> > processor, since the store buffer can watch for other
> > memory accesses, and jump in when it has a pending store.
> > But since memory accesses are visible to other processors
> > (and I/O devices), great effort must be taken to ensure
> > that those other devices only see memory accesses in
> > the architected order, or you'll break every multithreaded program in sight.
>
> Thanks for the reply!
>
> - So there is a load/store unit that is included in the AGU's in the diagrams? Operands flow through these
> units? That makes sense; but one thing: how do these load/store units get the operands to the execution
> units? Are they directly linked or do they go through one of the other scheduler/buffers above?
>
> - Ah yes, stores seem to be rather straight forward considering the reasons you stated.
> Though just like the question above; how does the finished result of the instruction
> get to the store unit to store the result? Directly linked? Or another method?
The load/store units *are* types of execution units - they happen to handle loads and stores to memory, rather than, say, arithmetic operations. Instructions get to them because the dispatcher sends them there. The store unit gets its operands just like any other execution unit does, and the load unit sends results to registers, just like an arithmetic unit that generates results does.
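As a rough sketch of that dispatch step (the unit names and micro-op format here are invented for illustration, not any real microarchitecture), you can picture the dispatcher as routing each micro-op to a unit that handles its kind:

```python
# Toy dispatcher: routes micro-ops to execution units by kind.
# Unit names and the micro-op tuple format are made up for illustration.

UNITS = {
    "load":  "load/store unit",
    "store": "load/store unit",
    "add":   "ALU",
    "mul":   "ALU",
}

def dispatch(uop):
    """Return the name of the execution unit a micro-op is sent to."""
    kind = uop[0]
    return UNITS[kind]

print(dispatch(("load", "r1", "[r2+8]")))   # -> load/store unit
print(dispatch(("add", "r3", "r1", "r4")))  # -> ALU
```

The point is just that a load or store is dispatched the same way an add is; the load/store unit is one more destination in the same scheme.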
Nominally, when an instruction is issued OoO before its operands are ready, what it's waiting on is a prior instruction writing its result to a register, which it can then read from the register file. The many rename registers on modern OoO processors are part of the mechanism for executing around dependencies caused by register reuse. But even that isn't quite enough: the path between a result being stored into a register and a dependent instruction reading that result from the register file is far too long in most cases, so most fast processors (and not just OoO ones) implement a forwarding network that allows results to be transmitted from one unit to another directly, in parallel with the update of the register.
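A toy timing model makes the benefit of forwarding concrete. The stage latencies below are invented; real pipelines differ, but the shape of the saving is the same:

```python
# Toy model of result forwarding (bypassing). Cycle numbers are
# invented for illustration. The producer's result physically exists
# at the end of its EX stage, but is only visible in the register
# file after writeback (WB). A consumer reading only the register
# file waits for WB; a forwarding network hands it the value
# straight out of EX.

EX_DONE = 3   # cycle in which the producer's result exists in EX
WB_DONE = 5   # cycle in which the result is visible in the register file

def consumer_start(forwarding):
    """First cycle the dependent instruction can begin executing."""
    return EX_DONE + 1 if forwarding else WB_DONE + 1

print(consumer_start(forwarding=False))  # 6: must wait for the register write
print(consumer_start(forwarding=True))   # 4: value bypassed directly from EX
```

With these made-up latencies the bypass saves two cycles on every back-to-back dependent pair, which is why essentially all fast pipelines pay for the forwarding wires and muxes.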
In the case of non-RISC machines, the operation of the load and store units is complicated by the fact that there are operations that are read-modify-write in nature. Exactly how those are handled is very dependent on the microarchitecture, but the earlier implementations all broke operations like "add memory,1" into several micro-ops (perhaps a load, an add and a store); more complex designs can compress that into fewer operations. Instructions that get split into multiple micro-ops need special handling on the back end, as you cannot architecturally take an exception in the "middle" of an instruction (at least on most machines).
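The cracking of a read-modify-write instruction can be sketched like this (a minimal model with an invented micro-op format and temp-register name; real crackers are far more involved):

```python
# Sketch: cracking "add [addr], imm" into three micro-ops
# (load, add, store) and running them against a toy memory.
# The "tmp" register and tuple encoding are hypothetical.

def crack_rmw(addr, imm):
    """Split a memory-destination add into load/add/store micro-ops."""
    return [("load",  "tmp", addr),         # tmp <- mem[addr]
            ("add",   "tmp", "tmp", imm),   # tmp <- tmp + imm
            ("store", addr,  "tmp")]        # mem[addr] <- tmp

def run(uops, mem):
    """Execute micro-ops in order against a dict-based memory."""
    regs = {}
    for op in uops:
        if op[0] == "load":
            regs[op[1]] = mem[op[2]]
        elif op[0] == "add":
            regs[op[1]] = regs[op[2]] + op[3]
        elif op[0] == "store":
            mem[op[1]] = regs[op[2]]
    return mem

mem = {0x100: 41}
run(crack_rmw(0x100, 1), mem)
print(mem[0x100])  # 42
```

Note what the exception problem looks like in this model: if the store micro-op faulted after the add retired, memory and the architectural state would disagree mid-instruction, which is exactly why split instructions need back-end handling to retire (or fault) as a unit.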