By: rwessel (robertwessel.delete@this.yahoo.com), May 31, 2013 8:53 pm
Room: Moderated Discussions
Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 31, 2013 9:04 pm wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on May 31, 2013 3:08 pm wrote:
> > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 31, 2013 2:22 pm wrote:
> > > Ricardo B (ricardo.b.delete@this.xxxxx.xx) on May 31, 2013 12:22 pm wrote:
> > > > Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 31, 2013 6:59 am wrote:
> > > >
> > > > > The array example definitely helped. Though one more about AGUs; say an AGU is given an
> > > > > instruction to calculate the virtual address of (array+4), and it does so successfully.
> > > > > Where would the result of the virtual address be stored, and how would it be used?
> > > >
> > > > Usually, the virtual address calculated by the AGU is forwarded directly
> > > > to the DTLB and L1 D$ to be used as an address for a store/load.
> > > >
> > > > It's also possible to store the address calculated by the AGU into a register for other uses.
> > > >
> > > > > - Sorry, I still don't quite understand how multi-threading works. If there are two programs,
> > > > > each using one thread, how does a single execution unit perform as two?
> > > >
> > > > For example, while an Ivy Bridge CPU can in theory sustain 6 µOPs per clock, in practice most
> > > > software runs at ~1 µOP per clock, due to instruction dependencies and other things.
> > > >
> > > > There is thus lots of free time to execute instructions for a second thread.
> > > >
> > > >
> > >
> > > Thanks again for the reply!
> > >
> > > - Oh, so the AGU is sort of like a "decoder" for store and load operations? So hopefully my understanding
> > > is correct now;
> >
> > It's not a decoder. It calculates the address, based on the instructions used in the code. Instructions
> > have very different addressing modes, and the AGU needs to be able to handle all of them. Most address
> > calculations are simple, but they can be quite complex, involving multiplication and addition.
> >
> > > The scheduler gives a location for the AGU to decode, which can either be simple
> > > (EDX1) or complex (EDX2+2/5^65%4),
> >
> > It's specified by an instruction, not the scheduler. It might get held in the scheduler temporarily.
> > Honestly, you might find it beneficial to read some of my other
> > articles, such as http://www.realworldtech.com/barcelona/
> > they have a good narration of the pipeline.
> >
> > >and once it figures out what the virtual address of this location
> > > is, it sends the request to the DTLB, which performs a look up of this virtual address, which finds
> > > where the data is,
> >
> > That's correct, the DTLB converts from virtual to physical addresses.
> >
> > >and then requests it from wherever it is through the caches into the data cache
> > > for the execution units to utilize? Hopefully I've gotten it right by now...
> >
> > It requests that address from the cache and loads it into a register. Which is why we call it a load.
> >
> > Stores work rather differently, since it is moving data from a register to the memory.
> >
> > David
>
> Hello David, thanks for taking the time to reply to my post; I definitely appreciate it!
>
> - I know that the AGU is not a decoder; though for some reason, my mind likes to think of the equation
> that the virtual addresses come in are "in code" and that the AGU clarifies where the destination really
> is. Though, to my understanding; it is really like an ALU that takes the request for a virtual address
> and then clarifies where this address is (not physically, but virtually.) Hopefully this is correct.
>
> - Ah, to clarify what I was trying to say previously; I mean to say that the instruction declares what
> operands it needs, and requests the operands needed by the AGU, where the AGU calculates the requests
> for these operands, and then the DTLB looks up the virtual address to find out where the data is in
> physical memory, (and here is where my understanding gets shaky) which once the address is found (through
> whatever cache escalations is needed), the operands are sent through the load/store units, through
> the AGU and back into the scheduler for the instruction to be passed onto the ALU to calculate? Or
> at least that's how it looks on Barcelona; though I'm sure I'm wrong here; why would operands be
> sent through an AGU?
The operand itself never goes through the AGU. Rather, the generated address is passed to the load/store unit (often the AGU is part of it), and once the load completes, the operand is forwarded to the execution unit where the dispatched instruction is waiting for it. Exactly how that happens depends a great deal on the microarchitecture. On a simple in-order design, the pipeline may simply stall waiting for the load unit to present the single outstanding operand. In an OoO design, a rather more complicated bypass network exists to get the operand to the instruction (micro-op) needing it, wherever it happens to be waiting. Once the instruction has all of its operands, it can be executed.
Things are a bit different on the store side, as the operand is usually available pretty early (it's a store, after all; you already have the operand), and the store, along with the address, can be pushed quickly into the store buffer, which can then complete independently of the instruction stream. A critical issue is maintaining a coherent and properly sequential view of memory even if the actual (physical) stores and loads are not happening in the architected order. It's not so bad within a single processor, since the store buffer can watch other memory accesses and jump in when it has a pending store. But since memory accesses are visible to other processors (and I/O devices), great care must be taken to ensure that those other devices only see memory accesses in the architected order, or you'll break every multithreaded program in sight.