By: , May 30, 2013 8:48 am
Room: Moderated Discussions
Ricardo B (ricardo.b.delete@this.xxxxx.xx) on May 30, 2013 8:05 am wrote:
> Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 29, 2013 10:16 pm wrote:
>
> > - Good to know that I have Haswell's scheduling system basically understood. However, I have one question;
> > Haswell has four ports, yet somewhere in Haswell's article, it said that the scheduler can dispatch 8
> > uops per clk in ideal conditions, whilst Sandy Bridge, with its 6 ports, was only allowed 6 uops per
> > clk in ideal conditions. So I understand most of what is being said, but what I don't understand is that
> > 4 ports = 8 uops and 3 ports = 6 uops. It seems that 2 uops can be sent through one port per clk? Though
> > I don't believe this is mentioned anywhere in the article. This confuses me; if you don't mind, could
> > you please tell me why it seems that the ports have double the width they are made out to have?
>
> Maybe it's not clear from David's diagrams, but it's explicitly written in the text: Haswell has 8 ports.
>
>
> >
> > - Your Silvermont link gives me a good idea more or less
> > of what's going on. If I have a decent understanding
> > of things, though some of the things you say confuse me a bit. What I understand from this diagram is that:
> > The data prefetcher requests data, and it looks for this data in the L2 cache/L3 cache/RAM using the DTLB,
> > where this data flows over to the store buffers and into the data cache for the execution units to use?
> >
> > A few things confuse me; and sorry for the mass of questions, I appreciate the help.
> > - You say that only the path from L1 cache to the registers is shown, yet I
> > see a path from L2 to registers. Am I wrong somewhere? Also, I do not see the
> > RSV stations in the core diagram; perhaps they are omitted to save space?
> > - Is it me, or does the labelled store buffer actually serve
> > the function of a load buffer and a store buffer?
> > - The AGU generates tags for the location of where data is stored in the caches
> > and memory for use in the DTLB by flowing completed instructions to the ROB to notify
> > the TLB of changes in data position or state. Is my understanding correct?
>
> Wow. Nope.
> First, put aside the prefetcher and the store buffers.
>
> The AGU computes a _logical address_.
> Then the DTLB computes a _physical address_, along with other bits
> of information, by performing a look up in the page table.
> All that is fed to the L1 D$.
> If it doesn't have the cache line, the L1 D$ will fetch it from the L2 $.
> If L1 D$ does have the line, it will make the requested data available.
> (I really see no path between the registers and the L2 $, by the way)
>
> The store buffer is used to buffer stores, before they're actually committed to a cache line.
> The memory logic also checks the contents of the store buffer when performing a load.
>
> Whether it comes from the L1 data cache or the store buffer, information is then routed to the registers
> via some data path (the thick line going to the Integer Rename Buffers in Extremetech's diagram).
>
> The prefetcher is an auxiliary, which tracks memory access patterns and prefetches
> cache lines which haven't been requested but (it hopes) will be needed soon.
>
Thanks again as always!
- Oh, alright, I understand about Haswell's port count now. I'm a very visual learner, and take diagrams quite literally (perhaps this is a flaw of mine; reading too far into them). If I were to add these four additional ports to the diagram, would they line up alongside the other ports? Or be in entirely different locations?
- So it seems to me that the AGU almost computes a "tag" for the CPU to use to refer to the actual data entries, whilst the DTLB holds the actual physical location of these data entries? Or do I still have something wrong? The AGU seems very ambiguous to me... Or perhaps the AGU is the unit that requests data to be fed to the other execution units? Receiving instructions from the scheduler to request certain data entries, with the request moving on to the DTLB and then to the L1/L2?
- The store buffer is still very unclear to me. Your explanation probably makes perfect sense; it's probably just me who doesn't understand. What I don't understand is: where do these store requests come from? I understand that the store buffer holds data before it is committed to the data caches, but where does this data come from? The execution units? If so, why would it be redirected back to the data caches if it was already "done" with?
- So the prefetcher sort of acts like the data part of a loop detector? Or does it track other patterns?
Thanks for bearing with me, really appreciate it as always. Learning plenty!
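
For what it's worth, here is a rough Python sketch of the load/store flow described in the quoted reply: the AGU does plain arithmetic to get a logical address, the DTLB turns that into a physical address, a load checks the store buffer first, then the L1 D$, and only goes to the L2 on a miss. All names (agu, Core, page_table and so on) are made up for illustration, the 4 KiB page and 64-byte line sizes are assumptions, and real hardware does this with parallel set-associative lookups rather than dictionaries, so treat it as a toy model only.

```python
PAGE_SIZE = 4096  # assumed page size in bytes
LINE_SIZE = 64    # assumed cache line size in bytes


def agu(base, index, scale, disp):
    """AGU: pure arithmetic that yields a logical (virtual) address."""
    return base + index * scale + disp


class Core:
    """Toy model of the memory path; every name here is hypothetical."""

    def __init__(self, page_table, memory):
        self.page_table = page_table  # virtual page number -> physical page number
        self.memory = memory          # physical address -> value (stands in for L2 and beyond)
        self.dtlb = {}                # cached translations
        self.l1 = {}                  # physical line number -> {address: value}
        self.store_buffer = []        # (physical address, value), oldest first

    def translate(self, vaddr):
        """DTLB: logical address -> physical address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.dtlb:      # DTLB miss: consult the page table
            self.dtlb[vpn] = self.page_table[vpn]
        return self.dtlb[vpn] * PAGE_SIZE + offset

    def _fill(self, line):
        """L1 miss: fetch the whole line from the next level."""
        start = line * LINE_SIZE
        self.l1[line] = {a: self.memory.get(a, 0)
                         for a in range(start, start + LINE_SIZE)}

    def store(self, vaddr, value):
        """A store from the execution units waits in the store buffer."""
        self.store_buffer.append((self.translate(vaddr), value))

    def load(self, vaddr):
        """Return (value, source) for a load."""
        paddr = self.translate(vaddr)
        # The youngest matching buffered store wins over the cache.
        for addr, value in reversed(self.store_buffer):
            if addr == paddr:
                return value, "store buffer"
        line = paddr // LINE_SIZE
        source = "L1"
        if line not in self.l1:
            self._fill(line)
            source = "L2"
        return self.l1[line][paddr], source

    def drain(self):
        """Commit buffered stores to their cache lines, oldest first."""
        for paddr, value in self.store_buffer:
            line = paddr // LINE_SIZE
            if line not in self.l1:
                self._fill(line)
            self.l1[line][paddr] = value
        self.store_buffer.clear()


core = Core(page_table={0: 5}, memory={5 * PAGE_SIZE + 16: 111})
addr = agu(base=0, index=2, scale=8, disp=0)
print(core.load(addr))   # first touch misses the L1
print(core.load(addr))   # the line is now resident
core.store(addr, 222)
print(core.load(addr))   # forwarded from the store buffer
core.drain()
print(core.load(addr))   # the committed value now comes from the L1
```

The model leaves out the prefetcher and write-back to the L2 entirely. The point is only that the AGU output is an address rather than a tag, and that stores produced by the execution units sit in the store buffer until they are committed to a cache line.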