By: , May 29, 2013 9:16 pm
Room: Moderated Discussions
Ricardo B (ricardo.b.delete@this.xxxxx.xx) on May 29, 2013 2:41 pm wrote:
> Sebastian Soeiro (sebastian_2896.delete@this.hotmail.com) on May 28, 2013 4:27 pm wrote:
>
> > not being totally optimal; though, to my understanding, doesn't Haswell go the traditional route
> > of using the Unified Scheduler to send the instructions through ports directly to the ALU functions?
>
> Yes, it does.
> > I mean, I'm sure it's more than robust enough to have at least one instruction to run through at least
> > most of the time; but can't it only dispatch one instruction per clk? Or can it send dispatches
> > through any combination of open ports the second instructions become ready?
>
> The latter. It can send an instruction to each port every clock.
> As long as there are instructions ready to be executed and suitable ports, of course.
>
> > The load unit? I've looked over the Silvermont article and found little mention of it-- maybe I'm
> > skipping over it or my brain just isn't comprehending it under a different name or something... Could
> > you please point me in the right direction as to where I could learn more about the load unit?
>
> Sorry, it's not named "load unit".
> It's the logic blocks below the 6-entry memory reservation station.
>
> Here's a diagram which may be a bit clearer for this purpose:
> http://www.extremetech.com/wp-content/uploads/2013/05/silvermont-core-block-diagram.jpg
>
> It only shows the path from the L1 data cache to the integer register
> renamer, but I guess it also goes to the FP/SIMD registers.
Thanks for the quick reply!
- Good to know that I have Haswell's scheduling system basically understood. However, I have one question: Haswell has four ports, yet somewhere in the Haswell article it said the scheduler can dispatch 8 uops per clk under ideal conditions, whilst Sandy Bridge, with its 6 ports, was only allowed 6 uops per clk under ideal conditions. So I understand most of what is being said, but what I don't understand is how 4 ports = 8 uops and 3 ports = 6 uops. It seems that 2 uops can be sent through one port per clk? Though I don't believe this is mentioned anywhere in the article. This confuses me; if you don't mind, could you please tell me why it seems the ports have double the width they are made out to have?
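To check that I'm picturing the dispatch limit correctly, here's a toy sketch of how I imagine a unified scheduler issuing uops to ports each cycle. It's purely illustrative: the port count and the "any port can run any uop" rule are my own simplifications, not Haswell's actual port map.

```python
# Toy model of a unified scheduler dispatching uops to execution ports.
# Purely illustrative: in real hardware this is parallel logic, not a loop,
# and ports are specialized. The point is just that at most one uop issues
# per port per clock, so peak dispatch per clock equals the number of ports.

NUM_PORTS = 8  # Haswell-style count; a Sandy Bridge-style core would use 6

def dispatch_one_cycle(ready_uops, port_can_execute):
    """Issue at most one ready uop to each port this cycle."""
    issued = {}
    for port in range(NUM_PORTS):
        for uop in list(ready_uops):
            if port_can_execute(port, uop):
                issued[port] = uop
                ready_uops.remove(uop)
                break
    return issued

# In this toy model any port can execute any uop.
ready = [f"uop{i}" for i in range(10)]
print(len(dispatch_one_cycle(ready, lambda port, uop: True)))  # 8: one per port
print(len(ready))  # 2 uops left over; they wait for the next cycle
```

Under that model, 8 ports gives 8 uops per clk without any port being double width, which is why the "4 ports = 8 uops" numbers I remembered confuse me.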
- Your Silvermont link gives me a good idea, more or less, of what's going on. I think I have a decent understanding of things now, though some of what you say confuses me a bit. What I understand from this diagram is: the data prefetcher requests data, which is looked up in the L2 cache/L3 cache/RAM using the DTLB, and that data then flows over to the store buffers and into the data cache for the execution units to use?
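To put that mental model in concrete terms, here's a rough sketch of how I picture a load's data being found, with a simple next-line prefetch bolted on. The two-level hierarchy, the line size, and the prefetch rule are all my own assumptions for illustration, not Silvermont's actual design.

```python
# Rough mental model of where a load's data comes from, plus a trivial
# next-line prefetcher. Entirely illustrative: the hierarchy, the 64-byte
# line size, and the "fetch the next line too" rule are assumptions.

LINE = 64  # bytes per cache line (a typical value, assumed here)

def read(addr, l1, l2, dram):
    """Return the line containing addr, filling L1/L2 on a miss."""
    line = addr - addr % LINE
    if line in l1:
        return l1[line]
    if line not in l2:                       # miss both caches: go to memory
        l2[line] = dram.get(line, b"\x00" * LINE)
        l2[line + LINE] = dram.get(line + LINE, b"\x00" * LINE)  # next-line prefetch
    l1[line] = l2[line]                      # fill L1 on the way back up
    return l1[line]

l1, l2, dram = {}, {}, {0: b"A" * LINE, LINE: b"B" * LINE}
read(0, l1, l2, dram)
print(sorted(l2))   # [0, 64] -- line 64 was prefetched into L2 alongside line 0
```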
A few things confuse me, and sorry for the mass of questions; I appreciate the help.
- You say that only the path from L1 cache to the registers is shown, yet I see a path from L2 to registers. Am I wrong somewhere? Also, I do not see the RSV stations in the core diagram; perhaps they are omitted to save space?
- Is it me, or does the labelled store buffer actually serve the function of both a load buffer and a store buffer?
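For context, here's the picture I have of what a store buffer alone does: pending stores sit in it until they can be written to the data cache, and younger loads check it first (store-to-load forwarding). This is a toy sketch built on my own assumptions, not a claim about what Silvermont's labelled box actually contains; whether that box also tracks loads is exactly what I'm asking.

```python
# Toy store buffer: stores wait here until retirement writes them to the
# data cache, and loads check it first for store-to-load forwarding.
# The structure and method names are mine, purely for illustration.

class StoreBuffer:
    def __init__(self):
        self.pending = []                    # (address, value), oldest first

    def store(self, address, value):
        self.pending.append((address, value))

    def forward(self, address):
        """A load checks the youngest matching pending store first."""
        for addr, value in reversed(self.pending):
            if addr == address:
                return value
        return None                          # no match: the load goes to the cache

    def drain_one(self, cache):
        """At retirement, the oldest store is written to the data cache."""
        if self.pending:
            addr, value = self.pending.pop(0)
            cache[addr] = value

sb, cache = StoreBuffer(), {}
sb.store(0x2000, 7)
print(sb.forward(0x2000))   # 7 -- forwarded straight from the store buffer
sb.drain_one(cache)
print(cache[0x2000])        # 7 -- now visible in the data cache
```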
- My understanding of the AGU is that it generates tags for where data is stored in the caches and memory, for use by the DTLB, by passing completed instructions to the ROB so the TLB is notified of changes in data position or state. Is my understanding correct?
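To make that question concrete, here's the simplified model I'm trying to check against: the AGU computes a virtual address from the instruction's operands, and the DTLB translates it to a physical address. The field names, the page size, and the page-table dictionary are placeholders I made up for illustration, not how Silvermont implements it.

```python
# Rough sketch of address generation plus a DTLB lookup, under my own
# simplified assumptions: the AGU computes base + index*scale + displacement,
# and the DTLB caches virtual-to-physical page translations.

PAGE_SIZE = 4096  # assumed 4 KiB pages

def agu(base, index, scale, displacement):
    """Compute the virtual address from the load/store's operands."""
    return base + index * scale + displacement

def dtlb_lookup(virtual_addr, dtlb, page_table):
    """Translate a virtual address to a physical one via the DTLB."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    if vpn not in dtlb:                  # DTLB miss: walk the page table
        dtlb[vpn] = page_table[vpn]
    return dtlb[vpn] * PAGE_SIZE + offset

dtlb, page_table = {}, {0x10: 0x7F}      # toy mapping: virtual page 0x10 -> frame 0x7F
va = agu(base=0x10000, index=4, scale=8, displacement=0x20)
print(hex(va), hex(dtlb_lookup(va, dtlb, page_table)))   # 0x10040 0x7f040
```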
Thank you for your time; I appreciate the tremendous help.