By: EduardoS (no.delete@this.spam.com), April 26, 2012 6:18 pm
Room: Moderated Discussions
Exophase (exophase@gmail.com) on 4/26/12 wrote:
---------------------------
>The point is that it can't be a very reliable document if it omits something so fundamental.
Indeed, AMD docs used to be better.
>Agner Fog's documentation indicates that the AGUs can perform loads without issuing
>to the EX ports, and David said that he doesn't see why this would not be the case
>- can you think of anything, outside of the questionable SOG suggesting it?
It's contradictory:
1) If there were no AGUx-to-AGUx forwarding, then L1 latency tests (the usual pointer-chasing kind; a sketch follows below) would be hit by the penalty and report more than 4 cycles, but they don't;
2) Maybe the latency is really 3 cycles and they report 4 because of the no-forwarding penalty, but then load-execute instructions, which do get forwarding, would take just 4 cycles; IIRC they take 5. Agner omits this part.
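Just to be concrete about what I mean by an L1 latency test, here is a minimal pointer-chasing sketch in C (the array size, iteration count and the __rdtsc()-based timing are illustrative assumptions on my part; rdtsc counts reference cycles, so in practice you would pin the clock or convert):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc() */

#define N     1024            /* 1024 * 8 bytes = 8KB, fits in any L1 */
#define ITERS 100000000ULL

int main(void)
{
    /* Build a cyclic chain of pointers so every load depends on the
       previous one; the time per iteration is then the load-to-use
       latency (simple [reg] addressing, nothing exotic on the AGU side). */
    void **chain = malloc(N * sizeof(void *));
    for (size_t i = 0; i < N; i++)
        chain[i] = &chain[(i + 1) % N];

    void **p = chain;
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++)
        p = (void **)*p;              /* serialized dependent loads */
    uint64_t end = __rdtsc();

    printf("%.2f cycles per load (p=%p)\n",
           (double)(end - start) / ITERS, (void *)p);
    return 0;
}

If loads really had to go through the EX ports with no AGU-to-AGU forwarding, a loop like this, where each load's result feeds the next load's address, is exactly where the extra cycles would show up.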
>Or maybe by memory subsystem you mean the load/store units themselves? Because
>that is indeed where BD has improved. If this is what you mean then I think we have a communication problem.
I mean the load/store units, L1, L2, L3, memory, prefetchers, and everything that goes in between; everyone else seems to just look at L1 size, L2 latency, L3 latency and memory latency.
>But why would the performance impact be limited to
>only the execution width or the memory subsystem and never both?
Well... There are some integer workloads where BD is actually faster than GH by a significant amount; where do those performance gains come from?
These workloads happen to be memory intensive...
>Do you really think that going from ~12 to ~20 cycles L2 latency while not increasing
>the clock speeds nearly as much carries zero penalty? And that going from 64KB L1
>at 3 cycles to 16KB L1 at 4 cycles, albeit more associative, isn't going to hurt
>some workloads?
Because it is not just "small L1 and slow L2":
1) Yes, the frequency didn't improve much over GH45, but that was not because of the uarch; look at Llano's frequency. The uarch itself allows a much higher clock, and all those latencies paid for it. Sure, the end result is that the end product's performance sucks, but that says nothing about the uarch improvements: put the previous memory subsystem on such a slow process and the end product would suck even more.
2) There are more than just two metrics. Even with a small L1 the cache hit rate is between 95% and 99%; if we sized the L2 to meet that same hit rate at the second level (local hit rate = 1 - L2 misses / L1 misses), the cache would have to be very big. The 256KB behind a 32KB L1 on Intel, or 512KB behind a 64KB L1, won't do it; they will be more in the 80%-95% range, since the size difference between L1 and L2 isn't that big.
Now, the penalty of an L1 miss is not just the latency but what happens in the core when the miss occurs: will it flush the pipeline? Reissue a lot of instructions? Halt? Immediately execute independent instructions? The behaviour varies a lot here and the details are hard to find by testing. K8 was pretty bad at handling L1 misses, it was almost like a halt. On the Intel side, they reduced L2 size and latency after implementing SMT; SMT should make a core perform like two cores on workloads with very low IPC that are dominated by cache/memory latency, but that was not the case with Nehalem, so maybe L1 misses weren't perfectly dodged by SMT. (The AMAT sketch after this list shows what a latency-only model captures; the miss-handling behaviour is exactly what it leaves out.)
Since there are other penalties for an L1 miss besides the latency, and other methods to reduce those penalties, without knowing how BD behaves when an L1 miss happens it is impossible to say precisely how hard the 20-cycle L2 latency strikes; apparently it doesn't have to hit the red button on every L1 miss.
Oh, and I have actually tested every processor cited in item 2 except BD, and the one I know best is K8.
3) There is an upside to that L2, though: the size. It is 20 cycles and that's that, because L2 misses are much rarer. Neither K8/GH nor BD has an instruction window big enough to hide memory latency (see the back-of-the-envelope numbers below), so while an L1 miss may not mean a halt, an L2 miss does.
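To make the point in item 2 concrete, here is what a pure latency model says; the miss-handling behaviour is precisely what it cannot capture. All the hit rates and the memory latency below are illustrative assumptions on my part, not measurements (the cache latencies are the ~3/~12 vs ~4/~20 cycle figures quoted above):

#include <stdio.h>

/* Naive AMAT (average memory access time) model: latencies and hit rates
   only, no overlap, no prefetch, no miss-handling behaviour -- i.e. it
   ignores exactly the effects discussed in item 2.  Numbers are assumed. */
static double amat(double l1_lat, double l1_hit,
                   double l2_lat, double l2_local_hit, double mem_lat)
{
    return l1_lat + (1.0 - l1_hit) *
           (l2_lat + (1.0 - l2_local_hit) * mem_lat);
}

int main(void)
{
    double mem = 200.0;   /* assumed memory latency in core cycles */

    /* GH-like: 64KB 3-cycle L1, ~12-cycle but smaller L2 (lower local hit
       rate); BD-like: 16KB 4-cycle L1, ~20-cycle but bigger L2. */
    printf("GH-like AMAT: %.2f cycles\n", amat(3.0, 0.98, 12.0, 0.85, mem));
    printf("BD-like AMAT: %.2f cycles\n", amat(4.0, 0.96, 20.0, 0.92, mem));
    return 0;
}

Shuffle the assumed hit rates around and the gap moves a lot, which is why you can't judge the 20-cycle L2 from the latency alone.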
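And for item 3, the back-of-the-envelope version of why the instruction window can't cover a memory access (again, round assumed numbers, not measurements):

#include <stdio.h>

int main(void)
{
    /* Assumed round numbers. */
    double mem_latency = 200.0;  /* last-level miss to DRAM, in cycles   */
    double ipc         = 2.0;    /* optimistic sustained IPC             */
    int    window      = 128;    /* assumed order of magnitude for the   */
                                 /* instruction window of these cores    */

    /* Independent work needed in flight to fully cover one miss. */
    double needed = mem_latency * ipc;

    printf("Would need ~%.0f instructions in flight, window holds %d,\n"
           "so on an L2 miss the core just sits there.\n", needed, window);
    return 0;
}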