By: Andrei F (andrei.delete@this.anandtech.com), September 21, 2020 5:50 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on September 21, 2020 1:38 am wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on September 20, 2020 5:34 pm wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on September 20, 2020 10:02 am wrote:
> > > Travis Downs (travis.downs.delete@this.gmail.com) on September 19, 2020 8:26 pm wrote:
> > > > Andrei F (andrei.delete@this.anandtech.com) on September 18, 2020 1:04 am wrote:
> > > > > anon (anon.delete@this.anon.com) on September 17, 2020 7:10 pm wrote:
> > > > > > AnandTech's (SPEC ST performance) review is here: anandtech.com/show/16084/intel-tiger-lake-review-deep-dive-core-11th-gen/8
> > > > > > However not all is good: TigerLake
> > > > > > experiences a noticeable IPC regression compared to IceLake. The memory subsystem is unable
> > > > > > to keep up with the higher clocks, and the reworked cache is not enough.
> > > > > >
> > > > >
> > > > > I just want to add on that sentence as that's not what I wrote
> > > > > in the piece: I don't think the memory subsystem is to blame.
> > > > >
> > > > > It's significantly stronger than ICL and showcases *much* better DRAM latency and significant
> > > > > single core bandwidth uplift. 429.mcf showcases great scaling well beyond clocks, showing
> > > > > that latency for example is not to blame. In my opinion it's a regression *because* of the
> > > > > reworked cache, as essentially the L3 is now 20% slower per clock versus ICL.
> > > >
> > > > You mean L3 latency, right? It might be a part of it, but the regression in libquantum
> > > > and lbm are too large to be explained by this few cycle change, I think. You'd pretty much
> > > > have to write a dedicated L3 latency test to get that big of a drop and IIRC neither of
> > > > those are known to be very dependent on L3 latency (they are more bandwidth heavy).
> > > >
> > > > So I think there's something else more interesting going on there.
> > > >
> > > >
> > >
> > > TGL uncore appears to be inspired SKX, except, hopefully, better latency of LLC misses under light load.
> > > So, may be, it suffers from similarly low single-core bandwidth?
> > >
> >
> > Well Andrei has some detailed bandwidth benchmarks on this page and performance looks
> > better across the board: there's actually a significant bump in L3 and RAM regions.
> >
>
> Yes, it's better than ICL.
> But probably quite a lot worse than desktop SKL. Out of memory ( :-) ), my E-2176G achieves 33-35 GB/s on long
> sequential reads, supposedly similar to Andrei's Vec128 LD test. If I am not mistaken, even i7-6920HQ with DDR4-2133
> that I was playing with couple of years ago, was capable to do 30 GB/s. From raw bandwidth perspective LPDDR4X-4266
> in TGL rig should be equal to DDR4-2133, right? But the end result is somehow 1.5x lower.
> I have no idea what "flip" tests do, so can't compare.
>
> > So I feel like it has to be something more complicated than just worse peak BW: maybe a different
> > way of splitting power between core, uncore and memory? Paul Alcorn from Tomshardware suggested
> > that memory frequency itself can be varied on this part, not sure if that's correct. I don't
> > think any previous Intel part had frequency scaling for the memory bus?
> >
>
>
The flip test is a memory copy test that sits inside a fixed memory region, moving cachelines from one end to the other end, essentially flipping the memory region around on a cacheline block basis.
It's basically the same bandwidth as a traditional memory copy, just with different locality in virtual memory.
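To make the access pattern concrete, here is a minimal sketch of the flip as described above. It is not the actual test code: the function name `flip_region`, the 64-byte line size, and the choice to swap the two ends (rather than copy one way) are my assumptions for illustration, and Python is only used to show the pattern, not to measure bandwidth.

```python
CACHELINE = 64  # bytes; assumed line size, not taken from the test itself


def flip_region(buf: bytearray, line: int = CACHELINE) -> None:
    """Reverse the order of line-sized blocks in buf, in place.

    Block 0 is swapped with the last block, block 1 with the
    second-to-last, and so on. Bytes inside a block keep their order.
    """
    if len(buf) % line != 0:
        raise ValueError("buffer length must be a multiple of the line size")
    view = memoryview(buf)
    n_lines = len(buf) // line
    for i in range(n_lines // 2):
        lo = i * line                  # block near the start of the region
        hi = (n_lines - 1 - i) * line  # mirror block near the end
        tmp = bytes(view[lo:lo + line])
        view[lo:lo + line] = view[hi:hi + line]
        view[hi:hi + line] = tmp
```

Each iteration reads and writes one line at each end of the region, so the traffic per byte is the same as a copy, while the two streams walk towards each other instead of running in parallel.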
---
I did some more characterisations via counters on a 9900K to see where the stress-points are. Essentially the Willow Cove improvements and regressions follow this formula (HPKI and MPKI here are hits and misses per kilo-instruction):
- If the workload has a high HPKI of loads and stores in the L3, but a low MPKI, then the workload sees a large performance improvement due to the much bigger L2 cache, because it previously had a very high L2 miss %.
xalanc and astar follow this behaviour, with high L3 hits but very high L2 misses.
- If the workload has both a high HPKI and MPKI for L3 loads and stores and there's a large % of misses versus hits, then these workloads correspond to the biggest losers for Willow Cove.
https://pbs.twimg.com/media/EiIBUUHWsAMH5Dl?format=png&name=orig
This is essentially all the red workloads.
- The only exception to the above seems to be workloads that are primarily DRAM latency limited and have extremely high memory stall cycles. MCF and omnetpp correspond to this characterisation and on my 9900K have 55.3% and 61.1% stall cycles.
These workloads seem to have very low MLP and are more pointer-chasing in nature, and here Tiger Lake's much better DRAM latency is counteracting any slowdowns on the part of the L3.
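The three rules above can be written down as a rough classifier. This is only a sketch of the reasoning: the function names, the labels, and every threshold (`high_pki`, `high_miss_ratio`, `high_stall`) are hypothetical placeholders I picked for illustration, not values derived from the counter data.

```python
def per_kilo_instructions(events: int, instructions: int) -> float:
    """Convert a raw event count into events per 1000 instructions."""
    return 1000.0 * events / instructions


def classify(l3_hpki: float, l3_mpki: float, mem_stall_fraction: float,
             *, high_pki: float = 5.0, high_miss_ratio: float = 0.5,
             high_stall: float = 0.5) -> str:
    """Apply the three rules of thumb; all thresholds are assumed values."""
    # Exception first: heavily memory-stalled, pointer-chasing style workloads
    # benefit from better DRAM latency regardless of the L3 behaviour.
    if mem_stall_fraction >= high_stall:
        return "dram-latency-bound"
    # Many L3 hits but few L3 misses: the bigger L2 should absorb the hits.
    if l3_hpki >= high_pki and l3_mpki < high_pki:
        return "likely-gain"
    # Many L3 hits and misses, with misses a large share of L3 accesses.
    if l3_hpki >= high_pki and l3_mpki >= high_pki:
        miss_ratio = l3_mpki / (l3_hpki + l3_mpki)
        if miss_ratio >= high_miss_ratio:
            return "likely-regression"
    return "unclear"
```

For example, with made-up inputs `classify(20.0, 1.0, 0.2)` falls into the first bullet and `classify(10.0, 15.0, 0.3)` into the second, while the same counters with a stall fraction of 0.553 would be treated as the DRAM-latency exception.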