By: Chester (lamchester.delete@this.gmail.com), May 26, 2021 3:03 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on May 26, 2021 2:54 pm wrote:
> Anon (no.delete@this.spam.com) on May 26, 2021 1:36 pm wrote:
> > Heikki Kultala (heikki.kul.tala.delete@this.gmail.com) on May 25, 2021 11:22 pm wrote:
> > > I'm not sure it even is a victim cache. As it's called "system level cache" it might be a
> > > memory side cache which also caches accesses made by other accelerators and PCIe devices.
> >
> > Would it be possible that this 16MB cache gives preference to
> > cache lines evicted with any variant of "shared" state?
> >
> > I don't think it would be very useful if not.
>
> AMD's L3 prefers shared state in a slightly different way - shared
> lines stay in L3 while other states don't. To be exact:
> - L3 hit from store data access: line always invalidated from L3
> - L3 hit from load data access: line invalidated if only one core read it (not shared state)
> - L3 hit by more than one core: line stays valid in L3 (shared state)
> - L3 hit from code fetch: line stays valid in L3 (also shared state I think)
>
> Also data from non-temporal prefetches is not filled into L3 when evicted from L2. Contrary to what some
> places say, non-temporal has nothing to do with RFO and cache coherency. Non-temporal simply means temporal
> locality will be bad (data not likely to be used again soon) so don't waste cache space with it.
>
> The only way you avoid RFOs is when you know you're overwriting an entire cache line. Examples include P6
> using a non-RFO cache protocol for rep movs, and possibly AVX-512 writes (though I couldn't find a source).
Oh, edit: non-temporal writes on x86 can violate the normal memory ordering, but not because of anything to do with avoiding RFOs. NT writes are batched in write-combining buffers and sent straight to memory, and they are not guaranteed to become visible in any particular order, so you need an sfence after them if ordering matters.
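
To make the L3 hit rules from my quoted post easier to follow, here is a toy sketch of them as a decision function. It only restates the four bullets. The function name, the access labels and the `cores_that_read` count are all made up for illustration, and how the hardware actually tracks which cores have read a line is not something I know, so treat that input as an assumption.

```python
def l3_line_stays_valid(access: str, cores_that_read: int = 1) -> bool:
    """Toy model of the quoted L3 hit rules (hypothetical helper, not AMD's logic).

    access          -- "store", "load" or "code" (labels invented for this sketch)
    cores_that_read -- assumed count of distinct cores that have read the line
    Returns True if the line stays valid in L3 after the hit.
    """
    if access == "store":
        # Store data hit: line is always invalidated from L3.
        return False
    if access == "code":
        # Code fetch hit: line stays valid (treated like shared state).
        return True
    if access == "load":
        # Load data hit: stays only if more than one core read it (shared).
        return cores_that_read > 1
    raise ValueError(f"unknown access type: {access!r}")


if __name__ == "__main__":
    for access, cores in [("store", 3), ("load", 1), ("load", 2), ("code", 1)]:
        print(access, cores, l3_line_stays_valid(access, cores))
```

The point of writing it this way is that the store case wins regardless of how many cores have read the line, which matches "always invalidated" in the first bullet.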
Topic | Posted By | Date |
---|---|---|
Ampere Altra Max 16MB LLC with 128 cores | Ganon | 2021/05/25 01:30 AM |
Ampere Altra Max 16MB LLC with 128 cores | anon | 2021/05/25 03:11 AM |
Ampere Altra Max 16MB LLC with 128 cores | Heikki Kultala | 2021/05/25 11:22 PM |
Ampere Altra Max 16MB LLC with 128 cores | Anon | 2021/05/26 01:36 PM |
Ampere Altra Max 16MB LLC with 128 cores | Chester | 2021/05/26 02:54 PM |
Ampere Altra Max 16MB LLC with 128 cores | Chester | 2021/05/26 03:03 PM |
Ampere Altra Max 16MB LLC with 128 cores | Doug S | 2021/05/25 07:50 AM |
Ampere Altra Max 16MB LLC with 128 cores | Andrei F | 2021/05/25 08:06 AM |
Ampere Altra Max 16MB LLC with 128 cores | Rayla | 2021/05/25 08:17 AM |
A few thoughts on Ampere's Altra Max | Paul A. Clayton | 2021/05/27 12:00 PM |
A few thoughts on Ampere's Altra Max | Björn Ragnar Björnsson | 2021/05/27 03:47 PM |
Yeah, I should have looked for and through a data sheet (NT) | Paul A. Clayton | 2021/05/27 06:25 PM |
A few thoughts on Ampere's Altra Max | Adrian | 2021/05/27 11:13 PM |
Boring can be profitable | Paul A. Clayton | 2021/05/29 12:18 PM |