By: Chester (lamchester.delete@this.gmail.com), May 6, 2021 4:29 pm
Room: Moderated Discussions
Heikki Kultala (heikk.i.kultal.a.delete@this.gmail.com) on May 6, 2021 12:46 pm wrote:
> wumpus (wumpus.delete@this.lost.in.a.hole) on May 6, 2021 9:06 am wrote:
> > Chester (lamchester.delete@this.gmail.com) on May 5, 2021 6:45 pm wrote:
> > > > > Assuming you still want to keep "L1 "way size" = page size, that gives you
> > > > > 8 cachelines per "way". I think once ARM made a 32-way L1 cache that they
> > > > > claimed it was faster as 32-way, but that was certainly the exception.
> > > > >
> > > > > Do you want "60-way" caches?
> > > >
> > > > No.
> > > >
> > > > > Add some sort of inital TLB lookup to the L1 latency
> > > > > (which of course would require more entries, because smaller pages)?
> > > >
> > > > I don't know what your question is.
> > >
> > > I think that refers to VIPT caches, where way size = page size is natural?
> > >
> > > But K10 got 3 cycle latency with a 64K 2-way L1D, so clearly way size = page size isn't the only way to go.
> > >
> >
> > True, but I have to assume that they looked up the TLB first. And you'd need a
> > bigger TLB than K10 if you had 512B pages instead of 4k pages. You could pull
> > 16 values, but that wouldn't be any easier than just going to a 16 way cache.
> >
> > Or possibly they went a step beyond a "way picker" that
> > would predict which page the value was on without the
> > full TLB. If so, they never seemed to use it again (with the possible exception of instruction caches).
> >
> > way size = page size is an extremely good fit, and the exceptions don't seem to repeat.
Do K7, K8, and K10 count as repeated exceptions? All had a 64K 2-way L1D (and L1i).
AFAIK Apple's M1 doesn't have way size = page size either, and it looks like a competent chip.
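To put numbers on that, here's a rough Python sketch of the way size arithmetic. The function names are my own, it assumes power-of-two sizes, and the 32K 8-way cache is just a hypothetical example where way size = page size. A nonzero result means a virtually indexed cache needs some index bits from above the page offset, so it has to handle aliasing or translation some other way.

```python
KIB = 1024

def way_size(cache_bytes, ways):
    """Bytes covered by one way of a set-associative cache."""
    return cache_bytes // ways

def index_bits_above_page_offset(cache_bytes, ways, page_bytes):
    """Number of index bits that fall above the page offset.

    Assumes power-of-two sizes. Returns 0 when way size <= page size,
    i.e. the index comes entirely from untranslated address bits.
    """
    ws = way_size(cache_bytes, ways)
    if ws <= page_bytes:
        return 0
    return (ws // page_bytes).bit_length() - 1

# K7/K8/K10 style L1D: 64K 2-way with 4K pages
print(way_size(64 * KIB, 2) // KIB)                          # 32 (KiB per way)
print(index_bits_above_page_offset(64 * KIB, 2, 4 * KIB))    # 3

# Hypothetical 32K 8-way cache with 4K pages: way size = page size
print(index_bits_above_page_offset(32 * KIB, 8, 4 * KIB))    # 0

# The 512B page case quoted above: a 512B way holds 8 lines of 64B
print(way_size(512, 1) // 64)                                # 8
```

So a 64K 2-way L1D with 4K pages has 3 index bits that aren't available before translation, which is the part K7/K8/K10 had to work around somehow.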
> > I'll note that the early Phenom had a nasty TLB bug (patching it took a performance hit),
> > which might make AMD wary of putting the TLB in such a critical part of the path.
>
> The TLB bug of Phenom was only related to TLB fills from caches. It had nothing to do with TLB hits.
>
> I don't remember the exact details but It was something
> like cache coherency being broken for cached page tables.
Yeah, that sounds exactly like AT's description of the problem. It was just a screwup in AMD's early L3 cache implementation that was fixed in later Phenoms.