By: Heikki Kultala (heikk.i.kultal.a.delete@this.gmail.com), May 6, 2021 11:46 am
Room: Moderated Discussions
wumpus (wumpus.delete@this.lost.in.a.hole) on May 6, 2021 9:06 am wrote:
> Chester (lamchester.delete@this.gmail.com) on May 5, 2021 6:45 pm wrote:
> > > > Assuming you still want to keep "L1 "way size" = page size, that gives you
> > > > 8 cachelines per "way". I think once ARM made a 32-way L1 cache that they
> > > > claimed it was faster as 32-way, but that was certainly the exception.
> > > >
> > > > Do you want "60-way" caches?
> > >
> > > No.
> > >
> > > > Add some sort of initial TLB lookup to the L1 latency
> > > > (which of course would require more entries, because smaller pages)?
> > >
> > > I don't know what your question is.
> >
> > I think that refers to VIPT caches, where way size = page size is natural?
> >
> > But K10 got 3 cycle latency with a 64K 2-way L1D, so clearly way size = page size isn't the only way to go.
> >
>
> True, but I have to assume that they looked up the TLB first. And you'd need a
> bigger TLB than K10 if you had 512B pages instead of 4k pages. You could pull
> 16 values, but that wouldn't be any easier than just going to a 16 way cache.
>
> Or possibly they went a step beyond a "way picker" that would predict which page the value was on without the
> full TLB. If so, they never seemed to use it again (with the possible exception of instruction caches).
>
> way size = page size is an extremely good fit, and the exceptions don't seem to repeat.
> I'll note that the early Phenom had a nasty TLB bug (patching it took a performance hit),
> which might make AMD wary of putting the TLB in such a critical part of the path.
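On the "way size = page size" point: the reason it fits so well is that the set index then comes entirely from the page-offset bits, which are identical in the virtual and physical address, so the L1 can be indexed in parallel with the TLB lookup and only the tag compare needs the translation. A rough back-of-the-envelope sketch of the constraint (the sizes are purely illustrative, not any particular core):

#include <stdio.h>

static unsigned log2u(unsigned long long x)
{
    unsigned n = 0;
    while (x >>= 1)
        n++;
    return n;
}

static void check(unsigned long long cache_bytes, unsigned ways,
                  unsigned long long page_bytes, unsigned line_bytes)
{
    unsigned long long way_bytes = cache_bytes / ways;
    unsigned index_bits  = log2u(way_bytes / line_bytes);
    unsigned offset_bits = log2u(line_bytes);
    unsigned page_bits   = log2u(page_bytes);

    /* The set index is bits [offset_bits, offset_bits + index_bits) of the
       address. If they all fall inside the page offset, the virtual and
       physical index are the same, so the cache can be indexed in parallel
       with the TLB lookup. */
    int fits = (offset_bits + index_bits) <= page_bits;
    printf("%llu KB %u-way, %llu B pages (way = %llu B): %s\n",
           cache_bytes / 1024, ways, page_bytes, way_bytes,
           fits ? "index fits in page offset"
                : "index needs translated bits (TLB first, way prediction, "
                  "or alias handling)");
}

int main(void)
{
    check(32 * 1024,  8, 4096, 64);  /* way size = page size: the usual fit */
    check(64 * 1024,  2, 4096, 64);  /* K10-style: 32 KB ways, 3 extra bits */
    check(32 * 1024, 64,  512, 64);  /* 512 B pages -> 64 ways at 32 KB     */
    return 0;
}

With 512 B pages, keeping that property at 32 KB would force 64 ways, which is where the "60-way" quip comes from; K10's 64 KB 2-way L1 has 32 KB ways, i.e. three index bits above the page offset, so it had to resolve those bits some other way.
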
The TLB bug of Phenom was only related to TLB fills from caches; it had nothing to do with TLB hits.
I don't remember the exact details, but it was something like cache coherency being broken for cached page tables.
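
"TLB fill" here means the page walk the hardware does on a TLB miss: the walker reads page-table entries through the memory hierarchy, and those reads can hit in the cache, so a coherency problem on cached PTEs corrupts the translation being installed. A toy model of the walk (purely illustrative C, two levels, nothing AMD-specific):

#include <stdio.h>
#include <stdint.h>

/* Toy two-level page walk. "Memory" is a flat array of 64-bit words; in
   hardware the walker's reads may be satisfied from the data cache, and
   that cached-PTE path is where fill-time coherency has to hold. */

#define PTE_PRESENT 0x1ULL
#define MEM_WORDS   4096

static uint64_t mem[MEM_WORDS];   /* toy physical memory, 8-byte words */

/* Stand-in for the walker's memory access (possibly cache-hitting in HW). */
static uint64_t read_pte(uint64_t phys) { return mem[phys / 8]; }

/* 4 KiB pages, 512-entry tables (9 index bits per level). Returns the
   physical frame base, or 0 if an entry is not present. */
static uint64_t walk(uint64_t root, uint64_t vaddr)
{
    uint64_t l1 = read_pte(root + ((vaddr >> 21) & 0x1ff) * 8);
    if (!(l1 & PTE_PRESENT)) return 0;

    uint64_t l0 = read_pte((l1 & ~0xfffULL) + ((vaddr >> 12) & 0x1ff) * 8);
    if (!(l0 & PTE_PRESENT)) return 0;

    /* Whatever this walk returns is what gets installed in the TLB. If
       read_pte() had returned a stale cached copy of a modified PTE, the
       TLB would now hold a wrong translation -- a fill-path problem,
       independent of how fast or correct TLB *hits* are. */
    return l0 & ~0xfffULL;
}

int main(void)
{
    /* Map virtual page 0 -> physical frame 0x4000 via a table at 0x1000. */
    mem[0]          = 0x1000 | PTE_PRESENT;   /* root[0] -> L0 table */
    mem[0x1000 / 8] = 0x4000 | PTE_PRESENT;   /* L0[0]   -> frame    */
    printf("frame for vaddr 0: %#llx\n", (unsigned long long)walk(0, 0));
    return 0;
}

The point is that a bug in that fill path says nothing about the latency or correctness of TLB hits, which is what actually sits on the load critical path.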