By: wumpus (wumpus.delete@this.lost.in.a.hole), May 6, 2021 8:06 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on May 5, 2021 6:45 pm wrote:
> > > Assuming you still want to keep "L1 "way size" = page size, that gives you
> > > 8 cachelines per "way". I think once ARM made a 32-way L1 cache that they
> > > claimed it was faster as 32-way, but that was certainly the exception.
> > >
> > > Do you want "60-way" caches?
> >
> > No.
> >
> > > Add some sort of initial TLB lookup to the L1 latency
> > > (which of course would require more entries, because smaller pages)?
> >
> > I don't know what your question is.
>
> I think that refers to VIPT caches, where way size = page size is natural?
>
> But K10 got 3 cycle latency with a 64K 2-way L1D, so clearly way size = page size isn't the only way to go.
>
True, but I have to assume they looked up the TLB first. And you'd need a bigger TLB than K10's if you had 512B pages instead of 4k pages. You could instead speculatively pull all 16 candidate values (2 ways times 8 page-sized aliases per 32K way), but that wouldn't be any easier than just going to a 16-way cache (rough numbers sketched below).
Or possibly they went a step beyond a "way picker", to something that predicted which page the value was on without a full TLB lookup. If so, they never seemed to use that trick again (with the possible exception of instruction caches).
Way size = page size is an extremely good fit, and the exceptions don't seem to get repeated. I'll also note that the early Phenom had a nasty TLB bug (the patch for it took a performance hit), which might make AMD wary of putting the TLB in such a critical part of the path.
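
For concreteness, here's a rough back-of-the-envelope sketch of those numbers (Python; the scenario names, the 32K size for the "60-way" case, and the 64B line size are my own assumptions, not anything from AMD or ARM):

from math import log2

def geometry(cache_bytes, ways, page_bytes, line_bytes=64):
    way_bytes = cache_bytes // ways
    index_bits = int(log2(way_bytes // line_bytes))
    offset_bits = int(log2(line_bytes))
    page_offset_bits = int(log2(page_bytes))
    # Index bits that land above the page offset need either a prior TLB
    # lookup or speculation over every alias they create.
    aliased_bits = max(0, index_bits + offset_bits - page_offset_bits)
    candidates = ways * (1 << aliased_bits)
    lines_per_page = page_bytes // line_bytes
    return way_bytes, aliased_bits, candidates, lines_per_page

scenarios = {
    "K10-like: 64K 2-way, 4K pages":      (64 * 1024, 2, 4096),
    "way = page: 32K 64-way, 512B pages": (32 * 1024, 64, 512),
    "64K 2-way, 512B pages":              (64 * 1024, 2, 512),
}

for name, cfg in scenarios.items():
    way, aliased, cand, lpp = geometry(*cfg)
    print(f"{name}: way = {way}B, index bits above page offset = {aliased}, "
          f"lines to pull without a prior TLB lookup = {cand}, "
          f"cachelines per page = {lpp}")

The K10-style 64K 2-way case works out to 3 index bits above the 4k page offset and 16 candidate lines, which is where the "pull 16 values" above comes from; the 512B-page, way = page case gives the 8 cachelines per way and ~64 ways mentioned in the quoted post.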