By: --- (---.delete@this.redheron.com), September 8, 2021 2:28 pm
Room: Moderated Discussions
sr (nobody.delete@this.nowhere.com) on September 8, 2021 11:35 am wrote:
> Hugo Décharnes (hdecharn.delete@this.outlook.fr) on September 8, 2021 10:46 am wrote:
> > You just repeated what I said, that is, VIVT is not the way to go.
> >
> > BTW, L2 access time is critical, and TLB coverage (not accuracy) does not compensate, in that most
> > workloads are bounded by cache latency, not TLB coverage. Yes, VIVPT enables moving the TLB away
> > from the time-critical load-return path, enabling bigger TLBs. But you must reduce the L2 hit latency
> > as much as possible. We are talking about a few percent of performance for a typical big core.
>
> The point of a VIVT L1 is that there's zero need to burn power on TLB translation for an L1 hit.
> Data already in the cache can be accessed and verified without needing to translate the address. As the
> CPU core can do many L1 operations per cycle, at least the L1-data TLB needs to be multiported,
> whereas the L2 can service only one request per clock, so a single-ported TLB is fine.
The claim that a highly performant TLB must be multi-ported is a myth.
The theory is explained here: https://eprints.soton.ac.uk/347147/1/__userfiles.soton.ac.uk_Users_spd_mydesktop_MALEC.pdf
The essential insights are that
- most memory requests are strongly clustered to a single page or a few pages, and
- to the extent that requests are spread over many pages, that same code is probably not latency-critical down to the single-cycle level.
----------------------
It will (of course) come as no surprise that this is exactly what Apple implements.
The Apple D-TLB is single-ported, supporting 4 piggyback ports. Hence a single TLB lookup can service up to four memory requests (a mix of reads and writes) in a single cycle.
The TLB is (as best as my experimental probing can tell) preceded by four queues, selected by the lowest two bits of the virtual page number, each holding up to three or four entries.
Hence the flow is that memory requests (for M1, up to 4 per cycle) are sorted into these queues, which, as a practical matter, pretty much gathers together multiple requests targeting the same page. Queues appear to be serviced either oldest-first or round-robin (my guess is round-robin, as it's easier and makes no practical difference to performance).
Clearly you can construct a pathological stream that behaves badly under this design, running at 1 lookup per cycle (I did so as part of my tests). But with almost any realistic pattern you get three or four requests serviced per cycle, with a worst case of one to three cycles of latency as the four different queues are serviced, and with up to three elements waiting in a queue, all serviced together. I could definitely hit cases where the average number of requests serviced was around 2.8 per cycle (i.e., a mix of pages each cycle, with three same-page references sorted to the same queue and then serviced together).
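If it helps to make this concrete, here is a toy behavioral model of what I believe is going on. To be clear, this is my own sketch: the queue count, queue selection by low VPN bits, piggyback width, and round-robin policy are my inferences from probing, not anything Apple has published.

```python
from collections import deque

PAGE_SHIFT = 14          # 16 KiB pages
NUM_QUEUES = 4           # selected by the low 2 bits of the virtual page number
PIGGYBACK = 4            # one lookup can service up to 4 same-page requests

def simulate(addresses, issue_width=4):
    """Feed a stream of virtual addresses through the queue model; return
    the average number of requests serviced per cycle."""
    queues = [deque() for _ in range(NUM_QUEUES)]
    it = iter(addresses)
    rr = 0                       # round-robin pointer over the queues
    serviced = cycles = 0
    issuing = True
    while issuing or any(queues):
        # Issue stage: sort up to issue_width new requests into the queues.
        for _ in range(issue_width):
            addr = next(it, None)
            if addr is None:
                issuing = False
                break
            vpn = addr >> PAGE_SHIFT
            queues[vpn & (NUM_QUEUES - 1)].append(vpn)
        # Service stage: one TLB lookup on the next non-empty queue, with
        # up to PIGGYBACK same-page requests riding along on that lookup.
        for i in range(NUM_QUEUES):
            q = queues[(rr + i) % NUM_QUEUES]
            if q:
                page = q[0]
                hits = [v for v in q if v == page][:PIGGYBACK]
                for v in hits:
                    q.remove(v)
                serviced += len(hits)
                rr = (rr + i + 1) % NUM_QUEUES
                break
        cycles += 1
    return serviced / cycles

# A clustered stream (consecutive loads within a page) services ~4/cycle;
# a stream touching a new page on every request degenerates to 1/cycle.
print(simulate(0x10000000 + 8 * i for i in range(4096)))
print(simulate(0x10000000 + (i << PAGE_SHIFT) for i in range(4096)))
```

Running it reproduces both ends of the behavior described above: the clustered stream retires essentially four requests per TLB lookup, while the page-per-request stride is exactly the pathological 1-per-cycle case.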
(The same sort of insight [requests cluster to individual cache lines] makes naive cache banking work much less well than you might expect, because multiple requests in a given cycle end up routing to the same line.
Apple uses the same sort of design for the D-cache: it is split into two banks by the lowest bit of the cache line ID, with piggyback ports allowing a single line lookup to service multiple requests, and with the two banks effectively providing dual porting.)
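A sketch of why piggybacking rescues banking (again my model, not Apple's RTL; the 64-byte line size is an assumption):

```python
from collections import Counter

LINE_SHIFT = 6           # assume 64-byte cache lines

def cycles_for_group(addresses):
    """Cycles to service one issue group, given two banks (selected by the
    low bit of the line ID) that each look up one distinct line per cycle,
    with piggyback ports covering all same-line requests in that lookup."""
    distinct_lines = set(a >> LINE_SHIFT for a in addresses)
    per_bank = Counter(line & 1 for line in distinct_lines)
    return max(per_bank.values(), default=0)

# Four loads hitting one line: one lookup, one cycle.
print(cycles_for_group([0x100, 0x108, 0x110, 0x118]))  # -> 1
# Four loads to distinct lines that all select bank 0: fully serialized.
print(cycles_for_group([0x000, 0x080, 0x100, 0x180]))  # -> 4
```

Without the piggyback ports, the first (and most common) case would be just as serialized as the second, which is exactly why naive banking underdelivers.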
---------------
For the I-cache, if you care, the design is rather different: both the physical page number and the cache line ID are preserved from one cycle to the next in temporary register storage, so that on most cycles both their timing cost and their energy cost can be avoided.
Once again, of course, this works because of strong locality (even more so with 16 KiB pages).
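A minimal sketch of that reuse idea (the names and structure are mine, purely illustrative; the line-ID/tag reuse would work the same way and is elided):

```python
PAGE_SHIFT = 14          # 16 KiB pages

class FetchFrontEnd:
    """Hold the previous cycle's translation in a register and skip the
    TLB lookup whenever the next fetch stays within the same page."""
    def __init__(self, translate):
        self.translate = translate           # full TLB lookup (slow path)
        self.last_vpn = self.last_ppn = None
        self.tlb_lookups = 0

    def fetch(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn != self.last_vpn:             # page changed: pay for a lookup
            self.tlb_lookups += 1
            self.last_vpn, self.last_ppn = vpn, self.translate(vpn)
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        return (self.last_ppn << PAGE_SHIFT) | offset

fe = FetchFrontEnd(lambda vpn: vpn ^ 0x5A)   # dummy page table for the demo
for pc in range(0x4000, 0x4400, 4):          # sequential fetch within a page
    fe.fetch(pc)
print(fe.tlb_lookups)                        # -> 1: strong locality pays off
```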
Finally, I cannot figure out how Apple bypasses D-cache way prediction, but they don't seem to need it -- at least, even the most random stream of addresses (in terms of hitting different ways) that I could generate never seemed to cause so much as a hiccup from invalid way prediction. I have a few hypotheses, but no strong leads, and no patents suggest an explanation.
On the I-side it's easier -- way info is stored in various appropriate places (for example, the Next Fetch Predictor), and this is described in patents.