V-way-style cache for compression and tag-only inclusion

By: --- (---.delete@this.redheron.com), April 7, 2022 12:58 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on April 7, 2022 12:12 pm wrote:
> --- (---.delete@this.redheron.com) on April 6, 2022 10:34 am wrote:
> [snip]
> > The simple (add a bit in the tag) scheme used by Apple (and, for all I know, AMD and now Intel) is an easy
> > retrofit, but only really saves you power, and leaves an SRAM line unused.
>
> Using a mechanism like that for the V-way cache (where more tags are provided than data blocks and NUCA-inspired
> indirection maps to storage while reducing the number of tags checked at a given index — some cache compression
> schemes use more limited indirection but provide more tags than worst-case data storage required) would
> provide additional tags to facilitate more general compression, tag-inclusion without data inclusion, and
> perhaps even copy optimization (support for lossy compression with multiple partial block hits might be
> useful to facilitate non-block-aligned copies). Extra tags might also be useful for estimating reuse distance
> (particularly for 'almost fitting' working sets?) for better replacement choices and perhaps for prefetch
> metadata (cache compression could also provide storage for metadata).

Your points (the presence of extra tags, and indirect cache placement) are well taken, and (just to put it on the record) Apple, at least according to the patent record, already has both in place: extra tags, definitely used so that the L2 can snoop-filter the L1 and the SLC can snoop-filter the L2, and indirect block placement.

Using these gets you a little more zero-block leverage, but it is still not optimal. The ZCA (Zero-Content Augmented cache) idea of one tag covering multiple lines, along with a bitmap, strikes me as a better design point once you decide you are going to do this seriously.
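To make the ZCA-style design point concrete, here is a minimal software sketch of such a tag-only structure: one tag covers a region of eight lines, and a per-tag bitmap records which lines are known all-zero, so hits can skip the data SRAM (and DRAM) entirely. All field widths, names, and the direct-mapped organization are hypothetical choices for illustration, not anything from the Apple patents.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ZCA-style tag-only cache: one tag per 8-line region,
 * with a bitmap marking which of those lines are all-zero. */

#define LINE_BITS    6                      /* 64-byte lines */
#define REGION_LINES 8                      /* lines covered per tag */
#define REGION_BITS  (LINE_BITS + 3)        /* 512-byte region */
#define NUM_ENTRIES  256                    /* direct-mapped for simplicity */

typedef struct {
    uint64_t region_tag;                    /* addr >> REGION_BITS */
    uint8_t  zero_bitmap;                   /* bit i set => line i is all-zero */
    bool     valid;
} zca_entry_t;

static zca_entry_t zca[NUM_ENTRIES];

static unsigned zca_index(uint64_t addr)
{
    return (unsigned)((addr >> REGION_BITS) & (NUM_ENTRIES - 1));
}

static unsigned zca_line(uint64_t addr)
{
    return (unsigned)((addr >> LINE_BITS) & (REGION_LINES - 1));
}

/* True if the line containing addr is recorded as all-zero, so the
 * data access can be skipped and zeros supplied directly. */
bool zca_line_is_zero(uint64_t addr)
{
    zca_entry_t *e = &zca[zca_index(addr)];
    return e->valid &&
           e->region_tag == (addr >> REGION_BITS) &&
           (e->zero_bitmap & (1u << zca_line(addr)));
}

/* Record a line as zero, e.g. on a full-line store of zeros. */
void zca_mark_zero(uint64_t addr)
{
    zca_entry_t *e = &zca[zca_index(addr)];
    uint64_t tag = addr >> REGION_BITS;
    if (!e->valid || e->region_tag != tag) {
        e->valid = true;                    /* evict previous region */
        e->region_tag = tag;
        e->zero_bitmap = 0;
    }
    e->zero_bitmap |= (uint8_t)(1u << zca_line(addr));
}

/* A non-zero store to a tracked line must clear its bit. */
void zca_mark_nonzero(uint64_t addr)
{
    zca_entry_t *e = &zca[zca_index(addr)];
    if (e->valid && e->region_tag == (addr >> REGION_BITS))
        e->zero_bitmap &= (uint8_t)~(1u << zca_line(addr));
}
```

The payoff in the sketch is visible in `zca_line_is_zero`: one small tag entry answers for eight lines, which is where the area and power savings over a per-line scheme come from.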

Of course, the other way to go (and this possibility is mentioned in the "everything else we can think of to possibly patent" section of Apple's Sparse Cache Data patent) is to work toward a generic compressed cache, meaning that you have extra bits attached to every tag anyway, and it's then just a question of how you define those extra bits to cover various forms of compression...


> (Main memory filling gaps from lossy compression is one possibility. Theoretically,
> lost information might also be recoverable via computation or treated as unavailable
> in a timely manner — approximate computing has broad uses.)
>
> > a parallel
> > cache (probably best done at L3) that consists of only tags, no data lines, and associates with each tag a
> > bitmap of 8 (or 16 or 32 or ...) bits marking the associated
> > lines as zero-only. The lines are assumed to be
> > used in similar ways so that MOESI tags can be shared (or
> > fairly simple additional bits added per line). Obviously
> > the tag lookup needs to be slightly different given that
> > one tag covers a larger area. The payoff is a small
> > amount of extra area that covers most of the common use
> > cases, from zero'd pages to zeroing data structures
> > and arrays, while saving a fair amount of power through reduced SRAM and DRAM accesses.
> > Of course you need to ensure that the extra tag accesses do not squander
> > the saved SRAM accesses, but overall the scheme looks like a win.
>
> I suspect one would want a multi-level design providing fast read and write by a core. It seems likely that
> utilization of the tag storage and checking hardware could be increased by providing additional uses.
>
> (The Mill's backless store uses PTEs to mark not-present zero pages which automatically
> get a physical page on eviction from last level cache [one benefit of virtual caches].
> The Mill also has zero-on-allocation stack frames. [Using present tense for a paper and
> not yet finalized design seems a bit off, but the ideas have been presented.])

Apple has something of a version of that in the same sparsity patent: the idea is to mark a page as zero'd in the PTE and to supply its (zero) data earlier in the load path, probably at the SLC rather than going all the way to DRAM, which costs less power.
https://patents.google.com/patent/US10691610B2
As with the cache use case, if the primary goal (for now anyway) is power savings, it's difficult to tell from outside whether it has been implemented.

Once again this gives you a win on both the write and the read side, though handling coherence could be a bit tricky: presumably some sort of "flash invalidate every line in this address range" message sent to every cache? So it's possible that it's more an aspirational idea than something already implemented.
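The zero-page PTE flow above can be sketched in a few lines. This is a functional model only, with a hypothetical `zero` PTE bit: loads from a zero-marked page return constant zeros without any memory access, and the first store allocates backing storage, clears the bit, and is where hardware would have to issue the flash-invalidate for the page's address range.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

typedef struct {
    bool     zero;      /* hypothetical PTE "page is all zeros" bit */
    uint8_t *backing;   /* physical backing; NULL while marked zero */
} pte_t;

/* Load: a zero-marked page is satisfied early (e.g. at the SLC) with
 * constant zero data, never reaching DRAM. */
uint8_t load_byte(const pte_t *pte, unsigned offset)
{
    if (pte->zero)
        return 0;                       /* supplied without a memory access */
    return pte->backing[offset];
}

/* Store: the first write to a zero page allocates real storage and, in
 * hardware, would broadcast the flash-invalidate for the page's range
 * so no cache keeps serving stale zero lines. */
void store_byte(pte_t *pte, unsigned offset, uint8_t v)
{
    if (pte->zero) {
        pte->backing = calloc(1, PAGE_SIZE);
        pte->zero = false;
        /* flash_invalidate_range(page_base, PAGE_SIZE);  -- coherence step */
    }
    pte->backing[offset] = v;
}
```

The coherence hazard lives entirely in `store_byte`: between clearing the bit and invalidating remote copies, another cache could still return zeros for a line that is about to become non-zero, which is why the range invalidate has to be ordered before the store becomes visible.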