How is MASKMOVDQU implemented?

By: Travis Downs (travis.downs.delete@this.gmail.com), February 28, 2019 10:49 am
Room: Moderated Discussions
... or how would it likely be implemented?

This instruction is special because, in principle (and in practice, if you believe the documentation), it lets you do stores using the "fire and forget" (aka non-allocating) protocol but with byte granularity: the bytes you want to write are merged with the existing values of the ones you don't.

At least that's the promise and Intel goes so far as to put it in the SDM:


The MASKMOVDQU instruction can be used to improve performance of algorithms that need to merge data on a byte-by-byte basis. MASKMOVDQU should not cause a read for ownership; doing so generates unnecessary bandwidth since data is to be written directly using the byte-mask without allocating old data prior to the store.


The use case isn't all that exotic: it would be useful for any case where you are doing sparse writes of less than a cache line in an out-of-cache data set with little locality of reference (so caching doesn't help).
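For concreteness, here is a minimal sketch of how you reach the instruction from C, through the SSE2 intrinsic `_mm_maskmoveu_si128` in `<emmintrin.h>`. The helper name `masked_nt_store16` and its byte-array interface are made up for illustration. A byte is written only if the high bit of the corresponding mask byte is set. The trailing `_mm_sfence()` is there because the store is weakly ordered.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_maskmoveu_si128, _mm_loadu_si128 */
#include <stdint.h>

/* Hypothetical helper: write data[i] to dst[i] only where the high bit of
 * select[i] is set. Other bytes at dst are left as they were. */
static void masked_nt_store16(uint8_t *dst, const uint8_t data[16],
                              const uint8_t select[16])
{
    assert(dst && data && select);
    __m128i v = _mm_loadu_si128((const __m128i *)data);
    __m128i m = _mm_loadu_si128((const __m128i *)select);
    _mm_maskmoveu_si128(v, m, (char *)dst);  /* compiles to MASKMOVDQU */
    _mm_sfence();  /* order the non-temporal store before later stores */
}
```

This only shows the architectural behaviour. It says nothing about what happens to the line underneath, which is the actual question.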

How would you implement such a beast? I guess it implies a 64-bit "valid byte bitmap" (one bit per byte of a 64-byte line) starting in the store buffer and then carried through every buffer on the way out towards the outer levels of cache. As far as I can tell, you would probably already need that anyway for regular non-temporal stores (which don't take a mask) when you write less than a cache line before the WC buffer gets evicted.
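To make the bitmap idea concrete, here is a toy software model of one write-combining buffer entry. It is purely a sketch of the guess above, not a description of any real hardware. The names `wc_entry`, `wc_masked_store16` and `wc_is_full` are invented, and the 16-bit `byte_mask` (bit i selects byte i) is a simplification of the instruction's byte-per-lane mask.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

/* Hypothetical model of one write-combining buffer entry. */
struct wc_entry {
    uint8_t  data[LINE_BYTES];
    uint64_t valid;  /* bit i set => data[i] holds a byte that was written */
};

static void wc_init(struct wc_entry *e) { memset(e, 0, sizeof *e); }

/* Record a masked 16-byte store at `offset` within the line.
 * Bit i of byte_mask selects src[i]; unselected bytes are not touched. */
static void wc_masked_store16(struct wc_entry *e, unsigned offset,
                              const uint8_t src[16], uint16_t byte_mask)
{
    assert(offset + 16 <= LINE_BYTES);
    for (unsigned i = 0; i < 16; i++) {
        if (byte_mask & (1u << i)) {
            e->data[offset + i] = src[i];
            e->valid |= (uint64_t)1 << (offset + i);
        }
    }
}

/* A fully valid line can be written out without reading the old contents. */
static int wc_is_full(const struct wc_entry *e)
{
    return e->valid == UINT64_MAX;
}
```

In this model an unmasked non-temporal store is just the `byte_mask == 0xFFFF` case, which is why the same bitmap would cover partially filled WC buffers too.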

However, where does the merging stop? Does the mask get pushed all the way down to the memory controller and then does the memory controller actually push the byte-granular writes down to DRAM? I guess modern DDR still has that capability, but it seems heavily optimized for burst transfers, so it's unclear to me how efficient it would be.

Or perhaps the mask is only pushed down so far, and at some point a read of the whole line occurs, the merging happens and the line is written back - maybe in the L3 or in the MC.
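The "read, merge, write back" option would amount to something like the following at whatever level does the merge. Again this is only a sketch under my own assumptions: `merge_partial_line` is a made-up name, and `valid` is the per-byte bitmap that travelled down with the partial line.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64

/* Hypothetical read-modify-write merge: `line` is the full old line that was
 * read at the merge point, `partial` is the incoming data, and bit i of
 * `valid` says whether partial[i] should overwrite line[i]. */
static void merge_partial_line(uint8_t line[LINE_BYTES],
                               const uint8_t partial[LINE_BYTES],
                               uint64_t valid)
{
    assert(line && partial);
    for (unsigned i = 0; i < LINE_BYTES; i++) {
        if ((valid >> i) & 1)
            line[i] = partial[i];
    }
    /* After this, the whole line is written back as one ordinary write. */
}
```

The alternative, pushing the mask all the way to DRAM, would skip the read and send `valid` along with the data instead.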

Any ideas?

Intel never bothered to extend these instructions to 256 or 512 bits with AVX or AVX-512, so maybe they are on life support and are unlikely to have efficient implementations in the future.