By: Etienne Lorrain (etienne_lorrain.delete@this.yahoo.fr), February 24, 2021 7:24 am
Room: Moderated Discussions
Wilco (wilco.dijkstra.delete@this.ntlworld.com) on February 24, 2021 4:37 am wrote:
> Anon (no.delete@this.spam.com) on February 23, 2021 6:26 am wrote:
> > Wilco (wilco.dijkstra.delete@this.ntlworld.com) on February 23, 2021 3:48 am wrote:
> > > You forgot the sarcasm tag :-)
> >
> > Poor implementations don't prove an efficient implementation isn't possible.
>
> Given even a bad software implementation (the SSE2 one) thrashes rep movsb on modern
> cores, it proves that it is not at all trivial to do a good hardware memcpy like
> you suggested. It's not like Intel/AMD haven't been trying for years.
>
> > > bench-memcpy-random in GLIBC shows just how "efficient" rep movsb (__memcpy_erms) is on my 3700X:
> >
> > Your benchmark shows how hard it is to find the optimal software implementation of memcpy, there
> > are 7 variations and a surprising fastest one (__memcpy_sse2_unaligned, what happens to AVX?), this
> > show a somewhat lazy AMD that didn't even put their microcode to emit the best uop sequence.
>
> There isn't a single optimal implementation of memcpy for all possible use-cases. Software allows you
> to select whichever one works best, and you can tweak it further, remove bottlenecks etc. However with
> hardware you are stuck with the one in your CPU. In order for hardware memcpy to work out, it has to
> be as fast as the best software implementation. So far nobody has proven this is feasible.
>
> Wilco
To me, it looks a bit strange to talk about either microcode or hardware for memcpy (with an OoO core):
- microcode is not the exact code that gets inserted into the execution window. If you have a rep movsb with initial ecx=7, you have to fill the execution window with 7 reads from the source address, 7 writes to the destination address (or fewer, if an optimisation reads several bytes at once), and a clear of ecx if it is still live. The problem is probably how many execution-window instructions you can insert per cycle while executing microcode.
- hardware memcpy would mean some kind of DMA (into caches) and pausing the execution window?
What is probably needed is a set of specialised "execution window instructions": one that can read up to a cache line, another to mask / insert from another cache line, and a third to write up to a cache line. The "rep movsb" microcode would then insert (maybe a lot of) such "execution window instructions" into the "instructions in flight".
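To make the idea a bit more concrete, here is a rough software model in C of the three kinds of entries described above. Everything in it is an assumption for illustration: the 64-byte line size, the names `linewise_copy`, `bytewise_entries` and `op_counts`, and the way entries are counted. It works on a flat simulated memory so that alignment is explicit, and it says nothing about how real microcode or a real core behaves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE 64u /* assumed cache-line size in bytes */

/* Counts of each hypothetical window entry issued for one copy. */
struct op_counts { size_t reads, inserts, writes; };

/* Entries the naive byte-wise expansion would need: n reads, n writes,
   plus one entry to clear the count register. */
static size_t bytewise_entries(size_t n) { return 2 * n + 1; }

/* Model of a forward, non-overlapping copy inside a flat simulated memory.
   dst and src are simulated addresses (offsets into mem).
   Returns the total number of hypothetical window entries used. */
static size_t linewise_copy(uint8_t *mem, size_t dst, size_t src, size_t n,
                            struct op_counts *c)
{
    memset(c, 0, sizeof *c);
    while (n > 0) {
        uint8_t line[LINE];

        /* One chunk never crosses a destination line boundary. */
        size_t len = LINE - dst % LINE;
        if (len > n) len = n;
        assert(len <= LINE);

        /* Entry 1: read up to the end of the current source line. */
        size_t first = LINE - src % LINE;
        if (first > len) first = len;
        memcpy(line, mem + src, first);
        c->reads++;

        /* Entry 2: the chunk spans two source lines, so mask / insert
           the remainder from the next source line. */
        if (first < len) {
            memcpy(line + first, mem + src + first, len - first);
            c->inserts++;
        }

        /* Entry 3: write up to one destination line. */
        memcpy(mem + dst, line, len);
        c->writes++;

        dst += len;
        src += len;
        n -= len;
    }
    /* +1 models the final clear of the count register. */
    return c->reads + c->inserts + c->writes + 1;
}
```

In this model the ecx=7 case with both ranges inside a single line takes 3 entries instead of the 15 of the byte-wise expansion, and a misaligned 100-byte copy takes a handful of entries instead of 201. The point is only that the number of entries scales with cache lines rather than bytes.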
Maybe that is what you meant; if so, please ignore this message...