By: rwessel (rwessel.delete@this.yahoo.com), February 23, 2021 10:33 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on February 23, 2021 8:52 am wrote:
> vvid (no.delete@this.thanks.com) on February 23, 2021 6:41 am wrote:
> > Anon (no.delete@this.spam.com) on February 23, 2021 3:05 am wrote:
> > > anon2 (anon.delete@this.anon.com) on February 22, 2021 7:17 pm wrote:
> > > > I don't think memcpy, memset instructions are bad per se, though I still don't understand
> > > > the fascination with them, unless their proponents are going to move on to do-daxpy, route-ip-packet,
> > > > gzip-memory, etc instructions when/if one day Intel's rep ; mov finally doesn't suck. But
> > > > I digress, the point was not a totally open-ended "ISA does not matter".
> > >
> > > memcpy is quite common, easy to implement in hardware and very inefficient to implement in software.
> >
> > So easy, that it took Intel literally 4 decades to achieve an acceptable* performance of REP MOVSB.
> > * in some situations
>
> Well, in many common situations, rep movs has been quite good since at least Nehalem. And
> up to and including the i386 it was quite good in a different, but even higher, proportion
> of common situations (all small copies with lengths unknown at compile time).
> So the unsatisfactory state of affairs, where rep movsb was good in too few
> situations, lasted less than 20 years. Maybe less than 15, I am not too sure.
>
> Also, IMHO, even for an ideal "rep movsb" implementation it is acceptable to be a few clocks slower
> than the wider "rep movsX" variants when the length of the copy is a small multiple of sizeof(X).
>
>
> Besides, x86 'rep movsb' is too strictly defined in situations where the source and destination buffers overlap.
> It would be easier to achieve top performance if overlaps were either undefined (in a bounded manner,
> like: never read outside of src[], never write outside of dst[], but other than that the content
> of dst[] can be any mix of zeros and original src and dst bytes) or defined to do nothing.
No one would object to rep movsb being slow if the operands overlap. IBM even defines MVC that way* on Z (while requiring similar byte-by-byte semantics).
*With an exception for a one-byte overlap (destination starting one byte after the source), which is commonly used to clear storage by replicating the first byte through the area (IOW, memset).
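To illustrate that footnote, here is a rough sketch in C of why strict byte-by-byte forward semantics turn an overlapping copy into a fill. This is not MVC or rep movsb itself, only a model of the semantics; the names bytewise_forward_copy and fill_by_overlap are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a strictly forward, byte-at-a-time copy.
 * Each byte is stored before the next one is loaded, so a store can
 * feed a later load when the operands overlap. */
static void bytewise_forward_copy(unsigned char *dst,
                                  const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: fill buf[0..len) with `value` using the overlap trick.
 * Set the first byte, then copy the area onto itself shifted by one byte,
 * so the first byte is replicated through the rest of the area. */
static void fill_by_overlap(unsigned char *buf, size_t len, unsigned char value)
{
    assert(buf != NULL);
    if (len == 0)
        return;
    buf[0] = value;
    bytewise_forward_copy(buf + 1, buf, len - 1);
}
```

Calling fill_by_overlap(area, sizeof area, 0) leaves every byte of area zero, which is the memset effect described above. A copy that loads several bytes before storing them would not give this result, which is why the overlapping case constrains the implementation.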