By: Brett (ggtgp.delete@this.yahoo.com), October 2, 2021 11:45 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 2, 2021 10:43 am wrote:
> rpg (a.delete@this.b.com) on October 2, 2021 2:51 am wrote:
> >
> > Why handle memcpy with microcoded instructions/cracked uOPs?
> >
> > Wouldn't a simple DMA unit be able to handle this?
>
> DMA units are stupid.
>
> Seriously. Stop perpetuating that myth from the 80s.
>
> Back in the days long long gone, DMA units made sense because
>
> (a) CPU's were often slower than DRAM
>
> (b) caches weren't a thing
>
> and neither of those have been true for decades by now outside of some very very embedded stuff where the
> CPU isn't even remotely the main concern of the hardware (ie there are places where people have a very weak
> CPU that just handles some bookkeeping functionality, and the real heavy lifting is done by specialized hardware
> - very much including DMA engines built into those things. Think networking or media processors).
>
> Also, stop thinking that memory copies are about moving big amounts of data. That's very seldom
> actually true outside of some broken memory throughput benchmarks. The most common thing by
> far is moving small stuff that is a few tens of bytes in size, often isn't cacheline aligned,
> and is quite often somewhere in the cache hierarchy (but not necessarily L1 caches).
Yes. A good example is languages that clear structures before use: the compiler emits a memset of 10 bytes, possibly across a page boundary, and the very next instruction may use the structure by storing a value to its last byte. The CPU's OoO read/write hazard tracking has to be intimately involved, which rules out an external DMA unit. The same is true of memcpy/memmove.
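To make the clear-then-store pattern concrete, here is a minimal C sketch. The `record` type, its 10-byte layout, and both function names are made up for illustration; nothing here forces the object to straddle a page boundary, since that depends on where the allocator or stack happens to place it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 10-byte record; name and layout are illustrative only. */
struct record {
    unsigned char bytes[10];
};

/* Clear the record, then immediately overwrite its last byte.
 * The store to bytes[9] overlaps the region the memset just wrote,
 * so it has to be ordered after the memset to that same byte. */
static void init_record(struct record *r, unsigned char tag)
{
    assert(r != NULL);
    memset(r, 0, sizeof *r);
    r->bytes[sizeof r->bytes - 1] = tag;
}

/* Returns 1 if every byte is zero except the tagged last byte. */
static int record_is_valid(const struct record *r, unsigned char tag)
{
    for (size_t i = 0; i + 1 < sizeof r->bytes; i++) {
        if (r->bytes[i] != 0)
            return 0;
    }
    return r->bytes[sizeof r->bytes - 1] == tag;
}
```

Calling `init_record(&r, 0x5A)` on a record full of garbage should leave nine zero bytes followed by 0x5A. The point is that the dependent store follows the memset by one instruction, so whatever performs the memset has to sit inside the core's store ordering.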
> The reason you want memset/memmove/memcpy instructions is because
>
> (a) the CPU memory unit already has buffers with byte masking and shifting built in
>
> (b) you should never expose the micro-architectural details of what exactly is the
> buffer size for said masking and shifting, and how many buffers you have etc etc.
>
> (c) you should absolutely not have to bring in the data to the register
> file, because you may be able to keep the data further away
>
> so anybody who says "just use vector instructions" is also wrong.
>
> No, the answer is not some DMA unit, because you'd just be screwing up caches with
> those, or duplicating your existing hardware. The latency of talking to an outside
> unit is higher than the cost of just doing the operation in 90% of all cases.
>
> And no, the answer is not vector units, because you'll just waste an incredible amount of effort and energy
> on trying to deal with the impedance issues of the visible instruction set and architectural state, and the
Rename registers do not affect visible architectural state.
Small copy counts will not trigger the vector unit's thermal clock-throttling state.
The load/store unit will take care of the byte shifting, so you just need a rename that runs through the bypasses, only touching the rename vector register file as a side effect.
Big copy counts will induce clock slowdowns from the heat of transferring so much data, which the vector unit will wrongly get blamed for.
> low-level details of your memory unit, and you'll not be able to do clever things inside the caches.
>
> Linus