By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), October 2, 2021 10:43 am
Room: Moderated Discussions
rpg (a.delete@this.b.com) on October 2, 2021 2:51 am wrote:
>
> Why handle memcpy with microcoded instructions/cracked uOPs?
>
> Wouldn't a simple DMA unit be able to handle this?
DMA units are stupid.
Seriously. Stop perpetuating that myth from the 80s.
Back in the days long long gone, DMA units made sense because
(a) CPUs were often slower than DRAM
(b) caches weren't a thing
and neither of those has been true for decades now outside of some very very embedded stuff where the CPU isn't even remotely the main concern of the hardware (i.e. there are places where people have a very weak CPU that just handles some bookkeeping functionality, and the real heavy lifting is done by specialized hardware - very much including DMA engines built into those things. Think networking or media processors).
Also, stop thinking that memory copies are about moving big amounts of data. That's very seldom actually true outside of some broken memory throughput benchmarks. The most common thing by far is moving small stuff that is a few tens of bytes in size, often isn't cacheline aligned, and is quite often somewhere in the cache hierarchy (but not necessarily L1 caches).
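To make that concrete, here is a minimal sketch (all names and the struct layout are invented for illustration) of the kind of copy that dominates in practice: a few tens of bytes, at a runtime-dependent and often unaligned offset, with the data typically already somewhere in the cache hierarchy.

#include <string.h>
#include <stdint.h>
#include <stddef.h>

struct pkt_hdr {                 /* ~40 bytes, hypothetical layout */
    uint8_t  dst[6], src[6];
    uint16_t proto;
    uint32_t seq, ack;
    uint8_t  opts[20];
};

static void queue_header(uint8_t *ring, size_t off, const struct pkt_hdr *h)
{
    /* Neither the size class nor the alignment of ring + off is known
     * statically, and the source is almost certainly cache-hot. This is
     * nothing like a streaming-throughput benchmark. */
    memcpy(ring + off, h, sizeof(*h));
}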
The reason you want memset/memmove/memcpy instructions is because
(a) the CPU memory unit already has buffers with byte masking and shifting built in
(b) you should never expose the micro-architectural details of exactly what the buffer size is for said masking and shifting, how many buffers you have, etc etc.
(c) you should absolutely not have to bring the data into the register file, because you may be able to keep the data further away
so anybody who says "just use vector instructions" is also wrong.
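For a concrete existing example of this kind of instruction, x86 already has REP MOVSB: the architectural contract is just "copy RCX bytes from RSI to RDI", and the core's microcode and memory unit are free to expand that into whatever wide, masked internal operations the hardware actually has, without any of it being architecturally visible. A minimal sketch (GNU C inline asm; feature checks such as ERMS/FSRM and any error handling are omitted):

#include <stddef.h>

static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    /* rep movsb: dst in RDI, src in RSI, count in RCX; all three are
     * updated by the instruction, hence the "+" read-write constraints. */
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)
                 :
                 : "memory");
    return ret;
}

The point is exactly the one made above: the caller names the operation, and the implementation details (buffer widths, masking, how the copy interacts with the caches) stay inside the memory unit where they belong.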
No, the answer is not some DMA unit, because you'd just be screwing up caches with those, or duplicating your existing hardware. The latency of talking to an outside unit is higher than the cost of just doing the operation in 90% of all cases.
And no, the answer is not vector units, because you'll just waste an incredible amount of effort and energy trying to deal with the impedance mismatch between the visible instruction set and architectural state on one side and the low-level details of your memory unit on the other, and you'll not be able to do clever things inside the caches.
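For contrast, this is roughly what the "just use vector instructions" approach looks like (a sketch using SSE2 intrinsics, with tail and overlap handling simplified): every byte makes a round trip through architectural vector registers, the software has to deal with alignment and tails itself, and nothing clever can happen inside the cache hierarchy.

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

static void copy_vec(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        /* Data is pulled all the way into xmm registers and pushed back
         * out again, exposing the register width as the copy granule. */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), v);
    }
    for (; i < n; i++)      /* byte-at-a-time tail */
        dst[i] = src[i];
}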
Linus