By: Michael S (already5chosen.delete@this.yahoo.com), October 3, 2021 1:51 am
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 2, 2021 10:43 am wrote:
> rpg (a.delete@this.b.com) on October 2, 2021 2:51 am wrote:
> >
> > Why handle memcpy with microcoded instructions/cracked uOPs?
> >
> > Wouldn't a simple DMA unit be able to handle this?
>
> DMA units are stupid.
>
> Seriously. Stop perpetuating that myth from the 80s.
>
> Back in the days long long gone, DMA units made sense because
>
> (a) CPU's were often slower than DRAM
>
> (b) caches weren't a thing
>
> and neither of those have been true for decades by now outside of some very very embedded stuff where the
> CPU isn't even remotely the main concern of the hardware (ie there are places where people have a very weak
> CPU that just handles some bookkeeping functionality, and the real heavy lifting is done by specialized hardware
> - very much including DMA engines built into those things. Think networking or media processors).
>
> Also, stop thinking that memory copies are about moving big amounts of data. That's very seldom
> actually true outside of some broken memory throughput benchmarks. The most common thing by
> far is moving small stuff that is a few tens of bytes in size, often isn't cacheline aligned,
> and is quite often somewhere in the cache hierarchy (but not necessarily L1 caches).
>
> The reason you want memset/memmove/memcpy instructions is because
>
> (a) the CPU memory unit already has buffers with byte masking and shifting built in
>
> (b) you should never expose the micro-architectural details of what exactly is the
> buffer size for said masking and shifting, and how many buffers you have etc etc.
>
> (c) you should absolutely not have to bring in the data to the register
> file, because you may be able to keep the data further away
>
> so anybody who says "just use vector instructions" is also wrong.
>
> No, the answer is not some DMA unit, because you'd just be screwing up caches with
> those, or duplicating your existing hardware. The latency of talking to an outside
> unit is higher than the cost of just doing the operation in 90% of all cases.
>
> And no, the answer is not vector units, because you'll just waste an incredible amount of effort and energy
> on trying to deal with the impedance issues of the visible instruction set and architectural state, and the
> low-level details of your memory unit, and you'll not be able to do clever things inside the caches.
>
> Linus
I agree with everything except the last paragraph.
The last paragraph is a demonstration of blind hatred.
The VU *is* the best technical answer to fixed-width memory copy/set in the range from 9B to ~500B, maybe up to ~1000B. It just needs a few proper instructions (like load/store of a register pair with 1B granularity of source/destination) and ubiquity.
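To make the small-size case concrete, here is a rough sketch of the usual trick for a copy whose length is known to fall within one size class: load the first and the last block, which may overlap, then store both. The function names and the 16-byte `block16` type are my own illustration, not anything from an existing library, and I am assuming the compiler lowers each fixed-size `memcpy` to a single unaligned load or store (mainstream compilers typically do this when optimizing, but it is not guaranteed).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte block; stands in for one vector register. */
typedef struct { unsigned char b[16]; } block16;

/* Copy n bytes, 9 <= n <= 16, with two possibly overlapping 8-byte moves. */
void copy_9_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;
    /* Both loads happen before any store, so overlapping buffers are safe. */
    memcpy(&head, src, 8);
    memcpy(&tail, (const unsigned char *)src + n - 8, 8);
    memcpy(dst, &head, 8);
    memcpy((unsigned char *)dst + n - 8, &tail, 8);
}

/* Copy n bytes, 17 <= n <= 32, with two possibly overlapping 16-byte moves. */
void copy_17_to_32(void *dst, const void *src, size_t n)
{
    block16 head, tail;
    memcpy(&head, src, 16);
    memcpy(&tail, (const unsigned char *)src + n - 16, 16);
    memcpy(dst, &head, 16);
    memcpy((unsigned char *)dst + n - 16, &tail, 16);
}
```

Neither function branches on the exact length inside its size class, and neither needs alignment. What it cannot do is avoid pulling the data through the register file, which is the objection in point (c) of the quoted post.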