By: rwessel (rwessel.delete@this.yahoo.com), October 3, 2021 5:06 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on October 3, 2021 1:51 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 2, 2021 10:43 am wrote:
> > rpg (a.delete@this.b.com) on October 2, 2021 2:51 am wrote:
> > >
> > > Why handle memcpy with microcoded instructions/cracked uOPs?
> > >
> > > Wouldn't a simple DMA unit be able to handle this?
> >
> > DMA units are stupid.
> >
> > Seriously. Stop perpetuating that myth from the 80s.
> >
> > Back in the days long long gone, DMA units made sense because
> >
> > (a) CPUs were often slower than DRAM
> >
> > (b) caches weren't a thing
> >
> > and neither of those have been true for decades by now outside of some very very embedded stuff where the
> > CPU isn't even remotely the main concern of the hardware (ie there are places where people have a very weak
> > CPU that just handles some bookkeeping functionality, and
> > the real heavy lifting is done by specialized hardware
> > - very much including DMA engines built into those things. Think networking or media processors).
> >
> > Also, stop thinking that memory copies are about moving big amounts of data. That's very seldom
> > actually true outside of some broken memory throughput benchmarks. The most common thing by
> > far is moving small stuff that is a few tens of bytes in size, often isn't cacheline aligned,
> > and is quite often somewhere in the cache hierarchy (but not necessarily L1 caches).
> >
> > The reason you want memset/memmove/memcpy instructions is because
> >
> > (a) the CPU memory unit already has buffers with byte masking and shifting built in
> >
> > (b) you should never expose the micro-architectural details of what exactly is the
> > buffer size for said masking and shifting, and how many buffers you have etc etc.
> >
> > (c) you should absolutely not have to bring in the data to the register
> > file, because you may be able to keep the data further away
> >
> > so anybody who says "just use vector instructions" is also wrong.
> >
> > No, the answer is not some DMA unit, because you'd just be screwing up caches with
> > those, or duplicating your existing hardware. The latency of talking to an outside
> > unit is higher than the cost of just doing the operation in 90% of all cases.
> >
> > And no, the answer is not vector units, because you'll just waste an incredible amount of effort and energy
> > on trying to deal with the impedance issues of the visible instruction set and architectural state, and the
> > low-level details of your memory unit, and you'll not be able to do clever things inside the caches.
> >
> > Linus
>
> I agree with everything except the last paragraph.
> The last paragraph is a demonstration of blind hatred.
> The VU *is* the best technical answer for fixed-width memory copy/set in the range from
> 9B to ~500B, maybe up to ~1000B. It just needs a few proper instructions (like load/store
> register pair with 1B granularity of source/destination) and ubiquity.
Only if the vector unit doesn't need to be powered up, and the cost of saving its state can be amortized over other work. In a lot of kernel code, the VU isn't available, period.
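
To make that concrete (my own sketch, not anything from the thread): the kind of short, byte-granular copy Michael has in mind looks roughly like this with plain SSE2, and in Linux kernel code you'd have to bracket it with kernel_fpu_begin()/kernel_fpu_end(), which is exactly the state save/restore cost that has to be amortized.

/* Rough sketch of a 16..32-byte copy using two overlapping unaligned
 * 16-byte vector loads/stores (buffers assumed non-overlapping, n in
 * [16,32]).  In the kernel, the vector registers may only be touched
 * between kernel_fpu_begin() and kernel_fpu_end(). */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

static void copy_16_to_32(void *dst, const void *src, size_t n)
{
    /* First 16 bytes and last 16 bytes; the two stores overlap in the
     * middle, which is harmless for a memcpy-style (non-overlapping) copy. */
    __m128i lo = _mm_loadu_si128((const __m128i *)src);
    __m128i hi = _mm_loadu_si128((const __m128i *)((const char *)src + n - 16));
    _mm_storeu_si128((__m128i *)dst, lo);
    _mm_storeu_si128((__m128i *)((char *)dst + n - 16), hi);
}

Two loads and two stores, no length-dependent branching. But the moment the vector state isn't already live (kernel code, or a core with the unit powered down), the surrounding save/restore or power-up swamps a copy this small.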