By: Michael S (already5chosen.delete@this.yahoo.com), October 3, 2021 5:24 am
Room: Moderated Discussions
rwessel (rwessel.delete@this.yahoo.com) on October 3, 2021 5:06 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on October 3, 2021 1:51 am wrote:
> > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 2, 2021 10:43 am wrote:
> > > rpg (a.delete@this.b.com) on October 2, 2021 2:51 am wrote:
> > > >
> > > > Why handle memcpy with microcoded instructions/cracked uOPs?
> > > >
> > > > Wouldn't a simple DMA unit be able to handle this?
> > >
> > > DMA units are stupid.
> > >
> > > Seriously. Stop perpetuating that myth from the 80s.
> > >
> > > Back in the days long long gone, DMA units made sense because
> > >
> > > (a) CPU's were often slower than DRAM
> > >
> > > (b) caches weren't a thing
> > >
> > > and neither of those have been true for decades by now outside of some very very embedded stuff where the
> > > CPU isn't even remotely the main concern of the hardware (ie there are places where people have a very weak
> > > CPU that just handles some bookkeeping functionality, and
> > > the real heavy lifting is done by specialized hardware
> > > - very much including DMA engines built into those things. Think networking or media processors).
> > >
> > > Also, stop thinking that memory copies are about moving big amounts of data. That's very seldom
> > > actually true outside of some broken memory throughput benchmarks. The most common thing by
> > > far is moving small stuff that is a few tens of bytes in size, often isn't cacheline aligned,
> > > and is quite often somewhere in the cache hierarchy (but not necessarily L1 caches).
> > >
> > > The reason you want memset/memmove/memcpy instructions is because
> > >
> > > (a) the CPU memory unit already has buffers with byte masking and shifting built in
> > >
> > > (b) you should never expose the micro-architectural details of what exactly is the
> > > buffer size for said masking and shifting, and how many buffers you have etc etc.
> > >
> > > (c) you should absolutely not have to bring in the data to the register
> > > file, because you may be able to keep the data further away
> > >
> > > so anybody who says "just use vector instructions" is also wrong.
> > >
> > > No, the answer is not some DMA unit, because you'd just be screwing up caches with
> > > those, or duplicating your existing hardware. The latency of talking to an outside
> > > unit is higher than the cost of just doing the operation in 90% of all cases.
> > >
> > > And no, the answer is not vector units, because you'll just waste an incredible amount of effort and energy
> > > on trying to deal with the impedance issues of the visible instruction set and architectural state, and the
> > > low-level details of your memory unit, and you'll not be able to do clever things inside the caches.
> > >
> > > Linus
> >
> > I agree to everything except last paragraph.
> > The last paragraph is a demonstration of blind hatred.
> > VU *is* the best technical answer to fixed-width memory copy/set in range from
> > 9B to ~500B. May be, up to ~1000B. It just needs few proper instructions (like load/store
> > register pair with 1B granularity of source/destination) and ubiquity.
>
>
> Only if the vector unit doesn't need to be powered up,
True. But on newer CPUs it's less of an issue.
> and the save of its state can
> be amortized over other work.
If people use the VU for that sort of copying, then the problem of amortization will be solved :-)
Besides, I don't believe that in the last couple of decades the problem was ever real.
IMHO, the whole lazy-saving business should have been deprecated (in favor of eager saving) as early as 2006, if not before.
Roughly at the time when 16KB and smaller L1D caches went out of fashion.
> In a lot of kernel code, the VU isn't available period.
Frankly, I don't care about what's going on in the kernel.
In my performance-critical usage scenarios, kernel performance almost never matters.
My personal feeling is that, apart from the FLIH (first-level interrupt handler), the kernel could be better (simpler and more robust, if not faster) with the same eager state-saving policy that I advocate for user mode. But, obviously, I have never done any kernel profiling.
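
To make the "fixed-width copy in the 9B to ~500B range" idea concrete, here is a minimal sketch in portable C. It is not anybody's production memcpy; the names copy_small and chunk16 are made up for illustration, and the 64B upper bound is chosen only to keep the example short. It covers every length in a size class with a fixed number of wide loads and stores whose ends overlap, so there is no byte-by-byte tail loop. The assumption (toolchain-dependent, not guaranteed) is that the compiler lowers each fixed-size memcpy to a single load or store, using vector registers for the 16-byte ones.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 16-byte vector register. */
typedef struct { unsigned char b[16]; } chunk16;

/*
 * Copy n bytes for 9 <= n <= 64 using overlapping fixed-width moves.
 * All loads are done before any store, so overlapping src/dst is fine.
 * Returns 0 on success, -1 if n is outside the supported range.
 */
int copy_small(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n < 9 || n > 64)
        return -1;

    if (n <= 16) {
        /* Two 8-byte moves: first 8 bytes and last 8 bytes. */
        uint64_t head, tail;
        memcpy(&head, s, 8);
        memcpy(&tail, s + n - 8, 8);
        memcpy(d, &head, 8);
        memcpy(d + n - 8, &tail, 8);
        return 0;
    }

    if (n <= 32) {
        /* Two 16-byte moves: first 16 bytes and last 16 bytes. */
        chunk16 head, tail;
        memcpy(&head, s, 16);
        memcpy(&tail, s + n - 16, 16);
        memcpy(d, &head, 16);
        memcpy(d + n - 16, &tail, 16);
        return 0;
    }

    /* 33..64 bytes: first 32 bytes and last 32 bytes, as four 16-byte moves. */
    chunk16 a, b, c, e;
    memcpy(&a, s, 16);
    memcpy(&b, s + 16, 16);
    memcpy(&c, s + n - 32, 16);
    memcpy(&e, s + n - 16, 16);
    memcpy(d, &a, 16);
    memcpy(d + 16, &b, 16);
    memcpy(d + n - 32, &c, 16);
    memcpy(d + n - 16, &e, 16);
    return 0;
}
```

The overlap trick is what stands in here for the "1B granularity" of source and destination: the start of the second move depends on n, so it generally lands on an arbitrary byte boundary. That is why cheap unaligned vector loads and stores matter for this approach.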