By: Brendan (btrotter.delete@this.gmail.com), October 2, 2021 4:53 pm
Room: Moderated Discussions
Hi,
--- (---.delete@this.redheron.com) on October 2, 2021 3:03 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 2, 2021 10:43 am wrote:
> > rpg (a.delete@this.b.com) on October 2, 2021 2:51 am wrote:
> > >
> > > Why handle memcpy with microcoded instructions/cracked uOPs?
> > >
> > > Wouldn't a simple DMA unit be able to handle this?
> >
> > DMA units are stupid.
> >
> > Seriously. Stop perpetuating that myth from the 80s.
> >
> > Back in the days long long gone, DMA units made sense because
> >
> > (a) CPU's were often slower than DRAM
> >
> > (b) caches weren't a thing
> >
> > and neither of those have been true for decades by now outside of some very very embedded stuff where the
> > CPU isn't even remotely the main concern of the hardware (ie there are places where people have a very weak
> > CPU that just handles some bookkeeping functionality, and
> > the real heavy lifting is done by specialized hardware
> > - very much including DMA engines built into those things. Think networking or media processors).
> >
> > Also, stop thinking that memory copies are about moving big amounts of data. That's very seldom
> > actually true outside of some broken memory throughput benchmarks. The most common thing by
> > far is moving small stuff that is a few tens of bytes in size, often isn't cacheline aligned,
> > and is quite often somewhere in the cache hierarchy (but not necessarily L1 caches).
> >
> > The reason you want memset/memmove/memcpy instructions is because
> >
> > (a) the CPU memory unit already has buffers with byte masking and shifting built in
> >
> > (b) you should never expose the micro-architectural details of what exactly is the
> > buffer size for said masking and shifting, and how many buffers you have etc etc.
> >
> > (c) you should absolutely not have to bring in the data to the register
> > file, because you may be able to keep the data further away
> >
> > so anybody who says "just use vector instructions" is also wrong.
> >
> > No, the answer is not some DMA unit, because you'd just be screwing up caches with
> > those, or duplicating your existing hardware. The latency of talking to an outside
> > unit is higher than the cost of just doing the operation in 90% of all cases.
> >
> > And no, the answer is not vector units, because you'll just waste an incredible amount of effort and energy
> > on trying to deal with the impedance issues of the visible instruction set and architectural state, and the
> > low-level details of your memory unit, and you'll not be able to do clever things inside the caches.
> >
> > Linus
>
> Some DMA is stupid. It doesn't have to be. Functional DMA is possible.
>
> Likewise cache issues are only a problem if you engineer your system in a certain way.
> Imagine that, for example, your memory had, sitting before it (essentially as an extension
> of the memory controller) a *memory-side* cache, call it, I don't know, a "system level cache".
> And imagine that that cache contained tags describing the contents of every lower cache in
> the system (ie tags covering the caches of P-cluster, E-cluster, GPU, NPU, etc).
> Then that cache can act as a central point for coherence, covering all transactions,
> even those from devices that are nominally non-coherent...
>
> An additional point (which you didn't discuss, but which Apple also covers, and which is
> also important for ultimate performance) is that Apple's DMA is able to place data directly
> into a cache (the patents suggest any cache level) rather than purely into DRAM.
>
> I agree with your primary point, that obsessing over super-bulk transfers is not the first thing to worry
> about. But you are incorrect that DMA has to be crippled, either in capabilities or because of coherence.
> Of course certain types of business models and eco-systems will or will not encourage the importance
> of DMA over doing it all in either the CPU or some other "large" piece of HW like a NIC.
>
> As far as I can tell, Apple has serious-level TCP-offload and accelerators
> in their DMA -- and by being in the DMA they apply to all network transactions,
> not just WiFi or cellular or wherever the NIC smarts are placed.
> My *guess* (only a guess) is that the reason for the massive jump in Apple's "hours of WiFi video
> streaming number" on the A15 iPhones, ranging from 1.8x to 2x last year's number; so much larger
> than all the other (also impressive) numbers; is because some final step in this particular networking
> scenario has been delegated to DMA offload, and so the entire operation (not just video decode,
> but also network) is being run by dedicated HW with no CPU wakeup on every packet.
The DMA that Linus seems to be thinking of (e.g. a dedicated "Intel 8237" chip sitting on a distant legacy/ISA bus) was mostly superseded by bus mastering built directly into almost every device; and the "smart DMA" you suspect Apple is using has been relatively common/standard practice for about 15 years (since Gigabit Ethernet, or earlier).
For "memcpy()", the main problem with DMA (and the ubiquitous bus mastering it evolved into) is the semantics - freeing the CPU from doing mundane data movement by doing it in dedicated logic in parallel (and raising an IRQ on completion) doesn't make as much sense when you're stuck with a procedural interface - it becomes "do nothing while waiting for completion" (which becomes "save power and/or improve the performance of the other logical processor/s sharing the core" without much of the benefits of parallelism).
I'd assume the best place for a "dedicated data movement engine" is directly inside each core (possibly one per core, shared by all logical processors in that core), where it can talk directly to the L1 cache, access TLBs, make use of store buffers, influence hardware prefetchers, etc.; and where it can be stopped quickly (e.g. update the register file ASAP when it's cancelled/interrupted by an IRQ or an exception/page fault).
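For what it's worth, x86's "rep movsb" already exposes that kind of stoppable state at the architectural level: RSI/RDI/RCX are updated as the copy progresses, so an interrupt or page fault in the middle simply suspends the instruction, and it resumes where it left off once the fault is handled. A minimal sketch (GCC/Clang inline asm, x86-64):

    /* The "+D"/"+S"/"+c" constraints mean RDI/RSI/RCX are both read and
     * written - the registers reflect how far the copy has progressed,
     * which is exactly what lets the operation be interrupted and resumed. */
    static inline void copy_rep_movsb(void *dest, const void *src, size_t n)
    {
        asm volatile("rep movsb"
                     : "+D"(dest), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }

A per-core data movement engine would presumably want the same property: its progress lives in (or is written back to) architectural state, so it can be cancelled or faulted at any point without losing work.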
Note that (as far as I know) for 80x86 (e.g. "rep movsb") the big problem is that micro-code can't use branch prediction, so you end up with a series of tests ("is src aligned? is dest aligned? is the count big enough?") that cause stalls and high start-up costs; all of that could be replaced by plain AND/OR/NOT gates and resolved almost instantly, with almost zero start-up cost.
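As a software analogue of those start-up tests (a sketch only - real microcoded and library memcpy() implementations differ in the details), the dispatch looks something like this, and every one of those "if" tests is a data-dependent branch that micro-code can't predict away:

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified illustration of the start-up dispatch described above;
     * small copies pay for all the tests without reaching the bulk path. */
    void *branchy_memcpy(void *dest, const void *src, size_t n)
    {
        unsigned char *d = dest;
        const unsigned char *s = src;

        if (n < 64) {                               /* "is size big enough?" */
            while (n--) *d++ = *s++;
            return dest;
        }
        while (((uintptr_t)d & 7) != 0) {           /* "is dest aligned?" */
            *d++ = *s++;
            n--;
        }
        if (((uintptr_t)s & 7) == 0) {              /* "is src aligned?" */
            while (n >= 8) {                        /* aligned bulk copy */
                *(uint64_t *)d = *(const uint64_t *)s;
                d += 8; s += 8; n -= 8;
            }
        }
        while (n--) *d++ = *s++;                    /* tail / misaligned source */
        return dest;
    }

In fixed-function logic the size and alignment checks are just a handful of gates evaluated in parallel in the first cycle, which is why the start-up cost can drop to nearly nothing.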
In other words, a "dedicated data movement engine" could/should probably be implemented in each core by shutting down the normal front-end and using fixed-function logic that feeds existing micro-ops directly into the existing pipeline.
- Brendan