By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), October 3, 2021 9:48 am
Room: Moderated Discussions
Brendan (btrotter.delete@this.gmail.com) on October 2, 2021 4:53 pm wrote:
>
> The DMA that Linus seems to be thinking of (e.g. a dedicated "Intel 8237" chip sitting on a
> distant legacy/ISA bus) mostly got superseded/replaced by bus mastering functionality built
> directly into almost every device;
No, I'm very much thinking of modern DMA - on-die and cache coherent.
Because even when it is close-by and cache coherent, it is just stupid and wrong for CPU memory copies.
The memory accesses need to be done by the memory unit itself, in order to not screw up data that is in the local L1 cache - which is not at all uncommon.
In fact, you want it even closer than the L1D$ - you want it to interact with the actual store buffers. You want the memory copies to be visible and part of the OoO machinery, so that you can do the re-ordering, do the store buffer snooping, etc. They should use all the byte lane shifting and masking logic that is very much already there.
I call it a "CPU memory copy logic", and claim that it should be driven by CPU instructions, and that it should become part of the uop stream and schedule with the instructions around it. So it needs to be right there in the very core of the CPU, by the same memory unit that handles all the other loads and stores.
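(For what it's worth, the closest existing analogue to an in-core, instruction-driven copy is probably x86's "rep movsb", which is executed by the core's own memory unit and which modern implementations already optimize for small copies. A rough sketch of a wrapper around it, purely for illustration; the fallback path is a plain byte loop:)

```c
#include <stddef.h>

/* Sketch only: "rep movsb" as an example of a memory copy driven by
 * a CPU instruction and executed by the core's own memory unit,
 * rather than by an external DMA engine. The wrapper name and the
 * fallback are made up for this example. */
static void cpu_copy(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     : /* no pure inputs */
                     : "memory");
#else
    /* Fallback: plain byte-at-a-time copy. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
}
```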
Nobody sane calls that "DMA" unless you want to engage in some mental gymnastics.
Any actual DMA engine that is not directly tied to the CPU memory unit is much too far away, and cannot be sanely ordered wrt all the other memory ops around the memcpy that will be right there.
A very common memory copy pattern is to literally copy a data structure, and then modify the result. We're talking tens - maybe hundreds - of bytes, that can be done with one or two byte shift/mask buffer movements that should be scheduled exactly like any random read or write. Except it isn't a fixed-size read, and it doesn't go through a register file.
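(The pattern in question looks something like this in C; the structure and field names here are made up purely to illustrate the copy-then-modify sequence:)

```c
#include <string.h>

/* Hypothetical ~40-byte structure, invented for illustration. */
struct request {
    unsigned long id;
    unsigned int  flags;
    unsigned int  len;
    char          tag[24];
};

/* Copy a structure, then immediately modify a couple of fields in
 * the result -- tens of bytes, and the stores that follow the copy
 * need to be ordered against it by the core's own memory unit. */
static struct request clone_request(const struct request *orig,
                                    unsigned long new_id)
{
    struct request copy;

    memcpy(&copy, orig, sizeof(copy)); /* the small, common-case copy */
    copy.id = new_id;                  /* ...modified right away */
    copy.flags |= 1u;
    return copy;
}
```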
Linus