By: dmcq (dmcq.delete@this.fano.co.uk), October 3, 2021 1:54 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 3, 2021 10:48 am wrote:
> Brendan (btrotter.delete@this.gmail.com) on October 2, 2021 4:53 pm wrote:
> >
> > The DMA that Linus seems to be thinking of (e.g. a dedicated "Intel 8237" chip sitting on a
> > distant legacy/ISA bus) mostly got superseded/replaced by bus mastering functionality built
> > directly into almost every device;
>
> No, I'm very much thinking of modern DMA - on-die and cache coherent.
>
> Because even when it is close-by and cache coherent, it is just stupid and wrong for CPU memory copies.
>
> The memory accesses need to be done by the memory unit itself, in order to not
> screw up data that is in the local L1 cache - which is not at all uncommon.
>
> In fact, you want it even closer than the L1D$ - you want it to interact with the actual
> store buffers. You want the memory copies to be visible and part of the OoO machinery,
> so that you can do the re-ordering, do the store buffer snooping, etc. They should use
> all the byte lane shifting and masking logic that is very much already there.
>
> I call it a "CPU memory copy logic", and claim that it should be driven by CPU instructions, and that is should
> become part of the uop stream and schedule with the instructions around it. So it needs to be right there
> in the very core of the CPU, by the same memory unit that handles all the other loads and stores.
>
> Nobody sane calls that "DMA" unless you just want to make some mental gymnastics.
>
> Any actual DMA engine that is not directly tied to the CPU memory unit is much too far away, and cannot
> be sanely ordered wrt all the other memory ops around the memcpy that will be right there.
>
> A very common memory copy pattern is to literally copy a data structure, and then modify
> the result. We're talking tens - maybe hundreds - of bytes, that can be done with one or
> two byte shift/mask buffer movements that should be scheduled exactly like any random read
> or write. Except it isn't a fixed-size read, and it doesn't go through a register file.
>
> Linus
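To make that copy-then-modify pattern concrete (a minimal sketch of my own, not code from Linus or the kernel): the compiler usually inlines the memcpy here, and the handful of stores it emits want to schedule and forward against the write that follows, exactly like ordinary stores.

#include <string.h>

struct request {
    int  id;
    int  flags;
    char payload[56];    /* tens of bytes in total, as described above */
};

/* Copy an existing structure, then immediately modify the copy.
 * The inlined memcpy is just a few wide loads/stores that should
 * sit in the normal OoO/store-buffer machinery, not in a distant
 * DMA engine. */
struct request clone_and_retag(const struct request *src, int new_id)
{
    struct request dst;
    memcpy(&dst, src, sizeof dst);   /* copy the data structure ... */
    dst.id = new_id;                 /* ... then modify the result  */
    return dst;
}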
I think if a property is to be applied to the data, it should probably be marked as streaming or non-temporal if it isn't already in the L1D cache. After that it might either go into L1D but be marked as such, or be sent straight to L2, depending on the implementation - but as Linus points out, it would very likely be counterproductive to try to avoid putting it into L2 at all.
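Software can already express roughly that hint today with explicit non-temporal stores - a minimal sketch below using x86 SSE2 intrinsics (my example, not a proposal for any new instructions). Whether the lines end up in L1D marked as streaming or go straight to L2 is then up to the implementation.

#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_load_si128, _mm_sfence */
#include <stddef.h>

/* Hypothetical non-temporal copy: ordinary loads from src, streaming
 * stores to dst so the copied lines are not pulled into L1D.  Assumes
 * both pointers are 16-byte aligned and n is a multiple of 16. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
    _mm_sfence();   /* order the streaming stores before later stores */
}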