By: rwessel (rwessel.delete@this.yahoo.com), September 29, 2021 11:22 pm
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on September 29, 2021 10:02 pm wrote:
> rwessel (rwessel.delete@this.yahoo.com) on September 29, 2021 11:30 am wrote:
> > Doug S (foo.delete@this.bar.bar) on September 29, 2021 10:10 am wrote:
> > > rwessel (rwessel.delete@this.yahoo.com) on September 29, 2021 6:55 am wrote:
> > > > NoSpammer (no.delete@this.spam.com) on September 29, 2021 3:53 am wrote:
> > > > > dmcq (dmcq.delete@this.fano.co.uk) on September 28, 2021 2:21 pm wrote:
> > > > > > The bits could vary between implementations so letting designers optimise better. More conditions
> > > > > > could be saved. The only problem I can see is big-little systems and they could just zero the bits
> > > > > > if moving between different cores - with the current system how can we tell if the form optimised
> > > > > > by one is okay for the other? And yes it seems like an unnecessary waste of opcodes and code space.
> > > > >
> > > > > I think 3 instructions make very simple implementations possible for the low-end. Per example:
> > > > > Initial instruction is movsb until aligned or end.
> > > > > Middle instruction is movs[your core's biggest R/W chunk]
> > > > > End instruction is movsb until end.
> > > > >
> > > > > Why have more state when the state can already be in the registers and PC?
> > > >
> > > >
> > > > Remember that in the general case, you can't get both operands aligned, so you can't avoid
> > > > dealing with at least some of that in the "middle" instruction. But the point is, how
> > > > hard could it be to detect the case where the initial or final instruction apply?
> > >
> > >
> > > Why would you need to deal with any alignment related issues in the middle instruction? If
> > > the area of memory you are operating upon is large enough you have a middle sequence where
> > > you can load and store on aligned boundaries at your largest chunk size (and there may even
> > > be some clever ways to massively speed this up with some sort of cache aliasing trickery)
> > >
> > > In small operations the middle instruction may be a no-op, and only the first and/or last instruction actually
> > > do something. I can't see any case where you would need to worry about alignment in the middle instruction
> > > - that's the whole point of splitting it up this way! Can you provide an example of where you think the
> > > first or third instruction would be unable to guarantee the middle instruction perfect alignment?
> >
> >
> > You can trivially align one of the two operands for the middle instruction, but not necessarily both.
> >
> > Consider memcpy(123, 345, 100);
> >
> > So you could have the first instruction move five bytes, and that leaves you with memcpy(128,
> > 350, 95). The second operand remains unaligned. Or move seven bytes with the first instruction,
> > which leaves you with memcpy(130, 352, 93), and an unaligned first operand.
> >
> > While knowing that one operand is aligned may well be of value to the middle instruction, it's going
> > to have to deal with the possibility of the other operand being unaligned. Assuming you'd align the
> > first operand, it would still need to do fetch/shift/merge on each word of the second operand.
>
>
> I thought you were suggesting one has to deal with alignment issues in both, since
> you were implying there's no point to having the first/third instructions.
>
> You can't simultaneously eliminate alignment issues on both loads and stores but it really only matters
> that you eliminate alignment issues on stores. That's where things get ugly. Alignment is not too much
> of a problem on loads since the shift/combine can be moved outside the critical timing path - by either
> doing it when you prefetch from memory (these instructions are a dream for a prefetcher, zero prediction
> required!) or if already in cache doing it when it is moved from one cache location to another via passing
> through a hidden register that handles all the possible shift/combine scenarios.
Certainly. But I still don't see the point of the separate setup and finalize instructions - detecting those conditions is trivial: if the destination address has any low bits set, do a "first"; if you've fallen out of the "middle" loop and the length is not zero, do a "last". Internalizing those checks would probably also make it easier to sneak up on page boundaries, at least for simpler implementations.
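To make the "detection is trivial" argument concrete, here is a rough C sketch of what a single copy instruction might do internally. Everything in it is an assumption for illustration: the name copy_one_insn, the 8-byte CHUNK, and the copy_stats counters are hypothetical, and memcpy of one word merely stands in for whatever fetch/shift/merge hardware would handle the unaligned source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed "biggest R/W chunk" of the core; 8 bytes is just an example. */
#define CHUNK sizeof(uint64_t)

/* Bytes handled by each phase; kept only so the phases can be observed. */
struct copy_stats { size_t first, middle, last; };

/* Hypothetical single-instruction copy: phase selection happens inside. */
static struct copy_stats copy_one_insn(unsigned char *dst,
                                       const unsigned char *src,
                                       size_t len)
{
    struct copy_stats st = {0, 0, 0};

    /* "first": destination has low address bits set, so move bytes
       until it is aligned or the length runs out. */
    while (len != 0 && ((uintptr_t)dst & (CHUNK - 1)) != 0) {
        *dst++ = *src++;
        len--;
        st.first++;
    }

    /* "middle": destination is now aligned, source may not be.
       The word-sized memcpy pair stands in for load/shift/merge + store. */
    while (len >= CHUNK) {
        uint64_t w;
        memcpy(&w, src, CHUNK);   /* possibly unaligned load */
        memcpy(dst, &w, CHUNK);   /* aligned store */
        dst += CHUNK;
        src += CHUNK;
        len -= CHUNK;
        st.middle += CHUNK;
    }

    /* "last": fell out of the middle loop with a nonzero length. */
    while (len != 0) {
        *dst++ = *src++;
        len--;
        st.last++;
    }

    assert(len == 0);
    return st;
}
```

With an 8-byte chunk, the memcpy(123, 345, 100) case from earlier in the thread splits into 5 bytes of "first", 88 bytes of "middle" and 7 bytes of "last", with the source left unaligned throughout the middle phase.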