By: Marcus (m.delete@this.bitsnbites.eu), August 13, 2022 7:54 am
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on August 5, 2022 12:09 pm wrote:
> Doug S (foo.delete@this.bar.bar) on August 5, 2022 11:18 am wrote:
> > dmcq (dmcq.delete@this.fano.co.uk) on August 5, 2022 5:13 am wrote:
> > > Linus Torvalds (torvalds.delete@this.linux-foundation.org) on August 4, 2022 11:30 am wrote:
> > > > Wilco (wilco.dijkstra.delete@this.ntlworld.com) on August 4, 2022 4:44 am wrote:
> > > > >
> > > > > 64-bit integer constants that are actually 64 bits are exceedingly rare, so loading them is a
> > > > > perfectly good strategy
> > > >
> > > > They really aren't. They can be common, and they can be performance-critical,
> > > > and using a separate D$ access for them is a mistake.
> > > >
> > > > Not on all loads, no, but on real loads.
> > > >
> > > > It's a mistake that arm64 didn't even do, so I don't understand why you even argue for
> > > > it, since usually you are trying to claim that arm did everything perfectly right.
> > > >
> > > > Big constants aren't super-common, but they do happen, and depending on the code,
> > > > they can happen quite regularly, and they can be quite performance critical.
> > > >
> > > > I didn't mention that "divide by a constant" on a whim. If you turn a divide by a constant into
> > > > a reciprocal multiply (which is generally going to be exactly that full 64-bit constant that
> > > > you claim never happens), you really don't want to cause a random D$ miss just to get the constant.
> > > > Because if you do, you're better off just using the divider in the first place.
> > > >
> > > > (And I should have used a small constant to divide by in my Godbolt example, just
> > > > to show how even a common constant like "divide by 10" ends up generating a large
> > > > 64-bit constant in the actual code stream. Maybe some people didn't realize)
> > > >
> > > > There are other cases. Multiplying by a large constant is not just for doing reciprocal
> > > > division, it's also one of the fastest and most common ways to generate reasonable non-cryptographic
> > > > hashes. And non-cryptographic hashes are not exactly unusual - it's what you want for
> > > > hash table lookups. So we're not talking some odd-ball thing here.
> > > >
> > > > Again, this is often performance-critical code, and very much not some kind of "rare" situation.
> > > > And that constant is - intentionally - again using a lot of bits spread randomly around, because
> > > > it would be entirely against the whole point to multiply with some regular mask value.
> > > >
> > > > And once again, it's true that you can always load things from memory, but this is
> > > > code that takes cache misses anyway and most definitely does not want to take another
> > > > one. Putting the constant in the instruction stream is simply the best option.
> > > >
> > > > And also, it's true that some loads might never do any of this. Maybe you don't have a
> > > > single divide-by-a-constant. Maybe you just don't use hash tables. Maybe you don't do a
> > > > PRNG, or any number of other things that might be performance critical in other code.
> > > >
> > > > If your argument is that other architectures have done this wrong, then yes, that's
> > > > clearly true. There are bad architectures out there. That's not an argument.
> > > >
> > > > If your argument is that completely different kinds of large constants are addresses, and need to
> > > > be in memory anyway because you have relocation issues etc, and you may need to change them and
> > > > you don't want to change the text stream, then yes, that is also true, and also entirely irrelevant.
> > > > Addresses are 64-bit entities, but they aren't actually compile-time constants, and handling addresses
> > > > by a combination of relative offsets and relocation tables is entirely immaterial.
> > > >
> > > > Again, there are architectures that have gotten this horribly wrong, and didn't even have RIP
> > > > addressing, and then people who are used to that brain-damage think that this is somehow relevant
> > > > to "large constants", but that's just an effect of a horrible architectural design mistake.
> > > >
> > > > Anyway, immediates do actually matter. The full 64-bit immediate case is obviously less common and
> > > > less critical than smaller sizes are (and to be fair, some architectures get even the small immediate
> > > > case wrong, to the point of not having reasonable pointer offset arithmetic in their memory pipeline),
> > > > but full 64-bit immediates aren't quite as unimportant as you try to make them be.
> > > >
> > > > And you should be happy: arm64 doesn't get this horribly wrong. Others definitely do.
> > > >
> > > > Linus
> > >
> > > I can't quite agree with all that. Putting large constants in-line slows down the instruction fetch
> > > and the constants can be fetched at the same time. They should normally all be bunched together and
> > > prefetched together so there is not a big problem about random lines being loaded any more than there
> > > is for instructions. And if it does have to be loaded as in your example about the divide then we're
> > > probably talking about straight through code rather than something performance critical.
> >
> >
> > You're rarely bound by instruction fetch bandwidth, so why is this a concern? The potential loss of
> > a few issue slots across a few cycles (and only true in cases where you would be able to fill those
> > slots otherwise) is much less of a problem than potentially creating a 50+ cycle pipeline bubble if
> > you have to wait for a load from a separate constant table that goes all the way to main memory.
>
> Maybe we're rarely bound because we don't have loads of instructions with large constants!
Isn't that exactly why you're likely to take a D$ miss for large constants that go via the data stream instead of the instruction stream? Caches like temporal and spatial locality, and you're simply not getting either for this use case.
> Or because
> an instruction cache is used to avoid all the problems with variable instruction length.
>
> Interestingly The Mill https://millcomputing.com/ (no there isn't any news!)
> has instructions split in two to try and improve the instruction fetch speed!
> Can't say I'm totally convinced by that so I don't expect you to be either :-)