By: Jörn Engel (joern.delete@this.purestorage.com), May 22, 2022 1:13 pm
Room: Moderated Discussions
Charlie Burnes (charlie.burnes.delete@this.no-spam.com) on May 21, 2022 10:11 pm wrote:
>
> If the choice is your way or the Highway, what is your way?
static inline u64 hash8_64b(const void *ibuf)
{
	u64 ret = 0;

#pragma GCC unroll 4
	for (int i = 64 - VECTORLEN; i >= 0; i -= VECTORLEN) {
		ret <<= VECTORLEN;
		ret |= hash8_core(ibuf + i);
	}
	return ret;
}
In this case I want to chew through 64 bytes' worth of input and generate a u64 result. The loop iterates 1x, 2x or 4x depending on vector width. The core looks something like this:
u16v m = u16v_set1(0x9e37);
u16v acc0 = read16v(ibuf + 0);
acc0 *= m;
acc0 = shuffle_u16v(acc0, swab16_mask);
acc0 ^= read16v(ibuf + 2);
acc0 *= m;
...
u16v is a vector's worth of u16 members. For simple arithmetic you continue using C operators. For more complex stuff you need functions similar to Intel intrinsics. But unlike the intrinsics, I have distinct integer types for u8v, u16v, s32v, etc. Therefore I have more wrapper functions, spread across the various types, than there are intrinsics. But I also get more type safety. And in general my code is easier to read, which also makes it easier to spot bugs in it or to reuse code written last year.
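To give a flavor of it, here is a minimal sketch of what such typed wrappers could look like on top of GCC's vector extensions. The names follow the snippet above, but the definitions are an assumption for illustration (SSE width), not my actual implementation:

#include <stdint.h>
#include <string.h>

typedef uint16_t u16;
/* 128-bit vector of eight u16 lanes, via GCC vector extensions */
typedef u16 u16v __attribute__((vector_size(16)));

static inline u16v u16v_set1(u16 x)
{
	return (u16v){x, x, x, x, x, x, x, x};
}

static inline u16v read16v(const void *p)
{
	u16v v;

	memcpy(&v, p, sizeof(v));	/* unaligned load; compiles to movups/vmovdqu */
	return v;
}

With types like these, acc0 *= m and acc0 ^= ... are plain C operators applied lane-wise, and the distinct u8v/u16v/s32v types give the compiler a chance to catch accidental mixing.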
Then you compile the code multiple times for the different targets and have a runtime dispatcher. That currently involves some ugly boilerplate code; I'm still looking for a solution that doesn't annoy me. Useful targets imo are SSE4.2, AVX2, Skylake and Icelake. Icelake introduces the byte variants of compress/expand, which can be a big deal, and pre-Icelake versions of AVX512 are still common enough to matter. Pre-SSE4.2 is probably in a museum by now. Pre-AVX2 is getting pretty rare, so maybe you don't need that anymore either.
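One possible shape of that dispatch boilerplate, sketched with GCC's __builtin_cpu_supports(); the per-target function names here are made up for the example, and GCC's target_clones attribute is another way to get a similar effect:

#include <stdint.h>

typedef uint64_t u64;

/* one copy of the kernel per target, e.g. built with -msse4.2 / -mavx2 / ... */
u64 hash8_64b_sse42(const void *ibuf);
u64 hash8_64b_avx2(const void *ibuf);

static u64 (*hash8_64b_ptr)(const void *ibuf);

u64 hash8_64b_dispatch(const void *ibuf)
{
	if (!hash8_64b_ptr) {
		if (__builtin_cpu_supports("avx2"))
			hash8_64b_ptr = hash8_64b_avx2;
		else
			hash8_64b_ptr = hash8_64b_sse42;
	}
	return hash8_64b_ptr(ibuf);
}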
Apart from the C/C++ preference, one distinction from Highway is that I always use the unaligned IO instructions. I cannot measure a performance difference between aligned and unaligned IO _instructions_. There is a performance difference between aligned and unaligned _data_. So really the decision is whether you prefer runtime faults for unaligned data or lower performance.
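For reference, the difference at the intrinsic level looks like this (SSE shown; generic illustration with made-up helper names, not my actual wrappers):

#include <emmintrin.h>

/* aligned load: movdqa/vmovdqa, faults if p is not 16-byte aligned */
static inline __m128i load_aligned(const void *p)
{
	return _mm_load_si128((const __m128i *)p);
}

/* unaligned load: movdqu/vmovdqu, accepts any p; same speed when the data happens to be aligned */
static inline __m128i load_unaligned(const void *p)
{
	return _mm_loadu_si128((const __m128i *)p);
}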
There are a few more stylistic things like CamelCase vs. under_scores in names, tabs vs. two spaces, etc. As we all know, those details matter much more than the actual functionality. Jehova! ;)