By: Charlie Burnes (charlie.burnes.delete@this.no-spam.com), May 22, 2022 6:30 am
Room: Moderated Discussions
>> The main thing I’m struggling with is I can’t write my code in a way that is SIMD width agnostic.
> I'm curious why that is not possible?
It’s not possible because the SIMD width is mapped to a rectangle of values, and a very expensive computation has to be done to compute some constants that correspond to each value in that rectangle. This computation only has to be done once for a particular SIMD width, but the width has to be known in advance because it determines the size of the rectangle. Also, the computation is done off-line, in different software than is used for the rest of the problem. It has to use arbitrary precision arithmetic, and I need to test the results to make sure I implemented it correctly before I use them in the other software. If I really wanted to make a SIMD width agnostic version, I would have to test the expensive computation on rectangles of many different sizes, corresponding to many different SIMD widths, to be sure it always works. That would be too much work. It doesn’t make sense to make a whole project out of supporting arbitrary SIMD widths because that is a small part of the overall problem.
> Given that you plan to use NEON, would SVE ever come into play?
I don’t need to worry about SVE or SVE2 unless Apple includes them in the M2 and removes NEON. I doubt Apple will remove NEON because that would break existing applications. I don’t need to support RISC-V Vector extensions.
> Where does the 512 number come from?
My first target is Intel CPUs with AVX-512. I’m looking for a way to write code for AVX-512 and have some way to make the code work on a chip with AVX2 but no AVX-512. That way, I can get good performance on CPUs with AVX-512 and the code will still be usable, but slower, on CPUs that do not have AVX-512. The code would be too slow to be usable if I used a scalar version on CPUs with only AVX2. It is OK if the AVX2 version is 2x to 3x slower than the AVX-512 version, but a scalar version that is 16x slower is not acceptable.
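To make the fallback requirement concrete, here is a minimal sketch of the selection logic, assuming 32-bit lanes (which is where the 16x figure comes from: 512 / 32 = 16). The names SimdLevel, CpuFeatures, ChooseLevel, LanesF32 and NativeOpsPer512 are hypothetical and invented for this illustration; they are not part of Highway or any other library. Real code would fill in CpuFeatures from CPUID or a compiler facility rather than by hand.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration only; not a real library API.
enum class SimdLevel { kScalar, kAvx2, kAvx512 };

// Assumption: the caller fills this in from CPUID or a similar mechanism.
struct CpuFeatures {
  bool avx2 = false;
  bool avx512f = false;
};

// The code is written against one fixed logical width: 512 bits = 16 floats.
constexpr std::size_t kLogicalLanes = 16;

// Pick the widest level the CPU supports.
inline SimdLevel ChooseLevel(const CpuFeatures& cpu) {
  if (cpu.avx512f) return SimdLevel::kAvx512;
  if (cpu.avx2) return SimdLevel::kAvx2;
  return SimdLevel::kScalar;
}

// Number of 32-bit float lanes in one native operation at each level.
inline std::size_t LanesF32(SimdLevel level) {
  switch (level) {
    case SimdLevel::kAvx512: return 16;
    case SimdLevel::kAvx2:   return 8;
    case SimdLevel::kScalar: return 1;
  }
  return 1;
}

// Native operations needed to cover one 512-bit logical vector.
inline std::size_t NativeOpsPer512(SimdLevel level) {
  const std::size_t lanes = LanesF32(level);
  assert(kLogicalLanes % lanes == 0);
  return kLogicalLanes / lanes;
}
```

Under these assumptions an AVX2-only CPU needs two native operations per logical vector and a scalar build needs sixteen. That only counts instructions; it says nothing about actual throughput.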
> But perhaps you only care about older HW with fixed-size vectors
I understand that my code will not be able to get any additional performance from AVX-1024 if that becomes widely available at some distant point in the future. I don’t think there is anything practical I can do about that.
> it would certainly be feasible to define a Vec256x2 and Vec128x4 class that implements 512 bits using
> two/four vectors. You could copy/adapt the Highway implementations of Vec256 and Vec128, or simply
> build on top of them and define the operations you want using two/four calls to the Highway ops.
That sounds like a great idea. Thank you! I saw the HWY_EMU128 target mentioned in the Highway Implementation Details document but it sounds like you are referring to something different here. I didn’t see Vec128 or Vec256 mentioned anywhere in the Highway docs I have read so far. I did a Google search for Vec256 site:github.com/google/highway and I got no matching documents.
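As I understand the suggestion, the shape would be something like the sketch below: a 512-bit logical vector stored as two 256-bit halves, with each operation forwarded to both halves. Everything here is hypothetical. F32x8 is a plain-array stand-in for a native 256-bit vector (in real code it would be an AVX2 register or the corresponding Highway vector type), and F32x8x2, Load16, Store16, Add and Mul are names I made up for the illustration, not Highway ops.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a native 256-bit vector of 8 floats.
struct F32x8 {
  std::array<float, 8> lane;
};

inline F32x8 Load8(const float* p) {
  assert(p != nullptr);
  F32x8 r;
  for (std::size_t i = 0; i < 8; ++i) r.lane[i] = p[i];
  return r;
}

inline void Store8(const F32x8& v, float* p) {
  assert(p != nullptr);
  for (std::size_t i = 0; i < 8; ++i) p[i] = v.lane[i];
}

inline F32x8 Add(const F32x8& a, const F32x8& b) {
  F32x8 r;
  for (std::size_t i = 0; i < 8; ++i) r.lane[i] = a.lane[i] + b.lane[i];
  return r;
}

inline F32x8 Mul(const F32x8& a, const F32x8& b) {
  F32x8 r;
  for (std::size_t i = 0; i < 8; ++i) r.lane[i] = a.lane[i] * b.lane[i];
  return r;
}

// 512 bits emulated as two 256-bit halves: lo holds lanes 0..7, hi holds 8..15.
struct F32x8x2 {
  F32x8 lo;
  F32x8 hi;
};

// Each 512-bit operation is two calls to the 256-bit operation.
inline F32x8x2 Load16(const float* p) { return {Load8(p), Load8(p + 8)}; }

inline void Store16(const F32x8x2& v, float* p) {
  Store8(v.lo, p);
  Store8(v.hi, p + 8);
}

inline F32x8x2 Add(const F32x8x2& a, const F32x8x2& b) {
  return {Add(a.lo, b.lo), Add(a.hi, b.hi)};
}

inline F32x8x2 Mul(const F32x8x2& a, const F32x8x2& b) {
  return {Mul(a.lo, b.lo), Mul(a.hi, b.hi)};
}
```

Lane-wise operations like these split cleanly. Anything that moves data across lanes (shuffles, reductions) would need extra handling at the boundary between the two halves, which this sketch does not attempt.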