By: --- (---.delete@this.redheron.com), July 9, 2022 3:07 pm
Room: Moderated Discussions
Here's a crazy idea, if that sort of thing appeals to you...
Suppose you want to execute "vector" type code as aggressively as possible.
One way to do this is via packed SIMD. The disadvantages are that you're limited to a fixed register length (though that can be worked around to some extent if you start wide but implement narrow...), and so you do the whole MMX to SSE to AVX to AVX-512 business.
A second option is SVE. That gives you a little more future growth flexibility (though ARM has probably taken the flexibility way beyond what makes any sense) and some degree of simplification of head+tail loop cleanup, even beyond the predicates provided by AVX-512.
But there is a third option, as described in POWER's recent Scalable Vector proposal
https://libre-soc.org/openpower/sv/
which you could argue is in the spirit of "modern" micro-architecture (ie screw concepts like RISC vs CISC; what matters is optimal conveyance of the *goal* from the compiler to the CPU, with throwing large amounts of transistors at the problem not really an issue). Thus, to simplify dramatically: rather than adding thousands of new instructions à la AVX-512 or SVE, SVP64 adds just a few instructions (and a few more registers) that carefully describe certain types of loops. The idea is that the CPU will translate the *entire loop* into some sort of optimal execution (which could be very wide, if you want to add, eg, 512 bits worth of ALU as your execution unit).
Now why is this interesting? Well, firstly, in a certain sense it's even more scalable than SVE, *and* it plays well to a company whose skill is in micro-architecture.
But even more so, it looks very similar to a design from about fifteen years ago (evolved over quite a few years) by Jeff Gonion at Apple! The design is spread over many patents, but a representative one is
https://patents.google.com/patent/US8412914B2
(Note that the last patent I can find in the series is from 2015. So, to judge from the patents, Gonion worked on this as his entire focus, with Apple happy to support him doing so, for at least 7 years.
Did they just abandon the idea, or subsume it all into SVE?)
So here's a crazy idea: Apple's equivalent of ARMv9 will be essentially a macroscalar or SVP64 type processor!?!
Apple can add just a few instructions and registers (possibly not even public at first, like AMX) to define the HW loops, and can internally implement SVE in terms of this underlying mechanism.
If we believe this (big if!), it would explain why they've not especially cared about being first with SVE – my guess is they will support it out of convenience, but their real endgame, the way they will stay ahead of ARM even once ARMv9 and SVE become standard everywhere, is the additional flexibility inherent in this sort of design. (Flexibility both to grow wider with subsequent cores, and to "vectorize" even more loops(?), at lower power(?), than SVE with its flexible vector lengths+predicates.)
Well, just a few months till the A16 when more is, hopefully, revealed. At which point I'll look like either a visionary or a fool! (Though of course I can always claim there's really a sekrit macroscalar engine inside the CPU handling some of the SVE sequencing and, just you wait, next year or the year after that, it will become visible...)