By: Kevin G (kevin.delete@this.cubitdesigns.com), September 27, 2021 9:46 am
Room: Moderated Discussions
Andrey (andrey.semashev.delete@this.gmail.com) on September 25, 2021 4:46 am wrote:
> -.- (blarg.delete@this.mailinator.com) on September 24, 2021 6:10 pm wrote:
> > Andrey (andrey.semashev.delete@this.gmail.com) on September 24, 2021 8:29 am wrote:
> > > Should SVE prohibit 136-bit
> > > vectors? No, because your cache line and page sizes may be a multiple of 17. In such an implementation,
> > > 136-bit vectors would be a reasonable choice. But in the power-of-2 world we live in, 136-bit vectors
> > > are nonsensical (possibly, aside from some very special purpose hardware).
> >
> > Well, from what I can gather, it sounds like you're suggesting
> > that non power-of-2 hardware may, in fact, eventuate.
>
> Not in mass market, I don't think. Some specialized controllers, perhaps, though I have
> no idea what work loads would require a non-power-of-2 design, and SVE to boot.
Three- and six-element vectors are relatively common in 3D work. Early in the history of SIMD, when it made sense to have a CPU code path for this, code would simply round up to the next power-of-2 vector size and live with the 25% inefficiency (three 32-bit floats running on a 128-bit SIMD unit, etc.). Nowadays such bulk work is done on GPUs, where vector elements are decomposed and that 25% inefficiency is recovered.
I can also imagine some DSP work where non-power-of-2 data sets could be leveraged.
The use case for SVE2 would be a system without any dedicated hardware to handle these workloads. Given how inexpensive decent dedicated hardware has become, it is difficult to imagine SVE2 being leveraged for this niche scenario.
> > > By default - nothing. I mean, so long as you're targeting an unknown abstract implementation,
> > > you may as well forget about alignment, instruction choices and scheduling and other micro-optimizations
> > > and just write the unaligned loop that maintains correctness.
> >
> > Which is what I've been trying to portray the whole time really. The way
> > SVE is designed encourages developers to not bother with aligning at all.
>
> Again, you're talking about the spec, and I'm talking about the actual hardware. If all or the absolute
> majority of implementations don't care about alignment then that's great and developers don't need
> to optimize for it. But we don't have that many SVE implementations, and for other kind of instructions
> real hardware does care about alignment, so I'm going do assume SVE won't be an exception. Hardware
> designers will have to prove that alignment doesn't matter, they haven't done that yet.
>
> > > Yes, heterogeneous cores are a pain, and it looks like x86 will follow suit in the near future.
> >
> > So far, it's Intel only, and I feel the core types on Alderlake aren't wildly different enough to be
> > as important (to use your example of MOVDQU/PSHUFB). Of course, this could change in the future.
>
> Rumor has it that AMD is also working on a hybrid CPU, possibly in Zen 5. I
> think, eventually hybrid designs will settle in x86 desktops and servers.
>
> > On the other hand, ARM typically includes both in-order and OoO cores together.
> > Whilst I've never really found a need to specifically target cores yet, my
> > point was that it's generally more difficult to do in the ARM ecosystem.
> >
> > Considering the added difficulty of targeting specific
> > processors on ARM, I get the feeling that support for
> > such isn't a priority. I don't really think that's a bad idea - the vast majority of programs aren't going
> > to optimise to such an extent - but it does feed into the idea of encouraging generic implementations.
>
> ARM was not really present in the HPC domain, certainly not as long as x86 was. Traditionally, because of low
> CPU performance, heavy lifting tasks like video encoding was done with specialized hardware in the ARM world,
> while it was mostly in software in the x86 world. This is starting to change as ARM performance grows and it's
> starting to appear on desktops and servers, so there will be more incentive to optimize for it.
>
> > > Side note about hybrid designs where different cores are radically different, like x86+ARM that AMD did.
> > > That sort of combination is a somewhat different story.
> > > The cores in such a hybrid are inherently incompatible
> >
> > I've never heard of such a configuration. The closest thing
> > I've heard of is their K12 core being designed alongside
> > Zen, which was planned to be socket compatible, but that doesn't mean you can run the two together.
> > Am I missing something?
>
> I thought I've seen somewhere news about a hybrid x86+ARM core from AMD, but
> I can't find that source now. Hmm, maybe I'm misremembering this, sorry.
It isn't a hybrid design, but AMD does embed an ARM core (the Platform Security Processor) in all of their recent designs to handle security.
https://www.extremetech.com/computing/292722-amds-secure-processor-firmware-is-now-explorable-thanks-to-new-tool
The Zen 1 and K12 (ARM) cores were going to share bits of the SoC design. Parts of the CPU cores were also going to be common, to cut validation time and shorten time to market. These aspects were generally outside of the ISA-specific functionality in the cores: branch prediction algorithms, power management, etc. Of course, AMD never shipped a commercial K12 core.
> AMD SkyBridge (https://www.extremetech.com/computing/181867-amds-project-skybridge-new-arm-and-x86-chips-that-are-pin-compatible)
> though was in the works. I imagine, in multi-socket systems you could use both x86 and ARM cores
> together. The project is dead now, so we probably will never know.
The project was to have a common platform between ARM and x86 so that a single motherboard could be used with both processor types. This isn't the first time this has happened: there were a couple of early Athlon MP boards with processor slots that could be swapped with an Alpha processor card after a firmware swap, as those chips shared the same FSB topology. AMD was not aiming to have both x86 and ARM active simultaneously in the same system for user space applications.
Leveraging two processor architectures simultaneously in a system is feasible as long as one architecture is the primary and the other is a defined coprocessor; that is how compute is delegated to GPU hardware today. The challenge is having two primary architectures share a common memory addressing scheme so that a single user space application can run concurrently on both ISAs. That means code structures are replicated, and how to swap between them has to be defined by the operating system. This highlights the really hard part of the problem: the software side is just as tricky as getting the hardware right. Certain aspects could be done without much challenge, like keeping user space programs pinned to a single architecture while system calls can leverage either architecture based upon need/demand.
IBM also had the PowerPC 615 project, which was to switch dynamically between PowerPC and x86 modes on the fly. It reportedly reached silicon, and some chips were shipped in sample volumes for contractual reasons (contrary to the link below), but it remains nearly mythical.
https://www.theregister.com/1998/10/01/microsoft_killed_the_powerpc/