General Register State
AArch32 was not particularly regular, and one of the biggest complications was the relationship between the registers and exception modes. AArch32 includes 13 general registers (R0-12), the Program Counter (R15) and 2 banked registers that contain the Stack Pointer (R13) and Link Register (R14). The user and system modes share these 16 registers and a Program Status Register (PSR). The fast interrupt (FIQ) mode shares R0-7 and the PC, but has its own private R8-14 and Saved PSR. All other exception modes have their own private banked registers and Saved PSRs. This complicated register banking was one of the techniques originally used to reduce exception latency, which made ARM particularly suitable for embedded controllers. However, it has the drawback of requiring more than 40 registers, of which fewer than half can be used simultaneously, a clear problem from the standpoint of power and area efficiency.
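The banking scheme described above can be sketched as a lookup from (mode, architectural register) to a physical register. This is a minimal model, not a description of any implementation: the names `BANKED` and `physical_register` are hypothetical, only the five classic exception modes are modeled (Monitor and Hyp are left out), and the status registers are not counted, so the total it produces is lower than the article's figure.

```python
# Hypothetical model of AArch32 general register banking.
# Assumption: only the classic modes are modeled; Monitor and Hyp modes
# and all status registers (CPSR, SPSRs) are deliberately omitted.
BANKED = {
    "usr": set(),              # User and System share one set of registers
    "sys": set(),
    "fiq": set(range(8, 15)),  # FIQ has private R8-R14
    "irq": {13, 14},           # the others have a private SP and LR
    "svc": {13, 14},
    "abt": {13, 14},
    "und": {13, 14},
}


def physical_register(mode, n):
    """Name the physical register behind architectural Rn in a given mode."""
    if not 0 <= n <= 15:
        raise ValueError("AArch32 exposes architectural registers R0-R15")
    if n in BANKED[mode]:
        return f"R{n}_{mode}"
    return f"R{n}"


# Every distinct physical general register across the modeled modes.
all_physical = {physical_register(m, n) for m in BANKED for n in range(16)}

# The registers software can actually name while running in one mode.
visible_at_once = {physical_register("irq", n) for n in range(16)}

print(len(all_physical), len(visible_at_once))
```

In this reduced model only 16 of the physical general registers are reachable in any one mode, which is the inefficiency the paragraph points to.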
As happened with x86, ARM took the move to 64 bits as an opportunity to extend, expand and simplify the architectural registers. Naturally, the new GPRs are all 64 bits wide to handle larger addresses. 32-bit accesses use the lower half of a register: reads ignore the upper half and writes zero it out. There are more GPRs, and the banking is reduced to 4 different levels. There are 30 GPRs (X0-29), a Procedure Link Register (X30), and X31 acts as a hardwired zero register for most instructions. Unlike A32, the PC is a special named register that can only be used for explicit control flow instructions and certain addressing modes. Additionally, each of the 4 privilege levels has 3 private banked registers: the Exception Link Register, Stack Pointer and Saved PSR. The AArch32 registers map onto the lower half of the AArch64 registers, which enables running AArch32 on top of AArch64.
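The treatment of 32-bit accesses can be illustrated with a small sketch. It models a 64-bit register as a Python integer; `read_w` and `write_w` are hypothetical helper names, not architectural terms. A 32-bit read ignores the upper half, and a 32-bit write zeroes it.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def read_w(x_value):
    """32-bit read of a 64-bit register: the upper half is ignored."""
    return x_value & MASK32


def write_w(x_old, w_value):
    """32-bit write to a 64-bit register: the upper half is zeroed.

    x_old is accepted only to make the point that none of the previous
    contents survive the write.
    """
    return w_value & MASK32


x0 = 0xDEADBEEF_00000001
assert read_w(x0) == 0x00000001

x0 = write_w(x0, 0x12345678)
assert x0 == 0x00000000_12345678  # old upper half is gone
```

Zeroing rather than preserving the upper half means a 32-bit write never depends on the register's previous contents, which avoids a false dependency in out-of-order designs.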
Vector Register State
As with most popular architectures, ARMv7 has scalar floating point (VFP) and vector extensions with integer and floating point data (NEON, also known as Advanced SIMD). In ARMv7, these two extensions share a single register file. Both VFP and SIMD are carried over to AArch64, along with the shared register file. However, the two extensions are a standard part of ARMv8, whereas they were optional in some ARMv7 implementations.
Previously, there were 32 vector registers, each 64 bits wide. Pairs of adjacent registers were aliased to provide 16 virtual 128-bit registers for the SIMD instructions. Leaving no stone unturned, ARM’s architects took the opportunity to tweak this arrangement. In ARMv8, all 32 vector registers (V0-V31) are extended to 128 bits, doubling the capacity. Instead of using pairs of smaller registers to form larger virtual registers, the lower half of each 128-bit register aliases to the existing 64-bit register with the same number. As with the GPRs, partial accesses will either ignore or zero out the upper half of a vector register.
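The two aliasing schemes can be compared in a short sketch. The function names are hypothetical, and the register file is modeled with plain integers rather than any real hardware structure.

```python
MASK64 = (1 << 64) - 1


def armv7_q_halves(q):
    """ARMv7: 128-bit register Qn is built from the pair D(2n), D(2n+1)."""
    if not 0 <= q <= 15:
        raise ValueError("ARMv7 has 16 virtual 128-bit registers")
    return 2 * q, 2 * q + 1


def armv8_d_view(v_value):
    """ARMv8 (AArch64): the 64-bit view of Vn is the low half of Vn itself."""
    return v_value & MASK64


# Total architectural vector state in bits.
ARMV7_BITS = 32 * 64    # 32 registers of 64 bits
ARMV8_BITS = 32 * 128   # 32 registers of 128 bits

assert armv7_q_halves(3) == (6, 7)
assert armv8_d_view((0xAAAA << 64) | 0x1234) == 0x1234
assert ARMV8_BITS == 2 * ARMV7_BITS
```

Under the old scheme a write to Q3 also changes D6 and D7; under the new one each narrow register overlaps exactly one wide register, which simplifies dependency tracking.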
ARMv8 Instruction Set Changes
As is to be expected, the most substantive changes in the A64 ISA are the memory model and related instructions. The A64 instruction set used in AArch64 is largely similar to the existing ISA, but without various idiosyncrasies that are problematic for modern microprocessors. As with A32, instructions are fixed length, using 32 bits to specify as many as 3 operands.
Unlike ARMv7 though, there is currently no support for a 64-bit version of Thumb to improve instruction density. One challenge to a potential T64 instruction set is that larger addresses and branch offsets will crowd the instruction format. A64 already decreased the offset range for conditional branches to +/-1MB (from +/-32MB in ARMv7), and further reductions would be rather painful.
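The branch ranges quoted above follow from the width of the offset field. As a worked check, the sketch below assumes a signed 19-bit word offset for the A64 conditional branch and a signed 24-bit word offset for the A32 branch; those field widths come from the instruction encodings rather than from this article, and `branch_range_bytes` is a hypothetical helper.

```python
def branch_range_bytes(imm_bits, scale=4):
    """Byte range reachable by a signed, word-scaled branch offset field."""
    lowest = -(1 << (imm_bits - 1)) * scale
    highest = ((1 << (imm_bits - 1)) - 1) * scale
    return lowest, highest


MB = 1 << 20

a64_cond = branch_range_bytes(19)  # assumed A64 conditional branch field
a32_cond = branch_range_bytes(24)  # assumed A32 branch field

assert a64_cond[0] == -1 * MB
assert a32_cond[0] == -32 * MB
print(a64_cond, a32_cond)
```

Each bit removed from the offset field halves the reachable range, which is why a denser encoding with even fewer offset bits would be painful.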
Nearly all instructions in A32 were conditional, using predication. However, predication uses bits in the instruction encoding that are already at a premium due to doubling the number of registers in AArch64. Moreover, predication complicates out-of-order execution since it adds an extra input that must be renamed. In A64, the only conditional instructions are branches, comparisons and selects. While this will slightly increase code size, the simplification is worthwhile.
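To show how a compare followed by a conditional select can stand in for predicated instructions, here is a minimal sketch. It models the four condition flags of a 64-bit compare and only four condition codes; the names `cmp`, `csel`, `CONDITIONS` and `signed_max` are hypothetical, chosen to resemble the kind of instructions discussed, and this is not a full model of the ISA.

```python
MASK64 = (1 << 64) - 1


def to_signed(x):
    """Interpret a 64-bit pattern as a two's complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def cmp(a, b):
    """Return the (N, Z, C, V) flags produced by computing a - b."""
    a &= MASK64
    b &= MASK64
    result = (a - b) & MASK64
    n = result >> 63
    z = int(result == 0)
    c = int(a >= b)  # carry set means no unsigned borrow
    diff = to_signed(a) - to_signed(b)
    v = int(not -(1 << 63) <= diff < (1 << 63))  # signed overflow
    return n, z, c, v


# Only a subset of condition codes is modeled here.
CONDITIONS = {
    "eq": lambda n, z, c, v: z == 1,
    "ne": lambda n, z, c, v: z == 0,
    "ge": lambda n, z, c, v: n == v,
    "lt": lambda n, z, c, v: n != v,
}


def csel(flags, cond, a, b):
    """Conditional select: pick a if the condition holds, otherwise b."""
    return a if CONDITIONS[cond](*flags) else b


def signed_max(a, b):
    """Branch-free maximum built from one compare and one select."""
    return csel(cmp(a, b), "ge", a, b)


assert signed_max(3, 7) == 7
assert signed_max(-1 & MASK64, 2) == 2
```

The select always executes and always writes its destination, so the flags are just one more ordinary input rather than something that decides whether the instruction runs at all.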
Similarly, A32 had an in-line shifter that could be used for ‘free’ with nearly every integer instruction. However, shifts are notoriously difficult to implement at high frequency due to the complicated wiring. An implicit shift in every instruction will increase the length of pipeline stages and reduce the overall frequency. A64 instructions can apply only a very limited shift to a source operand, and there are new instructions to handle more complicated cases such as variable shifts.
Most of the changes to VFP in A64 are relatively minor. VFP is intended for single precision and double precision scalar computation. There are new instructions to satisfy the IEEE 754-2008 standard, particularly for calculating the min and max of two numbers. Floating point comparisons now set the integer condition flags, rather than the flags in the FP Status Register. There are also new instructions for converting between the various FP formats, and to and from the new 64-bit integers.
Advanced SIMD is the vector counterpart to VFP and has been more aggressively enhanced. In particular, the vector instructions in ARMv7 could operate on integer, single precision and, more rarely, polynomial data. In A64, the vector elements also include double precision floating point and have full IEEE support with the required rounding modes, and handling of denormals and NaNs.
Similar to SSE or AVX, the advanced SIMD instructions operate on vectors whose element count depends on the size of the register and the data type. The underlying registers are either 64-bit or 128-bit and can pack from 1 to 16 elements. The integer data types are mostly unchanged, spanning from a single byte to 64 bits. Floating point data can be stored in half precision, but operations are all single precision or double precision.
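The 1 to 16 element range follows directly from dividing the register width by the element width. A small sketch, with the hypothetical helper `lane_count`:

```python
def lane_count(register_bits, element_bits):
    """Number of elements that fit in a vector register of the given width."""
    if register_bits % element_bits != 0:
        raise ValueError("element size must divide the register size")
    return register_bits // element_bits


# Register widths and integer element widths named in the text.
table = {
    (reg, elem): lane_count(reg, elem)
    for reg in (64, 128)
    for elem in (8, 16, 32, 64)
}

assert min(table.values()) == 1    # one 64-bit element in a 64-bit register
assert max(table.values()) == 16   # sixteen bytes in a 128-bit register
```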
With the move to 128-bit registers, A64 includes new instructions for inserting and extracting vector elements. There are also three cross-lane instructions for vector reductions, specifically summing and taking the minimum or maximum value.
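The element insert, extract and cross-lane reduction operations can be sketched on a 128-bit register modeled as a Python integer. All function names are hypothetical, lanes are treated as unsigned, and the sum is assumed to wrap at the element width; none of this is meant as an exact model of a specific instruction.

```python
def unpack(value, element_bits, register_bits=128):
    """Split a register value into a list of unsigned lanes, lane 0 first."""
    mask = (1 << element_bits) - 1
    count = register_bits // element_bits
    return [(value >> (i * element_bits)) & mask for i in range(count)]


def pack(lanes, element_bits):
    """Combine a list of lanes into one register value, lane 0 lowest."""
    mask = (1 << element_bits) - 1
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & mask) << (i * element_bits)
    return value


def insert_lane(value, index, element, element_bits):
    """Replace a single lane and leave the others untouched."""
    mask = (1 << element_bits) - 1
    shift = index * element_bits
    return (value & ~(mask << shift)) | ((element & mask) << shift)


def extract_lane(value, index, element_bits):
    """Read a single lane out of the register."""
    return (value >> (index * element_bits)) & ((1 << element_bits) - 1)


def reduce_sum(value, element_bits):
    """Cross-lane sum; assumed here to wrap at the element width."""
    return sum(unpack(value, element_bits)) & ((1 << element_bits) - 1)


def reduce_min(value, element_bits):
    """Cross-lane unsigned minimum."""
    return min(unpack(value, element_bits))


def reduce_max(value, element_bits):
    """Cross-lane unsigned maximum."""
    return max(unpack(value, element_bits))


v = pack([1, 2, 3, 4], 32)
assert reduce_sum(v, 32) == 10
assert reduce_min(v, 32) == 1
assert reduce_max(v, 32) == 4
assert extract_lane(insert_lane(v, 2, 9, 32), 2, 32) == 9
```

Unlike ordinary SIMD arithmetic, which works lane by lane, a reduction needs data from every lane to produce a single scalar, which is why such operations are called out separately.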
Several existing instructions including comparison, add, absolute value and negation have been extended to operate on 64-bit integer elements. There are also new instructions for data type conversion, floating point normalization and saturating integer arithmetic. Lastly, ARMv8 includes a variety of optional cryptographic instructions that are intended to complement existing hardware accelerators. With 16 new instructions, ARM opted to focus on AES encryption, the SHA1 and SHA256 hashing algorithms and Galois fields.