ARM1/ARM2 Alternative? (20/20 Hindsight)

By: Paul A. Clayton, June 27, 2019 5:47 pm
Room: Moderated Discussions
With the same basic constraints of ARM2, how would hindsight change the architectural and microarchitectural design choices?

The basic constraints seemed to include no off-chip cache, chip cost, and perhaps ease of porting Acorn's BASIC implementation and other 6502 assembly code. For this thought experiment, modern design knowledge is assumed (familiarity with microarchitectural tricks, issues regarding binary compatibility and Moore's Law, hardware design, etc.).

Providing condition codes might make 6502 software porting significantly easier, might provide lower branch penalties, and might have other advantages.

I suspect that full predication would not have been included. Despite its benefits given the bandwidth constraints from limited pin-out and no off-chip cache, I suspect supporting 16-bit instructions (like Thumb2) would be a better alternative. (Thumb2 does include If-Then-Else, but it is not clear how much this choice came from legacy factors. Thumb2 implementations do not seem to suffer from the original bandwidth constraints. An AArch64-like conditional select with simple operations seems likely to be considered.)

I suspect shift-and-operate instructions would probably have been at least reduced in generality. Like predication, such exploited the very slow (memory-tied) cycle time, but with 16-bit instructions the core could be more decoupled from memory even with very little instruction fetch buffering. (The immediate compression provided by shifting would have to be considered as well as small constant shifts for address generation and multiplication by certain constants.)

Autoincrement and autodecrement memory accesses might not be problematic (and have code density advantages).

Some form of load-store-multiple would probably be provided (because such seems likely to significantly reduce instruction bandwidth contention with data accesses while exploiting open DRAM pages), but ARM's 16-bit register-mask encoding seems unlikely to be selected. PowerPC's range limit encoding or a block-based encoding where flag bits indicate selection of variably-sized blocks (e.g., 8-register, 4-register, 2-register, 2-register) seems a little more attractive. Even with a small stack cache, I suspect such would be helpful.
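To make the two alternative encodings concrete, here is a minimal Python sketch of how four flag bits could select register blocks, next to a PowerPC-style range limit. The bit-to-block assignment, the 16-register file, and the function names are all assumptions made for illustration; nothing here is taken from an actual ARM or PowerPC specification beyond the general idea.

```python
# Hypothetical block sizes from the example above: 8, 4, 2 and 2 registers,
# covering a 16-register file with only four flag bits.
BLOCK_SIZES = (8, 4, 2, 2)


def decode_block_flags(flags):
    """Expand a 4-bit block selector into a list of register numbers.

    Assumed assignment: bit 0 selects r0-r7, bit 1 r8-r11,
    bit 2 r12-r13, bit 3 r14-r15.
    """
    regs = []
    base = 0
    for bit, size in enumerate(BLOCK_SIZES):
        if flags & (1 << bit):
            regs.extend(range(base, base + size))
        base += size
    return regs


def decode_range_limit(first, num_regs=16):
    """Range-limit style: transfer `first` through the last register."""
    return list(range(first, num_regs))


if __name__ == "__main__":
    print(decode_block_flags(0b0101))  # r0-r7 plus r12-r13
    print(decode_range_limit(13))      # r13, r14, r15
```

Either form needs only four bits for the register selection instead of sixteen, at the cost of not being able to name arbitrary register subsets.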

I think making call and return save and restore/use two words would probably be worthwhile. Buffering these values would be less expensive than providing more registers and the use of a complex instruction would reduce fetch/data contention. I also think return could include an immediate stack pointer adjustment (this would be a two-destination instruction, but banking could be used for in-order implementations).

The PC would probably not be tied to a GPR. Some support for PC-relative addressing would probably be provided. Mixing code and data on the same DRAM page has significant attraction for exploiting same-page bandwidth; it might be desirable to support a small instruction that loads such local data using a PC-"section" offset/inset rather than a PC offset. (Such would allow a load address to be presented quickly and might use the instruction cache in later implementations. Inverting the most significant bit(s) of the PC within the "section" would allow greater data reach for a given inset. It would even be possible (though probably not useful) to use BTB entries to record PC-inset loads.)
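One possible reading of the PC-"section" inset, sketched in Python: the load address is formed by concatenation rather than addition, keeping the upper PC bits, inverting the PC's top within-section bit, and filling the remaining bits from the inset. The 9-bit inset width, the single inverted bit, and the function name are assumptions chosen for illustration.

```python
def section_inset_address(pc, inset, inset_bits=9):
    """Form a load address from the PC and an inset without an adder.

    The section spans 2**(inset_bits + 1) bytes. Code in one half of the
    section reaches data in the other half, so the whole inset range is
    spent on data rather than on the code's own half.
    """
    assert 0 <= inset < (1 << inset_bits)
    half_bit = 1 << inset_bits                # top bit of the within-section offset
    section_base = pc & ~(2 * half_bit - 1)   # upper PC bits, unchanged
    other_half = (pc & half_bit) ^ half_bit   # inverted top within-section bit
    return section_base | other_half | inset


if __name__ == "__main__":
    # Code in the lower half of the 1 KiB section at 0x8000 reads the upper half.
    print(hex(section_inset_address(0x8010, 0x14)))  # 0x8214
    # Code in the upper half reads the lower half.
    print(hex(section_inset_address(0x8210, 0x14)))  # 0x8014
```

Because no carry chain is involved, the address is available as soon as the inset field is extracted from the instruction.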

Rather than providing shadow registers for fast interrupts and OS state, extra state might be generalized as potential variable-sized (and sharing) thread contexts. The same basic design could support more numerous tiny contexts or fewer larger contexts. Each context, in addition to having a register set and PC, could have a context ID (for a simple implementation this might be the same as the mode for ARM2, but such mode bits could also be interpreted as an index into a context ID table in the memory controller).

(I don't know if the 3D register file trick would be useful for such early semiconductor processes and low port count register files, but future implementations could exploit such temporal banking.)

Registers could be renamed at finer granularity than they are guarded; e.g., an 8-register thread associated with a context ID could have a one bit thread selector which would invert the most significant register index bit in its thread to allow a compiler/assembler to assign registers 0-3 to subthread contexts rather than having to track the thread number and explicitly rename registers during allocation. The "upper" registers could still be used by a tiny context as long as they were saved and restored and the other subthread was never run before the values were restored. (Interrupt nesting might also be an early design consideration even if it was not implemented initially.)
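A minimal Python sketch of the one-bit thread selector, assuming an 8-register thread whose register index is XORed with the selector in its most significant bit. The function name and parameter names are hypothetical.

```python
def physical_register(logical, subthread, regs_per_thread=8):
    """Map a logical register index to a physical one within a thread.

    The subthread selector inverts the most significant index bit, so both
    subthreads can be compiled against registers 0-3 without colliding.
    """
    assert 0 <= logical < regs_per_thread
    assert subthread in (0, 1)
    msb = regs_per_thread >> 1  # 4 for an 8-register thread
    return logical ^ (msb if subthread else 0)


if __name__ == "__main__":
    # Both subthreads name registers 0-3 but land in disjoint physical halves.
    print([physical_register(r, 0) for r in range(4)])  # [0, 1, 2, 3]
    print([physical_register(r, 1) for r in range(4)])  # [4, 5, 6, 7]
    # A "tiny" subthread naming an upper register reaches the other half.
    print(physical_register(4, 1))                      # 0
```

The last case is the one that requires the save/restore discipline described above: subthread 1 naming register 4 touches the physical register that subthread 0 sees as register 0.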

(It might be interesting if interrupts could introduce more arguments to the processor than just interrupt type. Including the interrupt type in general purpose registers would also seem to reduce overhead. In some cases such could avoid explicit memory accesses to I/O addresses and the potential high turn-around time for such. Such might have been too complex to implement with existing devices, but it seems like a nice optimization.)

Architecting some degree of register banking might have been useful, but such might have been too difficult to exploit with practical compilers.

The rotated unaligned word load might have been interesting combined with configurable address masking. One could then load a pointer with metadata in the lowest byte and immediately use the pointer (one could also use small values known not to overflow by loading the critical value into the bottom of the register; not likely for a compiler but an assembly programmer could do this). With limited memory access bandwidth such could be useful.

In terms of microarchitecture, I suspect that a small instruction buffer would be provided. Since a two-byte buffer would be needed to exploit variable-length instructions, increasing its size would not be extraordinarily expensive. Providing a small stack cache might be worthwhile for reducing memory traffic, but the chip area cost might have been excessive.

(A global/Knapsack and stack cache might have been practical for the second commercial implementation. ARM3 included a shared on-chip cache. I suspect using CAMs for ARM3's cache was a mistake in hindsight; partially shared caches and specialized data subcaches could have reduced associativity issues and pair-wise way refinement (where a partial tag selects one way or the other) might have been practical for reducing power.)

Not matching memory and processor cycle time might have been appropriate. With 16-bit instructions and an instruction buffer decoupling fetch from execution, being able to execute two instructions in one memory cycle time could be helpful. (A 16-byte loop buffer would allow tiny simple loops to use the full memory bandwidth for data accesses.)
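The bandwidth argument can be put in rough numbers with a small Python model. It assumes a 32-bit bus delivering one word per memory cycle, ignores alignment, DRAM page-mode effects, and branch refetch, and treats a loop as fully buffered when its code fits in the loop buffer. The function and parameter names are hypothetical.

```python
import math


def memory_cycles_per_iteration(n_instructions, n_data_accesses,
                                instr_bytes=2, bus_bytes=4,
                                loop_buffer_bytes=0):
    """Estimate memory cycles consumed by one loop iteration.

    Instruction fetch costs one cycle per bus word of code unless the whole
    loop body fits in the loop buffer, in which case it costs nothing.
    """
    code_bytes = n_instructions * instr_bytes
    if code_bytes <= loop_buffer_bytes:
        fetch_cycles = 0
    else:
        fetch_cycles = math.ceil(code_bytes / bus_bytes)
    return fetch_cycles + n_data_accesses


if __name__ == "__main__":
    # A six-instruction loop with two data accesses per iteration.
    print(memory_cycles_per_iteration(6, 2, instr_bytes=4))           # 8
    print(memory_cycles_per_iteration(6, 2, instr_bytes=2))           # 5
    print(memory_cycles_per_iteration(6, 2, loop_buffer_bytes=16))    # 2
```

Under these assumptions the 16-bit encoding alone removes three of the eight memory cycles, and the 16-byte loop buffer leaves only the two data accesses.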

The chip size (cost) constraints may have made some optimizations impractical. (Note the question is also assuming sufficiently numerous and experienced hardware designers as long as such would not substantially hurt design time even though such would presumably have been too expensive.)

I suspect some of my thoughts are just wrong (I do not know the detailed tradeoffs at that time, even when discounting head-count/experience budget constraints, and some of the above is a bit WACI). I know I also left many considerations unaddressed. I do hope that posters will fill in the blanks and excise the mistakes.