ARM’s Race to Embedded World Domination


Dr Java and Mr Hyde

The StrongARM branch of the ARM family isn’t the only one to advance. The 1996 ARM8 retained the original three-stage ARM execution pipeline and relied on a shrink to a 0.5 µm, three-level-metal CMOS process to raise performance to 84 MIPS at 72 MHz. In contrast, the 1997 ARM9 core stretched the execution pipeline to the same five-stage organization used in the first-generation StrongARM. The ARM9TDMI core occupied 4.8 mm² in 0.35 µm, used 111k transistors, and delivered 220 MIPS at 200 MHz.

A second revision of the ARM9 architecture, called ARM9E, implements SIMD ISA extensions that enhance performance for multimedia-type applications. These extensions are also included in the next-generation ARM1020E as well as in XScale, Intel’s second-generation StrongARM [9]. The new instructions are rather conventional as far as SIMD ISA extensions go and resemble MMX in modern x86 processors. They include 8- and 16-bit SIMD addition and subtraction, 16- and 32-bit multiplies, and saturating arithmetic operations, as well as selection operations.

A far more interesting and innovative development for ARM is the optional Jazelle architectural enhancement to ARM9E (called ARM9EJ), which accelerates the execution of Java bytecodes [10]. It is similar in concept to the Thumb logic: a hardware block that is inserted into the instruction execution pipeline between the instruction cache and the ARM instruction decoder. When enabled, the Jazelle feature processes Java virtual machine (JVM) bytecodes from the instruction stream and translates them on the fly. Simple bytecodes are replaced with equivalent sequences of native ARM RISC instructions that are fed directly to the processor core for immediate execution. More complex bytecodes are trapped and executed by a software JVM [11]. This scheme is shown in Figure 6.


Figure 6 Jazelle Java Execution Hardware Extension

The Jazelle unit is enabled by a new ‘J’ bit in the current program status register (CPSR). A new instruction, ‘BXJ Rn’, was added to the ARM instruction set to invoke Jazelle. The BXJ instruction is a register-indirect branch that sets the J bit and begins execution of a Java bytecode routine. The processor continues to execute bytecodes using Jazelle until an unhandled bytecode exception, a Java addressing exception, or an ARM hardware exception such as an interrupt occurs. Exceptions are handled in native ARM mode, and execution using Jazelle can be resumed immediately upon returning from the exception. The ARM program counter (PC), R15, was made byte addressable to facilitate direct Java execution (native ARM instructions are all 32 bits in size and aligned in memory, so the least significant two bits of the PC were previously hardwired to zero).
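The mode switching described above can be pictured with a toy state model. This is only a sketch under stated assumptions: the class `Cpu`, its method names, and the `saved_j`/`saved_pc` fields are hypothetical stand-ins invented for illustration, not ARM's actual register banking or exception mechanism.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    """Toy model of the J bit and PC behavior; not a real ARM simulator."""
    pc: int = 0             # R15, treated as byte addressable in this model
    j_bit: bool = False     # the 'J' bit in the CPSR
    saved_j: bool = False   # hypothetical: status saved on exception entry
    saved_pc: int = 0       # hypothetical: return address saved on entry

    def bxj(self, rn_value: int) -> None:
        # Register-indirect branch that sets the J bit.
        self.pc = rn_value
        self.j_bit = True

    def step_bytecode(self, length: int) -> None:
        # Bytecodes vary in length, so the PC advances by whole bytes
        # and may land on any address, not just multiples of four.
        if not self.j_bit:
            raise RuntimeError("bytecodes only execute while the J bit is set")
        self.pc += length

    def take_exception(self, handler: int) -> None:
        # Exceptions are handled in native ARM mode: save state, clear J.
        self.saved_j, self.saved_pc = self.j_bit, self.pc
        self.j_bit = False
        self.pc = handler

    def return_from_exception(self) -> None:
        # Restoring the saved state resumes bytecode execution.
        self.j_bit, self.pc = self.saved_j, self.saved_pc


cpu = Cpu()
cpu.bxj(0x8001)            # enter Java state at an odd byte address
cpu.step_bytecode(1)       # a one-byte bytecode
cpu.step_bytecode(3)       # a three-byte bytecode
cpu.take_exception(0x18)   # e.g. an interrupt, handled in ARM state
cpu.return_from_exception()
```

After the final call the model is back in Java state at the byte address where it was interrupted, which is the property that requires the low two bits of the PC to be usable.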

The ARM Jazelle unit directly executes 140 bytecodes. These include constant loads, variable loads and stores, array loads and stores, integer data operations, branches, quick constant pool loads, and quick static/field operations. The remaining 94 bytecodes are trapped to a software kernel for emulation. They include FP operations, integer division, switch, invoke, return, new, unresolved constant pool loads, and unresolved field/static operations. The software kernel needed to implement these complex bytecodes is about half the size of a conventional software JVM. The Jazelle unit itself is rather modest: about 12k extra logic gates and an extra pipeline stage (which is bypassed when in ARM or Thumb mode). The extra hardware complexity increases the power consumption of an ARM9 core by about 10%.
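The hardware/software split can be sketched as a simple lookup. The category names below come from the paragraph above; the set names, the `dispatch` function, and its return strings are assumptions made for illustration and say nothing about how the Jazelle decoder is actually built.

```python
# Bytecode categories as listed in the text (not individual opcodes).
HARDWARE_CATEGORIES = {
    "constant load", "variable load/store", "array load/store",
    "integer data operation", "branch",
    "quick constant pool load", "quick static/field operation",
}
TRAPPED_CATEGORIES = {
    "FP operation", "integer division", "switch", "invoke", "return", "new",
    "unresolved constant pool load", "unresolved field/static operation",
}

DIRECT_COUNT = 140   # bytecodes executed directly, per the text
TRAPPED_COUNT = 94   # bytecodes trapped to the software kernel, per the text


def dispatch(category: str) -> str:
    """Return where a bytecode of this category would be handled."""
    if category in HARDWARE_CATEGORIES:
        return "jazelle"          # translated to native ARM instructions
    if category in TRAPPED_CATEGORIES:
        return "software kernel"  # emulated in ARM code
    raise ValueError(f"unknown bytecode category: {category}")


total = DIRECT_COUNT + TRAPPED_COUNT
hardware_share = DIRECT_COUNT / total
print(f"{total} bytecodes, {hardware_share:.0%} handled in hardware")
```

The share computed here is a static count of opcodes, roughly 60% of the total. It says nothing about how often each bytecode occurs at run time, which is what determines the real speedup.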

According to ARM, Jazelle increases Java application performance by about 8x compared to an equivalently clocked ARM9 core running a pure software JVM. This improvement is shown in Table 2, along with comparisons against pure software JVMs and hardware solutions on other processors.

Table 2 Clock Normalized Java Performance Comparison (Embedded CaffeineMark 3.0, per MHz)

Processor          CaffeineMarks per MHz
R3000              0.52
R4600              0.61
ARM9               0.67
Pentium            0.91
R3000 + JSTAR      2.87
Ajile AJ100        2.9
ARM9 + Jazelle     5.75
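The ratios implied by Table 2 can be checked with a few lines of arithmetic. The figures are copied from the table; the dictionary and the `speedup` helper are illustrative names only.

```python
# Embedded CaffeineMark 3.0 scores per MHz, copied from Table 2.
CAFFEINEMARKS_PER_MHZ = {
    "R3000": 0.52,
    "R4600": 0.61,
    "ARM9": 0.67,
    "Pentium": 0.91,
    "R3000 + JSTAR": 2.87,
    "Ajile AJ100": 2.9,
    "ARM9 + Jazelle": 5.75,
}


def speedup(name: str, baseline: str = "ARM9") -> float:
    """Clock-normalized score of `name` relative to `baseline`."""
    return CAFFEINEMARKS_PER_MHZ[name] / CAFFEINEMARKS_PER_MHZ[baseline]


jazelle_gain = speedup("ARM9 + Jazelle")        # versus a plain ARM9
jstar_gain = speedup("R3000 + JSTAR", "R3000")  # versus a plain R3000
print(f"Jazelle: {jazelle_gain:.1f}x, JSTAR: {jstar_gain:.1f}x")
```

The Jazelle ratio works out to roughly 8.6, consistent with the "about 8x" figure quoted from ARM. Because the scores are per MHz, these ratios only compare like-clocked parts and say nothing about absolute performance.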

It should be noted that the results for the standard ARM9 core, as well as for the other processors in Table 2, could be improved dramatically using dynamic optimization techniques such as just-in-time (JIT) compilers. But these techniques introduce latency when invoking new Java methods, and they increase system memory requirements to support the compiler and its native code cache. Such considerations might be trivial for desktop computers, but they can take on major significance in the kind of low-cost, highly integrated applications that ARM targets, such as cell phones, personal digital assistants, and information appliances.

