SPE Overview

Figure 3 – SPE die photo with functional unit overlay
Figure 3 shows the die photo of the Synergistic (or just plain SIMD) Processing Element (SPE). The SPE is a specialized processing element dedicated to computation on SIMD-type data streams. Each SPE has 256 KB of private memory, referred to as the Local Store (LS), implemented as four separate arrays of 64 KB each. The LS is a private, non-coherent address space that is separate from the system address space. It is built from ECC-protected arrays of single-ported SRAM and has been optimized for high sustained bandwidth and small cell size. The cell size is 0.99 µm² on the 90 nm SOI process, and the access latency to the LS is 6 cycles.
SPE Architecture
To minimize non-computational hardware, the SPE has no hardware for data fetch or branch prediction. These tasks are instead left to software. The SPE implements an instruction set that resembles VMX but is not a strict subset of it, and all instructions are 32 bits in length. SPE instructions operate on a unified register file with 128 registers. Each register is 128 bits wide, and most instructions operate on 128-bit operands by treating them as four separate 32-bit operands. Given the 18-cycle branch misprediction penalty and the lack of a branch predictor, considerable effort will have to be devoted to avoiding branches. The large register file is thus a necessary element in eliminating unnecessary branches via loop unrolling, since every unrolled copy of a loop body needs its own registers for the values it keeps live.
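The trade-off between branch cost and register demand can be put into rough numbers. The following is a minimal sketch, not a model of the real SPE pipeline: it assumes, pessimistically, that every loop-back branch pays the full 18-cycle penalty, and that each unrolled copy of the loop body keeps a fixed number of registers live. The function names and the `reserved_registers` parameter are hypothetical, introduced only for illustration.

```python
BRANCH_PENALTY_CYCLES = 18   # branch misprediction penalty quoted in the text
REGISTER_COUNT = 128         # size of the unified register file quoted in the text


def branch_overhead_cycles(iterations, unroll_factor):
    """Cycles lost to loop-back branches for a loop of `iterations` steps.

    Assumption (pessimistic, for illustration only): every loop-back
    branch pays the full misprediction penalty.
    """
    if iterations < 0 or unroll_factor < 1:
        raise ValueError("iterations must be >= 0 and unroll_factor >= 1")
    loop_trips = -(-iterations // unroll_factor)  # ceiling division
    return loop_trips * BRANCH_PENALTY_CYCLES


def max_unroll_factor(live_registers_per_iteration, reserved_registers=0):
    """Largest unroll factor whose live values still fit in the register file.

    Hypothetical model: each unrolled copy of the loop body keeps
    `live_registers_per_iteration` values live at the same time.
    """
    if live_registers_per_iteration < 1:
        raise ValueError("live_registers_per_iteration must be >= 1")
    available = REGISTER_COUNT - reserved_registers
    return max(available // live_registers_per_iteration, 0)


if __name__ == "__main__":
    # Branch overhead for a 1024-iteration loop at several unroll factors.
    for factor in (1, 4, 8):
        print(factor, branch_overhead_cycles(1024, factor))
```

Under these assumptions, unrolling a 1024-iteration loop by a factor of 8 cuts the branch overhead from 18432 cycles to 2304, and a loop body with 6 live values could be unrolled up to 21 times before running out of registers.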

Figure 4 – SPE Organization
The SPE is an in-order processor that can issue two instructions per cycle to seven execution units in two different pipelines. Typically, each instruction uses 3 source operands to produce 1 result. Operands are fetched from either the register file or the forwarding network. Because of the in-order pipeline and the strict issue rules, the processor relies on the forwarding network to minimize execution bubbles. To support dual issue, with each of the two pipelines sourcing 3 operands and producing 1 result per cycle, the register file has 6 read ports and 2 write ports. Register file access takes 2 cycles.
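The port counts follow directly from the issue width, and the three-source, one-result shape is easiest to see in a multiply-add. The sketch below is illustrative only: `register_file_ports` simply restates the arithmetic from the paragraph above, and `multiply_add` treats a register as a tuple of four unsigned 32-bit lanes with wrap-around, which is an assumed semantics chosen for simplicity and not the definition of any specific SPE instruction.

```python
ISSUE_WIDTH = 2               # instructions issued per cycle (from the text)
SOURCES_PER_INSTRUCTION = 3   # source operands per instruction (from the text)
RESULTS_PER_INSTRUCTION = 1   # results per instruction (from the text)
LANES = 4                     # a 128-bit register viewed as four 32-bit lanes
MASK32 = 0xFFFFFFFF


def register_file_ports(issue_width=ISSUE_WIDTH,
                        sources=SOURCES_PER_INSTRUCTION,
                        results=RESULTS_PER_INSTRUCTION):
    """Return (read_ports, write_ports) needed to feed every pipeline each cycle."""
    return issue_width * sources, issue_width * results


def multiply_add(a, b, c):
    """Lane-wise a * b + c on four 32-bit lanes, wrapping modulo 2**32.

    Hypothetical semantics: three source registers in, one result register out.
    """
    for reg in (a, b, c):
        if len(reg) != LANES:
            raise ValueError("each register must hold exactly four lanes")
    return tuple((x * y + z) & MASK32 for x, y, z in zip(a, b, c))


if __name__ == "__main__":
    print(register_file_ports())
    print(multiply_add((1, 2, 3, 4), (5, 6, 7, 8), (1, 1, 1, 1)))
```

With the values from the text, `register_file_ports()` gives 6 read ports and 2 write ports, matching the register file described above.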