Coherency and Multi-Processor Configuration
In the previous article, the Local Store (LS) in the SPE was described in some detail. The LS is a non-coherent local address space used by the SPE. However, the description of the non-coherent LS created some confusion, since the Element Interconnect Bus (EIB) was described as a coherent ring, supported by the ATO, presumably some form of atomic memory access unit. The question was raised as to why the SPEs would snoop addresses on the EIB if the LS is a non-coherent local address space. Fortunately, IBM's paper on the CELL processor at the HPCA conference provided the explanation and some details on this topic:
Unlike Power processors, the SPEs operate only on their local memory (local store or LS). Code and data must be transferred into the associated LS for an SPE to execute or operate on. Local Store addresses do have an alias in the Power processor address map, and transfers to and from Local Store to memory at large (including other Local Stores) are coherent in the system. As a result, a pointer to a data structure that has been created on the Power processor can be passed to an SPE, and the SPE can use this pointer to issue a DMA command to bring the data structure into its local store in order to perform operations on it. If after operating on this data structure the SPE (or Power core) issues a DMA command to place it back into non-LS memory, the transfer is again coherent in the system according to the normal Power memory ordering rules. Equivalents of the Power processor locking instructions, as well as a memory mapped mailbox per SPE, are used for synchronization and mutual exclusion.
IBM also released more details regarding the processor interconnect of the CELL processor at HPCA. As previously described, the 12 byte lanes on the external interface of the processor are arranged into two groups of ports. One group of ports is dedicated to non-coherent off-chip traffic, while the other group of ports can be used for coherent off-chip traffic. At HPCA, IBM revealed that the CELL processor can support a glueless two-way SMP configuration, where two CELL processors connect to each other in a manner similar to how two Opteron processors connect via a cache-coherent HyperTransport (ccHT) interconnect. However, to support N-way MP (N > 2), a coherent switch is required. Specifically, the question was posed as to whether multiple CELL processors can be connected as a coherent ring gluelessly, and the answer was that a glueless ring topology is not possible.
Prototype Rackmount CELLs
Yet another intriguing detail released by IBM at HPCA is that Sony and IBM are collaborating on a series of prototypes to explore applications decidedly outside the domain of game consoles. Specifically, CELL Processor Based Workstations (CPBW) have been booted and are currently undergoing testing and evaluation. IBM estimates that a rack filled with CELL processors can provide upwards of 16 Tera-Flops.
Figure 4 – One compute node with 2 CELL processors
Figure 4 shows a two-processor node used in the CELL prototype rack. Each node consists of two CELL processors connected gluelessly. The two processors share the same I/O bridge, which in turn connects to the inter-node high-bandwidth networks as well as node-specific storage and additional node-control hardware.
The proclaimed target workloads for the prototype rack are:
- Computer entertainment
- Real-time rendering
- Physics simulations