The CONEXIUM Interchange
The CONEXIUM Interchange was designed as a coherent, scalable, on-chip interconnect between the processor cores, the L2 cache, the ENVOI I/O and the DDR2 controllers. As shown in Figure 4, CONEXIUM uses a shared address bus and a partially connected crossbar for data. Every element of the chip is kept coherent using a MOESI protocol, which has been modified to minimize data movement.
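The way MOESI reduces data movement relative to simpler protocols is the Owned state: a cache holding dirty data can supply it directly to another cache without first writing it back to memory. The sketch below illustrates that read-snoop behavior in the abstract; it is a textbook MOESI transition table, not a description of P.A. Semi's actual (modified) implementation.

```python
# Minimal sketch of MOESI read-snoop handling (illustrative only, not
# P.A. Semi's modified protocol). The Owned ("O") state lets a cache with
# dirty data service remote reads cache-to-cache, avoiding a memory
# writeback on every sharing event.

def snoop_remote_read(state):
    """Return (next_state, data_supplier) when another cache reads the line."""
    transitions = {
        "M": ("O", "cache"),   # dirty line supplied cache-to-cache; no writeback
        "O": ("O", "cache"),   # owner keeps servicing subsequent readers
        "E": ("S", "memory"),  # line is clean, so memory can supply it
        "S": ("S", "memory"),
        "I": ("I", "memory"),
    }
    return transitions[state]

print(snoop_remote_read("M"))  # -> ('O', 'cache')
```

In a MESI protocol, by contrast, a Modified line hit by a remote read must be written back to memory, which is exactly the extra data movement that the Owned state avoids.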
Figure 4 – Conexium Interchange (from a P.A. Semi presentation)
The shared address bus runs at half of the PA6T core frequency (so 1GT/s for the initial MPU). The PA6T cores and the I/O bridge are the only devices allowed to initiate transactions across CONEXIUM, but all devices can respond. Because the I/O bridge uses the L2 cache, it has full Direct Memory Access (DMA) over CONEXIUM. The shared address bus is also used to enforce a strongly ordered memory model (the memory model defines whether reads and writes can be re-ordered), since all address bus accesses are serialized. While most high performance systems use weakly ordered models, a strongly ordered model is a much better fit for embedded designs, where programmability and predictability are more important factors. Weakly ordered models can cause poorly written software to break, while strongly ordered models are far more forgiving of such software. The partially connected data crossbar is 128 bits wide with full duplex operation (i.e. each device has both a read and a write port) and provides up to 64GB/s of bandwidth. Both the address bus and the data crossbar are parity protected to catch transmission errors.
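The article does not break down how the 64GB/s figure is counted, but one plausible accounting is sketched below, under the assumption (not stated in the source) that the 128-bit data paths run at the 2GHz core clock and that the figure includes both directions of a full-duplex port.

```python
# Back-of-the-envelope check of the quoted 64GB/s crossbar bandwidth.
# Assumptions (ours, not the article's): data ports transfer once per
# 2GHz core clock, and the quoted number counts read + write together.

width_bytes = 128 // 8                    # 128-bit data path -> 16 bytes/transfer
transfer_rate_ghz = 2.0                   # assumed 2GT/s (core clock)

per_direction = width_bytes * transfer_rate_ghz  # GB/s one way
full_duplex = per_direction * 2                  # read and write simultaneously

print(per_direction)  # 32.0 GB/s each way
print(full_duplex)    # 64.0 GB/s with both ports busy
```

If the data paths instead ran at the 1GT/s address-bus rate, the same arithmetic would yield 32GB/s, so the quoted figure suggests the higher clock or an aggregate over multiple ports.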
CONEXIUM is also used by the on-die DDR2 controllers. Each controller implements a single channel of 1066MHz memory, and together they provide a total of 16GB/s of bandwidth. The closed page latency is 55ns, and the open page latency is 45ns. The memory is protected with both SECDED ECC and CRC checks. As a result, the controllers can catch x8 DRAM read failures on the first access, and ~97% of x16 DRAM failures.
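SECDED (single-error-correct, double-error-detect) ECC is typically built from an extended Hamming code. The toy example below works on 4 data bits with 4 check bits to keep it short; real DDR2 ECC protects 64-bit words with 8 check bits, but the mechanism is the same. This is purely illustrative and is not the PWRficient controller's actual code.

```python
# Toy SECDED code: extended Hamming(8,4). One flipped bit is corrected;
# two flipped bits are detected as uncorrectable.

def encode(d):
    """d: list of 4 bits -> 8-bit codeword [p0, p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                 # parity over Hamming positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4                 # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4                 # parity over positions 5, 6, 7
    p0 = p1 ^ p2 ^ d1 ^ p4 ^ d2 ^ d3 ^ d4   # overall parity (SECDED extension)
    return [p0, p1, p2, d1, p4, d2, d3, d4]

def decode(c):
    """Return ('ok' | 'corrected' | 'uncorrectable', data bits or None)."""
    p0, p1, p2, d1, p4, d2, d3, d4 = c
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s4 = p4 ^ d2 ^ d3 ^ d4
    syndrome = s1 + 2 * s2 + 4 * s4   # position (1-7) of a single flipped bit
    overall = p0 ^ p1 ^ p2 ^ d1 ^ p4 ^ d2 ^ d3 ^ d4
    if syndrome == 0 and overall == 0:
        return "ok", [d1, d2, d3, d4]
    if overall == 1:                  # odd number of flips: correct one bit
        c = list(c)
        c[syndrome] ^= 1              # syndrome 0 means p0 itself was flipped
        return "corrected", [c[3], c[5], c[6], c[7]]
    return "uncorrectable", None      # even flips with nonzero syndrome

word = encode([1, 0, 1, 1])
word[5] ^= 1                          # inject a single-bit error
print(decode(word))                   # -> ('corrected', [1, 0, 1, 1])
```

Chipkill-style claims like catching x8 device failures on the first read require stronger codes (or interleaving check bits across devices) than this per-word SECDED sketch, which is why the controllers pair ECC with CRC checks.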
The topology for CONEXIUM was designed for targeted, but not unlimited, scalability. Given the factors shown in Figure 1, it is easy to understand why this topology was chosen. While a ring interconnect, like the one in CELL, would be more scalable, it is rather inefficient for smaller configurations because of the higher latency. On the other hand, a simple shared bus would not scale far enough and fails to take advantage of being entirely on-chip. The crossbar topology is an effective compromise between the two; it has low latency for small and medium configurations, but sacrifices some scalability to achieve this. The second generation of the PWRficient family may eventually use a different interconnect, but that would probably not be until 2009 or later.