VLIW Processing Engines
The PEs in each tile are minimalist VLIW cores that use a 96-bit instruction word to encode up to 8 operations, backed by a 10-port register file. The emphasis here is on minimalist: the PEs don’t have caches, virtual memory, coherency or any of the other niceties found in modern embedded processors, let alone high-performance ones. The operations in each word are two single-precision FMACs, a load, a store, a network send, a network receive, a branch and power management. Loads and stores access the 2KB data memory, and instructions are fetched from a 3KB instruction memory. Roughly 74% of the PEs are covered by NMOS sleep transistors, which impose a 5.4% area penalty and a 4% frequency penalty. The FMAC units use a 9-stage pipeline, with a single stage for accumulation. They have either a 3- or 6-cycle pipelined wake-up, which reduces current spikes while still allowing execution to begin after a single cycle.
Figure 3 – Tile Microarchitecture
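To put the PE's memory sizes in perspective, the following sketch works out how many instruction words and single-precision values fit in the two memories, and the peak floating-point operations per cycle. It uses only the figures quoted above. Two things are assumptions rather than statements from the text: that "KB" means 1024 bytes, and that each FMAC is counted as two floating-point operations (a multiply plus an add), which is the conventional accounting. The function and constant names are illustrative.

```python
# Back-of-the-envelope capacity figures for a single PE.
# Assumptions: 1 KB = 1024 bytes; one FMAC counts as 2 flops (multiply + add).

INSTRUCTION_WORD_BITS = 96            # VLIW instruction word width
INSTRUCTION_MEMORY_BYTES = 3 * 1024   # 3KB instruction memory
DATA_MEMORY_BYTES = 2 * 1024          # 2KB data memory
SINGLE_PRECISION_BYTES = 4            # 32-bit single-precision value
FMACS_PER_PE = 2                      # two FMAC slots per instruction word
FLOPS_PER_FMAC = 2                    # assumed: multiply + accumulate


def instruction_words(memory_bytes: int, word_bits: int) -> int:
    """Number of whole instruction words that fit in the instruction memory."""
    return (memory_bytes * 8) // word_bits


def data_words(memory_bytes: int, value_bytes: int) -> int:
    """Number of whole values that fit in the data memory."""
    return memory_bytes // value_bytes


def peak_flops_per_cycle(fmacs: int, flops_per_fmac: int) -> int:
    """Peak floating-point operations one PE can issue per cycle."""
    return fmacs * flops_per_fmac


if __name__ == "__main__":
    print("instruction words:", instruction_words(INSTRUCTION_MEMORY_BYTES, INSTRUCTION_WORD_BITS))
    print("single-precision values:", data_words(DATA_MEMORY_BYTES, SINGLE_PRECISION_BYTES))
    print("peak flops per cycle:", peak_flops_per_cycle(FMACS_PER_PE, FLOPS_PER_FMAC))
```

Under these assumptions each PE holds 256 instruction words and 512 single-precision values, and peaks at 4 floating-point operations per cycle. That shows how tightly a kernel and its working set must be sized to fit.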
One future research direction is 3D integration. The Teraflops group intends to connect to an external SRAM that is packaged below the device. This offers huge benefits in terms of bandwidth and latency, but is limited by power dissipation: only one chip can be mated to a heatsink, so all the others must dissipate relatively little heat. Initial 3D integration will probably use internally manufactured SRAMs; however, a natural evolution would be to use DRAMs, which offer vastly greater capacity, lower power consumption and lower cost. Intel, though, has long since exited that business and would have to work with an external partner.