At VLSI, IBM presented a research paper on a programmable accelerator for training neural networks and performing inference using a trained network, with authors from both IBM Research and the IBM Systems group. As Figure 1 illustrates, the accelerator is a highly regular tiled architecture comprising a compute array, special-function units (SFUs), and scratchpad memories. The processor is a dataflow-based design, which maps naturally to neural networks.
The compute array is arranged as a 2-dimensional torus of processing elements (PEs). As Figure 2 shows, each PE is a tiny core that includes an instruction buffer, a fetch/decode stage, a 16-entry register file, 16-bit floating-point execution units, binary and ternary ALUs, and fabric links to and from the neighboring PEs. The execution units handle basic operations such as multiply-add, along with estimation for many non-linear functions such as square root or reciprocal. The bottom row of the compute array is a set of special-function units (SFUs) that are similar to the PEs, but add support for higher-precision 32-bit FP math and datatype conversion between 16-bit and 32-bit formats. We believe that the array is 32 PEs or SFUs wide and 16 PEs plus 1 SFU tall.
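To make the torus arrangement concrete, here is a minimal sketch of wraparound neighbor addressing, assuming the estimated 32 x 16 PE array from the paragraph above. The function name and coordinate scheme are hypothetical, and whether the SFU row takes part in the vertical wraparound is not known, so the sketch covers only the PE rows.

```python
def torus_neighbors(x, y, width=32, height=16):
    """Return the four neighbors of PE (x, y) on a width x height 2D torus.

    width and height default to the article's estimate of the PE array;
    they are assumptions, not figures confirmed by IBM.
    """
    return {
        "west": ((x - 1) % width, y),    # wraps from column 0 to column width-1
        "east": ((x + 1) % width, y),    # wraps from column width-1 to column 0
        "north": (x, (y - 1) % height),  # wraps from row 0 to row height-1
        "south": (x, (y + 1) % height),  # wraps from row height-1 to row 0
    }


if __name__ == "__main__":
    # A corner PE still has four neighbors because the edges wrap around.
    print(torus_neighbors(0, 0))
```

Unlike a plain mesh, every PE in a torus has the same number of links, so edge and corner PEs need no special handling.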
Data arrives in the compute array from a two-level scratchpad. Two 8KB L0 scratchpads are positioned along different dimensions of the torus, backed by a shared 2MB scratchpad. This arrangement flexibly supports different types of dataflow to exploit three different reuse patterns. A weight stationary approach stores the weights of the neural network in the local registers, but moves input data and partial sums through the fabric and scratchpads. An output stationary approach moves the inputs and the weights, but keeps the partial sums in the local register file. Last, a row stationary approach exploits locality in multiple dimensions by mapping a row of weights (e.g., one dimension of a convolution) and a row of inputs to each PE, and then combining the different results to generate the partial sums. Outside the datapath, the chip management unit (CMU) includes test logic, and the accelerator also requires clock generation and JTAG access.
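The difference between the weight stationary and output stationary patterns can be sketched as two loop orderings of the same 1D convolution. This is an illustrative software model only, with hypothetical function names; it says nothing about how IBM actually schedules work on the PEs, and the row stationary pattern is left out for brevity.

```python
def conv1d_weight_stationary(inputs, weights):
    """Outer loop over weights: each weight is loaded once and held in place
    (as if in a PE's local register) while inputs and partial sums move."""
    n_out = len(inputs) - len(weights) + 1
    partial_sums = [0] * n_out
    for k, w in enumerate(weights):      # weight stays put for the inner loop
        for i in range(n_out):           # inputs and partial sums stream past
            partial_sums[i] += w * inputs[i + k]
    return partial_sums


def conv1d_output_stationary(inputs, weights):
    """Outer loop over outputs: each partial sum is held in a local
    accumulator while inputs and weights move."""
    n_out = len(inputs) - len(weights) + 1
    outputs = []
    for i in range(n_out):               # one accumulator stays put
        acc = 0
        for k, w in enumerate(weights):  # weights and inputs stream past
            acc += w * inputs[i + k]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    w = [1, 0, -1]
    # Both orderings compute the same result; only the reuse pattern differs.
    print(conv1d_weight_stationary(x, w))
    print(conv1d_output_stationary(x, w))
```

Both functions perform exactly the same multiply-adds. What changes is which operand stays resident, and therefore which data must travel through the fabric and scratchpads.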
The accelerator is relatively compact, occupying just 9mm² in a 14nm process. The major components are the PE array (3.2mm², 36%), the SFUs and Y-dimension SRAM (0.6mm², 7%), and the X-dimension and L2 SRAMs (2.5mm², 28%), with the remainder for control, test, interfaces, and unused whitespace. It runs at 1.5GHz using a 0.9V power supply and delivers a peak of 1.5TFLOP/s for training, with 192GB/s read + 192GB/s write bandwidth from each of the three scratchpads.
The supported datatypes suggest a research project rather than a product. While FP16 multiplication is a reasonable choice for training neural networks, accumulation is typically performed in FP32 to avoid losing dynamic range. For inference, there have been papers on binary or ternary data, but they are firmly in the realm of research, while most products focus on 8-bit integers.
IBM shared utilization details for training ResNet-50, a deep convolutional network for image classification. The accelerator was able to exceed 80% utilization of the cores when training on a single example image (batch size = 1), and over 95% when training with a batch of four or eight images. While those utilization figures are impressive, the accelerator is tiny and offers just a fraction of the performance of a modern CPU or GPU, and utilization is typically very high for such small systems.
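As a sanity check, the quoted figures are consistent with the estimated 32 x 16 PE array. The sketch below assumes one multiply-add (counted as two floating-point operations) per PE per cycle, which the article does not state. All names are hypothetical.

```python
CLOCK_HZ = 1.5e9          # 1.5GHz, from the article
PE_COLUMNS = 32           # estimated array width (assumption from the article)
PE_ROWS = 16              # estimated array height, PE rows only
FLOPS_PER_PE_CYCLE = 2    # assumption: one multiply-add per PE per cycle
TOTAL_AREA_MM2 = 9.0      # total accelerator area, from the article


def peak_flops():
    """Peak FLOP/s if every PE retires one multiply-add each cycle."""
    return PE_COLUMNS * PE_ROWS * FLOPS_PER_PE_CYCLE * CLOCK_HZ


def area_share(component_mm2):
    """Component area as a rounded percentage of the total area."""
    return round(100 * component_mm2 / TOTAL_AREA_MM2)


def bytes_per_cycle(bandwidth_bytes_per_s=192e9):
    """Scratchpad bandwidth per direction, expressed per clock cycle."""
    return bandwidth_bytes_per_s / CLOCK_HZ


if __name__ == "__main__":
    print(peak_flops() / 1e12, "TFLOP/s")         # about 1.5 TFLOP/s
    print(area_share(3.2), area_share(0.6), area_share(2.5))
    print(bytes_per_cycle(), "bytes per cycle")
```

Under these assumptions, 512 PEs give 1.536TFLOP/s, which matches the quoted 1.5TFLOP/s peak. The three listed components sum to 6.3mm², leaving about 2.7mm² (30%) for everything else. The 192GB/s figure works out to 128 bytes per cycle in each direction at 1.5GHz.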
As a research project, the absolute performance is not terribly important. However, the key architectural choices are quite interesting. IBM’s processor uses a large array of very small processor cores with very little SIMD. This choice enables better performance for sparse dataflow (e.g., sparse activations in a neural network). In contrast, Google, Intel, and Nvidia all rely on a small number of large cores with lots of dense data parallelism to achieve good performance. Relatedly, IBM’s PEs are arranged in a 2D array with a mesh network, a natural organization for planar silicon and a workload with a reasonable degree of locality. While Intel processors also use a mesh fabric for inter-core communication, GPUs have a rather different architecture that looks more like a crossbar.
The IBM PEs are optimized for common operations (e.g., multiply-accumulate) and sufficiently programmable to support different dataflows and reuse patterns. Less common operations are performed outside of the core in the special-function units. As with many machine learning processors, a variety of reduced-precision data formats are used to improve throughput.
Last, the processor relies on software-managed data (and instruction) movement in explicitly addressed SRAMs, rather than hardware-managed caches. This approach is similar to the Cell processor and offers superior flexibility and power efficiency (compared to caches) at the cost of significant programmer and toolchain complexity. While not every machine learning processor will share all these attributes, IBM’s design certainly illustrates a different approach from any of the incumbents, one more consistent with the architectures chosen by start-ups such as Graphcore or Wave that focus solely on machine learning and neural networks.