For me, SC19 was about the fusion of machine learning and scientific computing. I learned about new technologies from Nvidia, Graphcore, and Cerebras Systems and spoke on a panel about the role of MLPerf in benchmarking HPC systems for machine learning and the many lessons learned.
At VLSI 2018, researchers from TDK and TSMC described advances in magnetoresistive memory (MRAM). TDK focused on new materials to improve writing for low-voltage MRAM cells at small geometries. A team from TSMC showcased circuit techniques to improve read performance of MRAM arrays despite process variability and a small read window.
IBM presented a neural network accelerator at VLSI 2018 showcasing a variety of architectural techniques for machine learning, including a regular 2D array of small processing elements optimized for dataflow computation, reduced precision arithmetic, and explicitly addressed memories.
Intel will offer 3DXP-based DIMMs (previously codenamed Apache Pass) that use the DDR4 interface on the next-generation Cascade Lake server processor. The first DIMMs will be available in 128GB, 256GB, and 512GB capacities and work with a new software architecture for persistent memory. Intel and its partners have enabled the new persistent memory programming model for Java, Linux, VMware, and Windows, and many customers are eagerly awaiting the non-volatile, high-capacity memory for in-memory databases and other applications.
Intel’s 22FFL (FinFET Low-power) is a variant of its existing 22nm process that is aimed at low-cost, extremely low-power, and analog/RF applications. 22FFL relaxes the ground rules to reduce the need for double patterning, thereby cutting costs. At the same time, Intel’s engineers essentially backported the second and third generation FinFETs from the 14nm and 10nm processes, respectively, to 22FFL, improving performance and power efficiency with superior fin geometry and workfunction metals. Intel also created a large library of digital and analog transistors and passive components.
Previously, Apple’s iPhones and iPads used PowerVR GPUs from Imagination Technologies for graphics. Based on our analysis, Apple has created a custom GPU that powers the A8, A9, and A10 processors, shipping in the iPhone 6 and later models, and some iPads. Using public documents, we demonstrate that the programmable shader cores inside Apple’s GPU are different from Imagination Technologies’ PowerVR and offer superior 16-bit floating-point performance and data conversion functions. We further believe that Apple has also developed a custom shader compiler and graphics driver. The proprietary design enables Apple to deliver best-in-class performance for graphics and for other tasks that use the GPU, such as image processing and machine learning.
Starting with the Maxwell GM20x architecture, Nvidia’s high-performance GPUs have borrowed techniques from low-power mobile graphics architectures. Specifically, Maxwell and Pascal use tile-based immediate-mode rasterizers that buffer pixel output, instead of conventional full-screen immediate-mode rasterizers. Using simple DirectX shaders, we demonstrate the tile-based rasterization in Nvidia’s Maxwell and Pascal GPUs and contrast this behavior with the immediate-mode rasterizer used by AMD.
On the eve of the 50th anniversary of Moore’s Law, the future of silicon CMOS is an open question. With rising costs and uncertain benefits, some semiconductor companies have questioned the wisdom of pursuing further scaling. I predict that Intel’s 10nm process technology will use Quantum Well FETs (QWFETs) with a 3D fin geometry, InGaAs for the NFET channel, and strained Germanium for the PFET channel, enabling lower-voltage and more energy-efficient transistors in 2016, and that the rest of the industry will follow suit at the 7nm node.
My favorite paper from the ISSCC processor session describes an adaptive clocking technique implemented in AMD’s 28nm Steamroller core that compensates for power supply noise. Initial results show a 10-20% decrease in power consumption from reducing the voltage, with no loss in performance. This elegant technique is likely to be adopted across AMD’s entire product line including GPUs, x86 CPUs, ARM-based CPUs, and other critical blocks in highly integrated SoCs.
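As a rough sanity check on those numbers, here is a minimal sketch that assumes dynamic power scales with the square of the supply voltage at a fixed clock frequency and ignores leakage. That is a textbook approximation, not a model taken from the paper, and the helper name is hypothetical. Under that assumption, a 10-20% power saving corresponds to lowering the voltage by roughly 5-11%.

```python
import math


def voltage_scale_for_power_saving(power_saving: float) -> float:
    """Return V_new / V_old for a given fractional dynamic-power saving.

    Assumes P is proportional to V**2 at fixed frequency (leakage ignored).
    """
    if not 0.0 <= power_saving < 1.0:
        raise ValueError("power_saving must be in [0, 1)")
    return math.sqrt(1.0 - power_saving)


for saving in (0.10, 0.20):
    scale = voltage_scale_for_power_saving(saving)
    print(f"{saving:.0%} power saving -> voltage reduced by {1 - scale:.1%}")
# prints:
# 10% power saving -> voltage reduced by 5.1%
# 20% power saving -> voltage reduced by 10.6%
```

Real silicon also has leakage power, which depends on voltage differently, so the actual voltage reduction behind the reported savings may not match this estimate.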
Jaguar is AMD’s first 28nm processor, a compact 3.1mm² design that targets 2-25W devices. It is a derivative of the earlier 40nm Bobcat, a fully out-of-order, two-issue design, with significant improvements in instruction set architecture and implementation. Some of the highlights include support for AVX, wider 128-bit datapaths, and a higher-performance L2 cache. Jaguar is already shipping in several AMD SoCs targeted at tablets, notebooks, microservers, and desktops. However, it is far more prominent as the CPU powering the Sony PlayStation 4 and Microsoft Xbox One.