For me, SC19 was about the fusion of machine learning and scientific computing. I learned about new technologies from Nvidia, Graphcore, and Cerebras Systems and spoke on a panel about the role of MLPerf in benchmarking HPC systems for machine learning and the many lessons learned.
Intel’s Haswell CPU Microarchitecture
Intel’s Haswell CPU is the first core optimized for 22nm and includes a huge number of innovations for developers and users. New instructions for transactional memory, bit-manipulation, full 256-bit integer SIMD and floating point multiply-accumulate are combined in a microarchitecture that essentially doubles computational throughput and cache bandwidth. Most importantly, the microarchitecture was designed for efficiency and extends Intel’s offerings down to 10W tablets, while maintaining leadership for notebooks, desktops, servers and workstations.
Analysis of Haswell’s Transactional Memory
Intel’s upcoming Haswell microprocessors include transactional memory and hardware lock elision that are exposed through the Transactional Synchronization Extensions or TSX. In this article, I discuss TSX and predict the implementation details of Haswell’s transactional memory and expected adoption across the industry, based on my previous experience.
Memory Bandwidth and GPU Performance
Memory bandwidth is critical to feeding the shader arrays in programmable GPUs. We show that memory is an integral part of a good performance model and can impact graphics performance by 40% or more. The implications are important for upcoming integrated graphics, such as AMD’s Llano and Intel’s Ivy Bridge, where bandwidth constraints will play a key role in determining overall performance.
Predicting AMD and Nvidia GPU Performance
Modern graphics processors are incredibly complex, but understanding their performance is essential, as they become an increasingly important component of computer systems. In this report, we use a set of benchmark results to build accurate performance models for AMD and Nvidia GPUs. We verify that our model can predict performance within roughly 6-8% for many desktop graphics cards and show how Nvidia’s microarchitecture and drivers achieve roughly 2X higher utilization than AMD’s VLIW5 design.
Introduction to OpenCL
A critical question for GPU computing is how programmers will interface with the underlying hardware. Users have the choice between three APIs: Nvidia’s proprietary CUDA, Microsoft’s DirectCompute and OpenCL. Of the three, OpenCL has garnered the most enthusiasm across the PC ecosystem (e.g. AMD, IBM, Intel and Nvidia) and the mobile and embedded market (e.g. ARM and Imagination Technologies). While still a nascent technology, OpenCL is very popular because it is an open industry standard that promises compatibility on a huge variety of hardware. This article explores OpenCL, from the early development efforts at Apple to the standard itself, including its execution and memory models.
Parallelism at HotPar 2010
The 2010 HotPar workshop had a variety of papers focusing on the software and programming aspects of parallelism. Highlights include parallelization of the Firefox browser, Michael McCool’s approach to parallel building blocks and Electronic Arts’ Cascade system for handling state in video games. More hardware-centric topics include a controversial view that parallelism is irrelevant and the limits of GPU performance.
PhysX87: Software Deficiency
PhysX is a key application that Nvidia uses to showcase the advantages of GPU computing (GPGPU) for consumers. PhysX executing on an Nvidia GPU can improve performance by 2-4X compared to running on a CPU from Intel or AMD. We investigated and discovered that CPU PhysX exclusively uses x87 rather than the faster SSE instructions. This hobbles the performance of CPUs, calling into question the real benefits of PhysX on a GPU.
MAQSIP-RT: An HPC Benchmark
In this article, we test out a new HPC benchmark from one of our readers on an Istanbul server from Supermicro. MAQSIP-RT is a forecasting and analysis package that is commonly used throughout the weather and atmospheric chemistry communities. In our first run, we take a look at scalability and performance and find a benchmark that suits many of our needs.
Performance Analysis for Core 2 and K8: Part 1
We analyze the performance of a 2.9GHz 65nm Core 2 Duo (aka Conroe) vs. a 2.8GHz 90nm K8, using performance tools from Intel and AMD. With VTune and CodeAnalyst we are able to extract performance counter information such as IPC, uop density, cache miss rates, branch mispredictions, memory accesses and other data that we use to explain the difference in performance between the two CPUs.
The first part focuses on characteristics that are common across both CPUs, while later parts will focus on microarchitecture-specific counters.
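As a rough illustration of the kind of counter-derived metrics mentioned above, the sketch below turns raw event counts into ratios such as IPC and cache miss rate. It is only a minimal sketch under stated assumptions: the field names are hypothetical and do not correspond to actual VTune or CodeAnalyst event names, and "uops per instruction" is just one plausible reading of uop density, not necessarily the definition used in the article.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw event counts from one profiling run.

    All field names are hypothetical; real tools expose
    vendor-specific event names instead.
    """
    cycles: int
    instructions: int
    uops: int
    cache_accesses: int
    cache_misses: int
    branches: int
    branch_mispredictions: int


def derived_metrics(s: CounterSample) -> dict:
    """Compute simple ratios from raw counts (assumes non-zero denominators)."""
    return {
        # Instructions retired per clock cycle.
        "ipc": s.instructions / s.cycles,
        # Assumed interpretation of "uop density": uops per instruction.
        "uops_per_instruction": s.uops / s.instructions,
        # Fraction of cache accesses that missed.
        "cache_miss_rate": s.cache_misses / s.cache_accesses,
        # Fraction of branches that were mispredicted.
        "branch_mispredict_rate": s.branch_mispredictions / s.branches,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only, not measurements of any CPU.
    sample = CounterSample(
        cycles=1000,
        instructions=1500,
        uops=1800,
        cache_accesses=400,
        cache_misses=20,
        branches=200,
        branch_mispredictions=10,
    )
    for name, value in derived_metrics(sample).items():
        print(f"{name}: {value:.3f}")
```

Normalizing counts into ratios like these is what makes two CPUs running at different clock speeds comparable on the same workload.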