Persistent Memory Software Model
The software architecture is as important as the switch from PCIe to a DDR memory bus, and Intel has spent considerable resources moving the industry forward. 3DXP DIMMs are just one example of the broader class of persistent memory (or persistent DIMMs). For example, persistent memory exists today in the form of the NVDIMM-N standard, and Micron sells 8GB and 16GB NVDIMMs that use regular DDR4 DRAM backed by non-volatile NAND flash and capacitors for persistence.
Intel and the industry have converged on the term persistent memory and have also created a standard programming model through SNIA. Persistent memory differs from both storage and main memory, although it can be treated as either with suitable software. As part of ACPI 6.0, the NFIT (NVDIMM Firmware Interface Table) is a standard mechanism for enumerating persistent memory (e.g., 3DXP DIMMs or NVDIMMs) separately from main memory. As illustrated in Figure 1, the persistent memory programming model offers four different options for application developers to enable a wide variety of uses.
Intel has already extended the x86 ISA with two new instructions to accelerate persistent memory in the Skylake server family. CLFLUSHOPT flushes a specified line from all caches (instruction and data) by invalidating the line. If the cache line was dirty, the line is written to memory, ensuring that memory has the most up-to-date version. Writes to different cache lines can be reordered around CLFLUSHOPT, whereas the older CLFLUSH serializes all writes. As a result, CLFLUSHOPT is around 10X faster than the older CLFLUSH instruction for more than 4KB of data. The second instruction, CLWB, forces the processor to write back a specified cache line (if it has been modified) to memory, but the processor can retain the line in the caches. In essence, it forces a cache line into a clean state (e.g., E, S, or F in the MESIF protocol) by writing to memory if necessary. Both of these instructions force data to be written back to persistent memory from the cache at high speed. They are commonly combined with a single SFENCE instruction to guarantee that the writeback becomes globally observable.
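The flush-then-fence pattern described above can be sketched in C with compiler intrinsics. This is a minimal illustration under stated assumptions, not Intel's reference code: persist_range, flush_line, and store_fence are hypothetical names, the 64-byte cache line size is assumed, and the instruction is selected at compile time (CLWB when built with -mclwb, CLFLUSHOPT with -mclflushopt, otherwise the older CLFLUSH). On non-x86-64 builds the sketch only issues a fence so that it still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Write back one cache line with the best instruction enabled at compile time. */
static inline void flush_line(const void *p)
{
#if defined(__x86_64__) && defined(__CLWB__)
    _mm_clwb((void *)p);        /* write back; the line may stay cached */
#elif defined(__x86_64__) && defined(__CLFLUSHOPT__)
    _mm_clflushopt((void *)p);  /* write back and invalidate, weakly ordered */
#elif defined(__x86_64__)
    _mm_clflush(p);             /* legacy fallback, strongly ordered */
#else
    (void)p;                    /* non-x86 build: no flush is issued */
#endif
}

/* Order the preceding flushes before any later stores. */
static inline void store_fence(void)
{
#if defined(__x86_64__)
    _mm_sfence();
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/*
 * Hypothetical helper: flush every cache line that overlaps
 * [addr, addr + len), then issue a single fence.
 * Returns the number of cache lines flushed.
 */
static size_t persist_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    if (len == 0)
        return 0;

    uintptr_t line = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end  = (uintptr_t)addr + len;
    size_t flushed = 0;

    for (; line < end; line += CACHE_LINE) {
        flush_line((const void *)line);
        flushed++;
    }
    store_fence();  /* one fence covers all of the flushes above */
    return flushed;
}
```

A caller would typically update a record in persistent memory and then call persist_range(&record, sizeof record). Issuing one fence after the loop, rather than one per line, is what lets the weakly ordered CLWB and CLFLUSHOPT flushes overlap.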

Figure 1. SNIA Programming Model for 3DXP
In a traditional environment, applications access storage through two mechanisms. First, they can use file APIs to open, read, write, and close a file. The API calls are handled by the filesystem, which checks permissions and then asks the storage driver to copy the file's data into memory that is mapped to the application. Second, an application can directly access block storage, bypassing the filesystem. This option is popular for databases and other applications that can manage storage more efficiently than a generic filesystem.
In the simplest and lowest-performance approach, an NVDIMM driver reserves and manages portions of the persistent memory as block storage using a block translation table. The NVDIMM block storage can be accessed directly by applications or managed by a conventional filesystem. As block storage, the access granularity is relatively large (e.g., 512B or 4KB). Since accesses go through the NVDIMM driver, latency is relatively high and bandwidth relatively low. However, the system changes are quite minimal and this approach works with legacy filesystems and applications.
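To make the granularity cost concrete, the sketch below changes a single byte through a block interface. It assumes a 4KB block size and uses an ordinary file descriptor as a stand-in for the NVDIMM block device; update_byte_via_block is a hypothetical helper for illustration and not part of any driver.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096  /* assumed block size in bytes */

/*
 * Hypothetical helper: change one byte using block-granularity I/O.
 * The whole containing block is read, patched in memory, and written back.
 * Returns the number of bytes transferred, or -1 on error.
 */
static long update_byte_via_block(int fd, off_t byte_offset, unsigned char value)
{
    unsigned char block[BLOCK_SIZE];
    off_t block_start = byte_offset - (byte_offset % BLOCK_SIZE);

    /* Read the full block that contains the target byte. */
    if (pread(fd, block, BLOCK_SIZE, block_start) != BLOCK_SIZE)
        return -1;

    /* Patch the single byte in the in-memory copy. */
    block[byte_offset - block_start] = value;

    /* Write the full block back. */
    if (pwrite(fd, block, BLOCK_SIZE, block_start) != BLOCK_SIZE)
        return -1;

    return 2L * BLOCK_SIZE;  /* one block read plus one block written */
}
```

Under these assumptions a one-byte update moves 8KB of data, which is the overhead that byte-granularity access avoids.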
For the best performance, persistent memory is used via direct access (referred to as DAX in Linux and Windows). With DAX, a region of persistent memory is directly mapped into the address space of an application (e.g., using mmap() on Linux). Subsequent accesses are performed by fine-grained load and store instructions, which read from and write to the persistent memory. Since stored data may remain in the processor caches, instructions like CLWB and CLFLUSHOPT ensure that it is written to the persistent memory. A single x86 memory access can range from a single byte up to 64B, although only aligned 8-byte stores are guaranteed to be atomic. Since the read/write path bypasses the driver and filesystem entirely, DAX accesses are low latency with high throughput.
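The map, store, and flush sequence can be sketched with standard POSIX calls. This is a portable approximation rather than a true DAX mapping: it uses MAP_SHARED on an ordinary file and msync() as the durability step, whereas on a DAX-mounted filesystem on recent Linux kernels one would typically map with MAP_SHARED_VALIDATE | MAP_SYNC and flush with CLWB followed by SFENCE. The helpers store_u64_persistent and load_u64 are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a file, store one aligned 64-bit value at
 * `offset`, and make it durable. The file is sized to map_len bytes.
 * Returns 0 on success, -1 on error.
 */
static int store_u64_persistent(const char *path, size_t map_len,
                                size_t offset, uint64_t value)
{
    /* Only aligned 8-byte stores are relied on to be atomic. */
    if (offset % 8 != 0 || offset + 8 > map_len)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)map_len) != 0) {
        close(fd);
        return -1;
    }

    void *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return -1;

    /* An ordinary store instruction, with no read()/write() system call. */
    volatile uint64_t *slot = (volatile uint64_t *)((char *)base + offset);
    *slot = value;

    /* Portable stand-in for the CLWB + SFENCE step on real persistent memory. */
    int rc = msync(base, map_len, MS_SYNC);
    munmap(base, map_len);
    return rc;
}

/* Hypothetical helper: read the value back through the regular file API. */
static int load_u64(const char *path, size_t offset, uint64_t *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, out, sizeof *out, (off_t)offset);
    close(fd);
    return n == (ssize_t)sizeof *out ? 0 : -1;
}
```

Keeping the store to a single aligned 8-byte slot means a reader sees either the old or the new value, which is why many persistent data structures publish updates by writing one aligned pointer or counter last.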
The easiest way for a developer to use DAX is through libraries. The open-source Persistent Memory Developer Kit (PMDK) is a set of nine highly tuned libraries that includes allocators, a transactional object store, a persistent memory log file, and other functions. PMDK is currently production quality for Linux, although some APIs are experimental. A production-quality Windows build is available, and it is possible that Microsoft will include it in future versions of Windows. While PMDK was developed by Intel with 3DXP DIMMs in mind, it is hardware agnostic and will work with other solutions (e.g., NVDIMM-Ns). Separately, Persistent Collections for Java builds on PMDK to offer a variety of persistent classes (e.g., linked lists) for developers. The memkind library was originally designed to handle the fast memory in Knights Landing, but now lets developers simply treat the persistent memory as main memory that is slower than DRAM but higher in capacity.
In addition to accessing persistent memory as byte-addressable memory, developers can use standard file APIs with a persistent memory-aware filesystem. The filesystem will then handle the unusual nature of the persistent memory. In particular, the filesystem will directly access the backing persistent memory, bypassing the regular system page cache.