Block devices, including hard disk drives (HDDs) and solid-state drives (SSDs), have become the dominant form of external storage. From hardware interfaces to system software stacks to user applications, the entire ecosystem is built around block devices. Therefore, when non-volatile memory (NVM) emerged, its faster access speed and byte-level access granularity posed significant challenges for system design: it seemed that everything needed a complete redesign to accommodate these new features. But is redesigning everything really the only solution? Unlike most previous work based on NVM, we show [7] how to integrate NVM seamlessly and painlessly into the existing block-device software stack, providing transparent acceleration to user applications without requiring any changes to them.
From punch cards to magnetic tapes and then to disks, external storage devices with persistence have continuously evolved toward miniaturization and higher speeds to meet the need for system state and data preservation in the event of power loss. Among these developments, the emergence of solid-state drives (SSDs) marks a groundbreaking milestone: by eliminating mechanical movement in storage devices, SSDs reduce access latency from the millisecond range to the microsecond range and significantly enhance random access performance.
Though storage devices have become faster and faster, accessing disks is still not as straightforward as accessing memory. Disks typically use a larger read/write granularity (blocks, usually 512 B or more) and have access latencies several orders of magnitude higher than that of DRAM. File systems and their surrounding infrastructure were developed to address these challenges. Disk file systems are generally responsible for managing the layout of file data and metadata on the disk while providing a certain level of crash consistency guarantees. In addition, operating systems maintain DRAM caches for disk files (such as the file-backed page cache in Linux), significantly improving read and write efficiency.
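To make the granularity constraint concrete, the short C sketch below contrasts an ordinary buffered read, which goes through the DRAM page cache and accepts any offset and length, with an O_DIRECT read, which bypasses the cache and must be block-aligned. The file name and the 4 KB block size are illustrative assumptions.

```c
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    const size_t blk = 4096;                  /* assumed logical block size */

    /* Buffered path: arbitrary sizes are fine; the DRAM page cache absorbs
     * the mismatch with the device's block granularity. */
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;
    char small[100];
    if (read(fd, small, sizeof(small)) < 0) return 1;
    close(fd);

    /* Direct path: the page cache is bypassed, so the buffer, offset, and
     * length must all be aligned to the block size. */
    int dfd = open("data.bin", O_RDONLY | O_DIRECT);
    if (dfd < 0) return 1;
    void *buf = NULL;
    if (posix_memalign(&buf, blk, blk) != 0) return 1;
    if (pread(dfd, buf, blk, 0) < 0) return 1;
    free(buf);
    close(dfd);
    return 0;
}
```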
Among the various system calls related to file systems, synchronous operations (such as sync, fsync, fdatasync, etc.) are particularly interesting. They emerged with the advent of DRAM file caches: while caching improves file system performance, it also means that writes may not immediately become persistent. To explicitly guarantee that written data has been persisted, users can invoke synchronous operations. This effectively issues a barrier that blocks the current process until the write is committed to disk. Of course, this also means that the cache can no longer hide disk latency during the synchronous operation, and your process must endure the slow disk I/O. Synchronous operations are crucial for applications like databases that require consistency guarantees, and they are equally important for people like me who habitually save documents frequently.
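As a minimal illustration (not taken from the paper), the following C snippet shows the pattern this article is concerned with: a write() that lands in the DRAM page cache, followed by an fsync() that blocks until the data is durable on disk. The file name is arbitrary.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *rec = "committed record\n";
    /* The write() only reaches the DRAM page cache; it is not yet durable. */
    if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

    /* fsync() blocks until the data (and the metadata needed to reach it)
     * is persisted on disk: this is the slow, disk-bound step that
     * synchronous operations introduce. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```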
As a new form of persistent storage, Non-Volatile Memory (NVM) offers nanosecond-level access latency and can be accessed at byte granularity, much like DRAM. As a result, traditional storage software stacks designed for larger-granularity, slower devices seem ill-suited for NVM. This novel device has garnered significant attention: researchers have attempted to build new file systems and databases on NVM, or to extend memory space with it (since its single-chip capacity exceeds that of DRAM). In the case of file systems, for example, recent work on NVM largely seeks to bypass the DRAM cache and treat NVM as a directly accessible storage device (e.g., Ext4 DAX [1], NOVA [6], PMFS [3], etc.). By collapsing the usual double write (first into the DRAM cache, then to the storage device) into a single write to NVM, these approaches lower the persistence latency of data. Such efforts have shown promising performance on certain workloads.
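For concreteness, the sketch below shows what direct access looks like from user space on a DAX-capable file system (e.g., ext4 mounted with -o dax on NVM). The mount point, file name, and mapping size are assumptions; MAP_SYNC is the Linux flag that guarantees the mapping points at the persistent media, and msync() is used here as the portable persistence step (PMDK's pmem_persist() would flush CPU caches from user space instead).

```c
#define _GNU_SOURCE          /* for MAP_SYNC / MAP_SHARED_VALIDATE */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Assumed: /mnt/pmem is a DAX-mounted NVM file system and the file
     * has already been created and sized (e.g., with fallocate). */
    int fd = open("/mnt/pmem/data.bin", O_RDWR);
    if (fd < 0) return 1;

    size_t len = 4096;
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (addr == MAP_FAILED) return 1;

    /* Stores go straight to NVM; the DRAM page cache is not involved. */
    memcpy(addr, "hello, persistent world", 24);

    /* Ensure the stores have left the CPU caches and reached the
     * persistence domain. */
    msync(addr, len, MS_SYNC);

    munmap(addr, len);
    close(fd);
    return 0;
}
```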
However, as we have observed, persistent memory has not seen widespread deployment outside a few data centers, and Intel discontinued its Optane PMEM [2] a few years ago. Beyond commercial factors, we believe a key reason for the lack of large-scale success of persistent memory is that it has not been "painlessly" integrated into current systems. New applications designed for NVM typically target one of two key characteristics: first, NVM's high-performance persistence capability, which has led to the development of NVM-based file systems (NVM FS) and databases (NVM DB); second, its large capacity and low cost per byte in the DIMM form factor, which has inspired tiered-memory research.
For the first category of work, although NVM is fast, its performance is still several times lower than that of DRAM. As a result, while approaches like DAX FS optimize synchronous writes on NVM by bypassing DRAM, they significantly sacrifice the performance of asynchronous writes and reads. More importantly, almost all modern applications optimize their persistence paths around asynchronous reads and writes, issuing synchronous operations only when consistency must be ensured. Therefore, the performance of NVM FS has not met expectations in most existing applications. For the second category of work, Intel's relatively high pricing for Optane blunted its cost advantage, and it is gradually being displaced by memory-expansion solutions based on RDMA or CXL.
Next, we focus on NVM's high-speed persistence capability.
Although NVM offers relatively high persistence speeds, its performance is still lower than that of DRAM. Therefore, we believe that approaches like NOVA, which sacrifice conventional read/write performance to provide optimal synchronous write performance, are likely only suitable for a few specific use cases, such as write-heavy databases.
At the same time, while NVM capacities typically range from several hundred GB to a few TB (e.g., with Optane), a single disk can easily offer tens of TB, and disk arrays can provide PB-level capacities at much lower cost-per-byte. Therefore, another issue with replacing disk file systems with NVM file systems is the reduction in available capacity, the significant increase in costs, and the overhead of migrating large amounts of existing data to a new file system.
We conducted a performance analysis comparing current disk file systems and NVM file systems. The results in Figure 1 show that when data access is accelerated by DRAM (cache hit), disk file systems outperform NVM file systems. Generally, after an application has been running for a while, the cache hit rate tends to be high. As a result, the advantages of the NVM file system are mostly limited to synchronous write operations.

Although synchronous write performance is crucial for applications like databases, asynchronous writes and reads often play a more significant role in practical use cases. Considering that disks, DRAM, and the current storage software stacks built on them still offer broad advantages in many tasks, we believe that retaining the existing mature disk file systems and leveraging NVM to transparently accelerate their synchronous writes may be the best way to seamlessly and painlessly integrate NVM into current systems.
Using NVM to accelerate disk file systems is not without precedent: SPFS [5] stacks a new NVM file system on top of a disk file system and predicts which writes are likely to be synchronized, transferring that data to NVM. P2CACHE [4] provides a strongly consistent file cache by writing all data simultaneously to both DRAM and NVM, thereby eliminating disk I/O for synchronous requests. However, the performance of these approaches may not fully meet expectations.
SPFS optimizes synchronous writes by predicting consecutive synchronous requests, which makes it difficult to provide effective acceleration for the infrequent and irregular synchronous requests common in many applications. Furthermore, once synchronous writes are offloaded to NVM, the upper-level NVM file system takes over subsequent reads and writes of this data, so subsequent asynchronous reads and writes become slower than they would be with the DRAM cache of the original disk file system.
P2CACHE retains the fast path for reading data from DRAM; however, it writes all data, whether synchronous or not, to both DRAM and NVM simultaneously. Since NVM write performance is lower than that of DRAM, the system's performance actually degrades for the majority of applications that primarily perform asynchronous writes.
Meanwhile, both SPFS and P2CACHE are implemented much like independent file systems: they establish and manage indexes for data at runtime, and once data is persisted to the upper-level NVM, it is decoupled from the data on disk and can only be migrated to the underlying disk file system periodically and at a coarse granularity. We believe these designs fail to leverage NVM to transparently and efficiently accelerate existing disk file systems. Instead, they are merely another attempt that, like NOVA, optimizes synchronous writes but may slow down other read/write requests.
We believe that simply taking data ownership away from the disk file system is not a wise choice. Our goal is to precisely accelerate the synchronous write operations that slow down the file system, while maintaining the high performance provided by the DRAM cache for all other operations. At the same time, this acceleration should be transparent: it should not require changes to user programs or to the time-tested, robust disk file systems. However, this is not an easy task. From our analysis of SPFS and P2CACHE, we have drawn the following two insights:
First, the DRAM cache is sufficient and efficient for serving applications. Therefore, when persisting synchronized data to NVM, the focus should be on the efficiency of recording rather than on data retrieval. Because they neglect this, both P2CACHE and SPFS have to build an index over the data on NVM for subsequent reads, and they have difficulty reducing their NVM space usage.
Second, establishing a well-defined write ordering between NVM and the disk is crucial for ensuring crash consistency while minimizing the amount of data written to NVM. Because they neglect this, SPFS and P2CACHE are forced to redirect asynchronous writes to NVM as well when absorbing synchronous writes, in order to avoid inconsistencies between the data from synchronous writes (on NVM) and asynchronous writes (on disk).
Drawing inspiration from database design, we believe that using NVM as a write-ahead log (WAL) for disk file systems is a more efficient solution. As shown in Figure 2, we designed NVLog to intercept (and only intercept) synchronous calls before they reach the file system and to write the synchronized data to NVM, thereby transforming synchronous write requests into asynchronous ones. This way, cache-hit operations on the file system no longer need to wait for disk I/O: reads and asynchronous writes are still served by the DRAM cache, while for synchronous writes, data is written to both DRAM and NVM in the foreground and the disk write is offloaded to the background.
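The following user-space C sketch illustrates the idea (the real NVLog is an in-kernel layer, and names such as nvm_log_append() and nvlog_write_sync() are ours, not the system's API): a synchronous write still lands in the DRAM page cache, a copy is persisted to the NVM log, and the actual disk fsync is handed to a background thread so the caller never waits on the disk.

```c
#include <pthread.h>
#include <unistd.h>

/* Append a record <fd, off, len, data> to an append-only NVM region and
 * flush it. Stubbed out here; one possible record layout is sketched below. */
static void nvm_log_append(int fd, const void *buf, size_t len, off_t off) {
    (void)fd; (void)buf; (void)len; (void)off;
}

/* The slow disk I/O, now off the application's critical path. Once it
 * completes, the matching NVM log records become reclaimable. */
static void *background_fsync(void *arg) {
    fsync(*(int *)arg);
    return NULL;
}

/* What "write + fsync" conceptually becomes under the WAL approach. */
ssize_t nvlog_write_sync(int fd, const void *buf, size_t len, off_t off) {
    static int bg_fd;                         /* simplified: one in-flight fsync */
    pthread_t t;

    ssize_t n = pwrite(fd, buf, len, off);    /* foreground: DRAM page cache */
    if (n < 0) return n;
    nvm_log_append(fd, buf, (size_t)n, off);  /* foreground: persist to NVM log */

    bg_fd = fd;                               /* background: real disk sync */
    if (pthread_create(&t, NULL, background_fsync, &bg_fd) == 0)
        pthread_detach(&t);
    return n;                                 /* return without waiting on disk */
}
```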

We focus on ensuring the post-crash persistence of data for synchronous operations. As such, we record synchronous events in NVM in an append-only manner without indexing the data. After a crash, we simply replay the recorded events to restore the data that was supposed to be on disk. This is a key distinction between NVLog and SPFS/P2CACHE: NVLog serves as a lightweight WAL for file system synchronous writes, while SPFS and P2CACHE are inherently heavier file systems. By eliminating indexing, NVLog delivers higher performance; by logging only synchronous data and reclaiming expired records on NVM once their data has reached the disk, NVLog requires less NVM space than other approaches.
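To make the append-only structure concrete, the sketch below shows one possible record layout and the corresponding replay loop. The header fields and the single-target-file simplification are our assumptions for illustration, not NVLog's actual on-NVM format; recovery simply walks the log and re-applies each record to the disk file system.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed record header; `len` data bytes follow it in the log. */
struct log_rec {
    uint64_t seq;     /* monotonically increasing record number          */
    uint64_t ino;     /* identifies the target file (simplified)         */
    uint64_t off;     /* file offset of the logged synchronous write     */
    uint64_t len;     /* number of data bytes following this header      */
};

/* Replay the whole log onto a single target file (simplified: in reality
 * `ino` would be resolved to the right file in the disk file system). */
void replay_log(int log_fd, int target_fd) {
    struct log_rec rec;

    /* Walk the append-only log until it ends or an empty record is hit. */
    while (read(log_fd, &rec, sizeof(rec)) == sizeof(rec) && rec.len != 0) {
        void *data = malloc(rec.len);
        if (!data || read(log_fd, data, rec.len) != (ssize_t)rec.len) {
            free(data);
            break;
        }
        /* Re-apply the logged write to the disk file system. */
        if (pwrite(target_fd, data, rec.len, (off_t)rec.off) < 0) {
            free(data);
            break;
        }
        free(data);
    }
    fsync(target_fd);   /* make the replayed state durable on disk */
}
```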
While the write-ahead log concept may seem simple, a key difference from database WALs is that NVLog must account for the ordering between NVM writes and the underlying disk writes. A database can strictly enforce writing to the WAL before writing to the data area. However, since NVLog is designed as a transparent "intermediate layer" to both the user and the file system, we cannot modify the user interface or the mechanisms for writing back to the disk. Furthermore, because user asynchronous writes, synchronous writes, and DRAM cache flushes to the disk may occur in any order, the data on the disk could be corrupted if we blindly replayed every NVM record onto the disk. We therefore provide a mechanism to ensure that data recovered from NVM is always more recent than the disk version, preventing older data from overwriting newer data.
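To illustrate how such an ordering guarantee can be reasoned about (this is our own simplified illustration, not necessarily the mechanism NVLog implements), suppose each log record carries a sequence number and the system tracks, per file, the highest sequence already written back to disk; recovery then replays only records newer than that watermark:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed per-record metadata (see the log sketch above). */
struct log_rec_hdr {
    uint64_t seq;     /* sequence assigned when the record was logged */
    uint64_t ino;
};

/* `disk_watermark` = highest sequence whose data is already known to be on
 * disk for this file (recorded when the page cache was written back).
 * A record is replayed only if it is strictly newer than that, so a stale
 * log entry can never overwrite fresher on-disk data. */
bool should_replay(const struct log_rec_hdr *rec, uint64_t disk_watermark) {
    return rec->seq > disk_watermark;
}
```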
In addition to the designs mentioned above, we also explore efficient log structure and fine-grained synchronous writes in NVLog. We encourage interested readers to refer to our FAST '25 paper [7] for more details.
We implemented a prototype of NVLog and conducted a series of experiments based on it. The complete experimental results can be found in our FAST '25 paper; here, we present two representative experiments.
First, to demonstrate the applicability of NVLog across a wide range of application scenarios, we designed experiments with varying read-to-write ratios and synchronous-to-asynchronous write ratios under different file systems, comparing NVLog against NOVA, SPFS, and NVLog (AS). Note that NVLog (AS) refers to running NVLog with all writes forced to be synchronous, which roughly approximates the behavior of P2CACHE.
The results are shown in Figure 3. Thanks to our DRAM-NVM cooperative design, NVLog outperforms NVM FS, disk FS, and NVM-based FS accelerators in most cases. In non-sync workloads, by leveraging the DRAM page cache, NVLog performs similarly to its baseline disk FS, running up to 3.72x, 2.93x, and 1.24x faster than NOVA, NVLog (AS), and SPFS, respectively. In partial-sync workloads, NVLog outperforms the disk FS, NOVA, and SPFS by up to 4.44x, 2.62x, and 324.11x, respectively. These results show that NVLog consistently maintains a good balance between DRAM and NVM access across various sync levels. Notably, NVLog is the only solution that does not introduce any slowdown to the legacy disk FS.

Next, we tested NVLog's space usage with an 80 GB fully synchronous write workload; the results are shown in Figure 4. With garbage collection enabled, NVLog's space usage never exceeded 22 GB and gradually dropped to near zero after the experiment finished. This temporary and relatively small space footprint is a result of our log-based design. In contrast, using NVM in the form of a file system would require NVM space equal to the entire volume of written data, i.e., 80 GB. Our lightweight design is thus a better fit, especially given Optane's discontinuation and the potentially limited capacities of alternative products.

In summary, we propose NVLog, which uses NVM as a write-ahead log (WAL) for file system synchronous operations, enabling transparent acceleration of synchronous writes while preserving the benefits of DRAM caching for asynchronous writes and reads. Thanks to our efficient design, NVLog achieves higher performance and lower space usage across a broader range of application scenarios than previous work. More importantly, unlike prior solutions, NVLog does not introduce any slowdown to existing applications in any scenario. We believe this "painless" use of NVM is more likely to be widely accepted by users of legacy storage systems.