Zoned Namespace (ZNS) SSDs are an emerging class of zoned storage devices. Unlike conventional SSDs, ZNS SSDs redistribute some of the responsibilities traditionally held by the device, delegating them to the host. This redistribution of functionality creates both challenges and opportunities. Bill Jannen and Rik Farrow had the good fortune to sit with two of the authors of “ZNS: Avoiding the Block Interface Tax for Flash-based SSDs” and to hear their impressions on the ZNS standard. Throughout our conversation, we touched on the past, present, and future of ZNS. Below are some of the highlights from that discussion.
This interview has been edited for clarity and length.
The SSD Status Quo
Bill Jannen: ZNS represents a new interface for SSDs, but it might be a good idea to start by explaining what the previous SSD interface was. What were we dealing with before ZNS came along?
George Amvrosiadis: For decades, storage devices have relied on the block interface. A block device is exposed to the host and the operating system as a one-dimensional array of logical blocks that can be read or written atomically. It doesn’t matter what the storage device hardware actually looks like.
A key, unwritten assumption about the block interface is that if you write or overwrite a block address, these operations take the same effort from the storage device. That assumption is no longer universally true.
BJ: When was that true?
GA: For older hard disks where you had to seek to a specific sector and write it. When you have to overwrite the sector, you seek to it again and overwrite it again. You always have to pay the same cost for a data access: the seek cost, the rotational latency, and the data transfer time.
Rik Farrow: And it’s always a sector-size block at a time. You’re not writing a stream.
GA: The property that I mentioned before doesn’t hold for new hard disks because, in our efforts to try to pack more bits on a platter, we started partially overlapping tracks—I’m referring to Shingled Magnetic Recording (SMR). Overwriting a sector in that case—which is the smallest unit of data access on a hard disk—requires erasing multiple tracks’ worth of data. So already, you see the difference between writing and overwriting: overwriting requires a lot more work from the device.
SSDs have always dealt with asymmetry between interface operations and work performed by the device. SSDs allow writes to occur at the granularity of pages, but pages can only be erased in large groups, called erase blocks, before they can be filled with new data. Similar to SMR hard disks, updating a page’s data in place for every overwrite would be too expensive. The way we have solved this problem for SSDs is by writing the new data in a new location, and then keeping track of this new location.
RF: And that’s the FTL, flash translation layer. FTLs have traditionally been implemented in SSD firmware; FTLs perform the management tasks described above internally to the drive so that users can address an SSD as if it were a random-access block device.
Part of the problem is that there are things going on behind the scenes. When you erase a block, you may have in-use data still in that block, and that data has to be copied to other blocks and put in the map and so on. And that means that you don’t get predictable performance because you don’t know what’s going on inside the SSD.
GA: That’s absolutely right. The goal of the FTL is to close the semantic gap between what the interface provides and the characteristics of the hardware.
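[Editor’s note: for readers new to FTL internals, here is a minimal, purely illustrative C sketch of the page-level, out-of-place writes described above. The names are invented for illustration; real FTLs also track free and stale pages, do wear leveling, and persist the map across power loss.]

#include <stdint.h>

#define NUM_LOGICAL_PAGES (1u << 20)

static uint32_t l2p[NUM_LOGICAL_PAGES];  /* logical page -> physical page, kept in DRAM */
static uint32_t next_free_ppa;           /* next free physical page in the open erase block */

/* An overwrite never updates flash in place: the new data goes to a fresh
 * physical page and the mapping is redirected, leaving the old copy stale
 * until garbage collection erases its block. */
uint32_t ftl_write(uint32_t lpa, const void *data)
{
    uint32_t ppa = next_free_ppa++;      /* allocate a fresh physical page */
    (void)data;                          /* program_flash_page(ppa, data) elided */
    l2p[lpa] = ppa;                      /* any previous mapping becomes stale */
    return ppa;
}

/* Reads consult the map to find where the latest copy of a logical page lives. */
uint32_t ftl_lookup(uint32_t lpa)
{
    return l2p[lpa];
}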
BJ: In your paper, you refer to a “block interface tax”. Are these the types of overheads that this tax is describing? You have this mapping/translation that the FTL needs to keep track of, garbage collection, and copying blocks. Is there anything else that falls into this block interface tax?
GA: Yes. The data movement that Rik described earlier is the problem. We call it a tax because it reduces the device’s efficiency across multiple dimensions.
One of the dimensions is what Rik mentioned: performance unpredictability. You have periodic garbage collection, and you have to reclaim all stale blocks. This can interfere with performance, and it can drive up tail latencies, which have become the bane of a lot of people who are deploying conventional SSDs.
Another dimension is cost per gigabyte. You have to allocate additional capacity in order to stage data that you are moving around. This was a surprise for me—I learned this from Matias—you need as much as 23% of extra capacity on write-heavy enterprise devices, just to do this data staging work. Imagine being able to get rid of this process: all of this extra capacity can now be exposed to the user. If you could do that, then you could also get rid of additional DRAM required to keep mappings of physical and logical addresses, which would further decrease hardware cost. While NAND media cost is the bulk of the cost of high-capacity SSDs, DRAM is another very expensive component for those devices.
The Quest to Close the SSD Semantic Gap
BJ: FTLs are a real boon for applications that want the speed of NAND flash and compatibility with the standard block interface. But paying the “block tax” is obviously not ideal. Years of work were sunk into efforts to mitigate the tax, and the standardization of ZNS is the culmination of much of that work. Can you share some highlights from the path towards what has become ZNS today?
Matias Bjørling: Before ZNS, there was a history of hyperscalers developing and deploying their own custom SSD solutions with their own tradeoffs. However, as these solutions were somewhat similar from the outside, around 7 or 8 years ago there was a slight movement towards coalescing these implementations into a single storage interface called Open-Channel SSDs, which turned into a public specification that allowed hyperscalers, SSD vendors, and software developers to collaborate and build a thriving ecosystem.
However, while I would like to say that was the plan all along, Open-Channel’s beginning was much simpler. As an aspiring PhD candidate, I wanted to do research on SSDs, and at that point they were “black boxes”; I wanted to make them “white boxes”. My project goal was to have a way where we could do research on flash translation layers on SSDs, optimize apps based on SSDs, and so on. It was only later, after I graduated and got to combine the work with real hardware, that the project began to grow.
Over the years the Open-Channel SSD project matured and started to gain interest from multiple large-scale cloud vendors like Microsoft, Alibaba, and others. We got to a point where it was great that we had a specification, but it wasn’t ratified by any standards organization. Together with Microsoft, we started project Denali, where we began with the Open-Channel specification and worked on it in a group with the goal of later ratifying it as a standard document. But that didn’t happen because, halfway through, we decided to go into NVMe and standardize it the proper way. And that is what became what ZNS is today. ZNS does have several improvements over the Open-Channel SSD interface, which by design was quite host-side heavy in some regards. One example I’m very proud of is that we got the wear-leveling responsibility of the media to stay on the device side, whereas in the Open-Channel SSD interface it was managed by the host. As we had learned quite a bit from the original interface, we took care not to repeat the mistakes of the original Open-Channel SSD specification.
Today everyone is kind of like “that’s just how it’s supposed to be”, but 8 years ago, with the Open-Channel SSD project, people were like, “Why are you exposing these kinds of things off the SSD?” People would say, “We’re not going to do any of these things. That’s never going to happen”. But when Microsoft endorsed it, things did start to happen. The perception changed. So we did a lot of learning throughout those many years, and ZNS came from all of that knowledge. We started from a clean slate and incorporated all our learnings along the way to create a robust and future-proof interface.
Now we get to the question of how to split the interface: what needs to be in the drive and what needs to be in the host? The initial Open-Channel went to one side—all the way out, where we exposed everything. Then we pulled back a little bit and updated Open-Channel to 2.0, which left much of the media management responsibility with the SSD. We tried to find out what is the right interface, the right level of abstraction. If we look at NAND media, managing the reliability is actually quite complex in the sense that one vendor’s media is different from another vendor’s media, and they need different kinds of functionality; different things have to be done to maintain the reliability of the media. All SSDs do read scrubbing, for example, but they might do it in different ways and at different speeds, depending on what the media is capable of. There are different ways that they’re writing to the media—there are things called read and write disturbs, and to minimize disturbs, you write to the media in a specific order. All of those things are different depending on the vendor and the generation of the media—and even within the same generation of the media, depending on your use case. There are a lot of these tradeoffs. While, yes, it would be great to expose all of that to the host, we are in the situation where the SSD maker and the controller know what the media is and what it does—they’ve characterized it—so they have all this knowledge of the media, but the host doesn’t.
The host can only do a generic recovery of data. So whenever the host wants to do something to recover data, it brings out the “big hammer”. Sometimes the “big hammer” isn’t necessary to get the data, but the host doesn’t know that. The only thing it knows is that, well, this bit couldn’t be read. And yes, you could extend the host interface so that you have that kind of information, but then you’re adding complexity, and this was maybe OK in the olden days, when SSDs were relatively slow compared to SSDs today. Looking at the next generation, there are SSDs that can do more than a million I/Os per second and read 14 gigabytes per second; if the host has to think too much about each I/O, it cannot do anything meaningful. So while it would be nice to have that kind of exposure to the host, in practice you’re hurting the overall performance with that kind of access. You actually want someone to take care of it for you, not only because it knows more about the media, but also because it’s more efficient overall to have it there. Each has their own responsibility.
BJ: I find that whole discussion very interesting because the paper puts a lot of focus on getting good performance by reducing this block tax, but you just described a whole different discussion of where we draw the line between the role of the firmware and what the host is responsible for. I think a lot of this gets back to the question of what is the interface, what are the set of abstractions that we want to expose to the application writers so that they can manage the important parts of the SSD—or at least the parts that will help us eliminate this block tax? Could you take a second and describe some of the abstractions? As a programmer, if I’m excited to get started with a ZNS SSD, how do I start programming? What do I have available to me?
The ZNS Interface
GA: The Zoned Namespaces interface, or ZNS, groups a device’s logical block addresses into zones. Those zones can only be written sequentially from beginning to end, meaning you can only append to a zone, and a zone must be erased before you start over. This encapsulates the hardware limitations that we’ve talked about before, like the fact that you need to erase an entire block before overwriting any of the pages in it. ZNS also encapsulates garbage collection in the interface, because every time you erase a zone, you are essentially garbage collecting the entire thing.
We should probably also say something about the types of writes that ZNS allows. I mentioned that zones are written from beginning to end. There are essentially two operations that can be used to achieve that. You could say “write to the next LBA” (logical block address), but of course the LBA you’re writing to needs to be a higher address than the one before. You can’t go back and write a previous address once that’s done. The other operation is zone append.
BJ: So overwrites are completely off the table?
GA: Absolutely.
MB: Yes, George is right. What has been standardized today is that, for ZNS, you need to write sequentially. If you want to write to a specific page again, you need a “zone reset”—which erases the zone’s contents and reverts the zone to an empty state—and then you can write to that page again. But inherently, with an SSD, you can write randomly to a zone if you want to. It just hasn’t been standardized yet. Such a zone type would allow the SSD’s write amplification to stay the same as with sequential write zones. There is a different set of tradeoffs, but it is very much possible. It’s just that the first thing we went and standardized was sequential write, getting that stable and working. There are a lot of hardware optimizations that one can do when just sequential writes are allowed, but having both random and sequential writes is possible; it’s just a matter of the hardware resources required to do that. In general, for the zone interface, you have zones of a specific zone type. Initially, ZNS supports only a single zone type, which you have to write sequentially, and then you do a zone reset. The goal of the initial zone type was to make it as easy as possible to implement software to enforce these workloads. Today there are already applications that have all these things done; they just don’t yet align their writes to zone boundaries by getting the zone information from the drive.
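[Editor’s note: for the curious, the following is a rough sketch of what driving a zone from user space can look like on Linux, using the zoned block device ioctls in linux/blkzoned.h. The device path is hypothetical, and error handling and write-buffer alignment are largely elided.]

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

int main(void)
{
    /* Hypothetical ZNS namespace exposed as a zoned block device. */
    int fd = open("/dev/nvme0n2", O_RDWR | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* Ask the kernel to report the first zone: its start, length, and write pointer. */
    struct { struct blk_zone_report hdr; struct blk_zone zone; } rep;
    memset(&rep, 0, sizeof(rep));
    rep.hdr.sector = 0;        /* start reporting from the beginning of the device */
    rep.hdr.nr_zones = 1;      /* room for a single zone descriptor */
    if (ioctl(fd, BLKREPORTZONE, &rep.hdr) < 0) { perror("BLKREPORTZONE"); return 1; }

    /* A sequential write must land exactly at the write pointer
       (the kernel reports zone fields in 512-byte sector units). */
    off_t wp_bytes = (off_t)rep.zone.wp * 512;
    /* pwrite(fd, aligned_buf, aligned_len, wp_bytes);  -- buffer alignment elided */
    (void)wp_bytes;

    /* A zone reset erases the zone's contents and rewinds its write pointer. */
    struct blk_zone_range range = { .sector = rep.zone.start, .nr_sectors = rep.zone.len };
    if (ioctl(fd, BLKRESETZONE, &range) < 0) { perror("BLKRESETZONE"); return 1; }

    close(fd);
    return 0;
}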
GA: I also alluded to another command called zone append. With the write semantics we have described so far, applications need to guarantee that the order of writes is sequential with regard to LBAs. Zone append is something I’m really excited about because it simplifies this enforcement of a strict append-only write ordering in a way reminiscent of nameless writes: data in an append command is appended to the end of a zone’s data, and the location (address) where the data was actually written is returned. So applications don’t know the address of the data until the append has succeeded. We’re still figuring out how to use this with different applications.
MB: We did not write much about zone append in the paper, because we are still studying its applicability to different use cases. Normally, when an application does a write I/O that goes through the Linux kernel, for example, there is a lot of work going on to make sure that—especially for zoned block devices—write I/Os are submitted serially down to the device. And that work has a certain overhead. That overhead can of course be improved so that it is as small as possible, but with zone append, you can say “here’s the data” to the drive and the drive will tell you where it placed the data. One way to look at it is that zone append is a kind of block allocator on the drive. Instead of the host deciding where you write next, you move that responsibility down to the drive.
There has been literature in the past that has looked at nameless writes, but nameless writes were very much a full-blown implementation without the sequential constraint that we have, for example, with zone append. Their generality meant that it was a much harder problem to solve for the SSD. From the ZNS SSD’s point of view, the only thing it has to do for zone append is receive the data, look at where the current write pointer is for that particular zone, slot the data there, and then tell the host where the data went. The command for appending data is much simpler to implement in hardware. It is one of those cases where simplification becomes more powerful because it gives you certain promises that make it easier to work with.
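[Editor’s note: a conceptual sketch of the drive-side append handling Matias describes. The structure and function names are invented for illustration and are not taken from the NVMe specification.]

#include <stdint.h>

struct zone {
    uint64_t start_lba;   /* first LBA of the zone */
    uint64_t capacity;    /* number of writable blocks in the zone */
    uint64_t wp;          /* write pointer, in blocks relative to start_lba */
};

/* Append nblocks of data to the zone and return the LBA where it landed,
 * or UINT64_MAX if the zone boundary would be exceeded. The host learns
 * the address only from the completion; it never chooses it. */
uint64_t zone_append(struct zone *z, const void *data, uint64_t nblocks)
{
    if (z->wp + nblocks > z->capacity)
        return UINT64_MAX;                    /* zone is full */
    uint64_t lba = z->start_lba + z->wp;      /* slot the data at the write pointer */
    (void)data;                               /* program_media(lba, data, nblocks) elided */
    z->wp += nblocks;                         /* advance the write pointer */
    return lba;                               /* returned to the host in the completion */
}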
Locating the Logic
BJ: Speaking of hard problems that need to be solved, if we want to use a new, non-block interface like ZNS, there are some application changes we might need to make in order to communicate to a device that is no longer using the standard block interface. What options do we have as far as where this logic lives? If I wanted to go and implement an application that is aware of ZNS, where could I implement that application within the layered software stack that we have?
GA: Everywhere.
[laughter]
You could implement zone logic on the device, and that’s an FTL. You could implement zone logic at the block layer or the device mapper layer; I know that Western Digital, alongside the Linux kernel folks, has done a lot of work to enable this. You could implement zone logic in the file system. We tried that with F2FS and talk about that in our paper. You could go all the way to the application, so you could have logic in the application that communicates with a device directly. We’ve done this with RocksDB, for example.
One question is: what are the tradeoffs between implementations at different layers? I don’t want people to think, “Hey, you could implement logic for X in any of these layers, so choose whichever layer you feel more comfortable programming in.” It’s not really like that. The higher up in the storage stack that you are, the closer you are to the user, and the easier it is to express the application’s intent.
I hope one thing that comes across in the paper is that when we implement zone logic as high up as the RocksDB application, performance is slightly higher than if you just modify F2FS and run RocksDB unmodified on top of that. And it really comes down to hints. Maybe I’m oversimplifying, but in my mind it comes down to the fact that lower levels have limited insight into application intent.
The Payoffs
BJ: OK, that’s exciting. Not only do we have all of these options, you have actually implemented them. Could you talk about which applications are particularly amenable to the benefits of ZNS, and maybe share some of the highlights from what you’ve done there?
GA: Great fits are applications with write-anywhere semantics such as key-value stores or file systems. One example is the F2FS log-structured file system. Another is RocksDB, a popular key-value store using log-structured merge trees. For us, those were the obvious first targets in terms of applications. But at the same time, they are really exciting applications that are popular in industry.
BJ: One other target that you mentioned before was incorporating ZNS logic into the device mapper or block layer. In your paper, you describe a device mapper implementation that essentially gives you a translation layer, but in the host.
GA: For the block layer, it gets tricky. The solutions that we currently have at that layer guarantee correctness, but not necessarily performance. The reason is that you have to stage and move data just like the FTL does, because you have to assume that the layers above you don’t know about zones, and may perform random writes. Perhaps there is more to be done here starting with the rich literature on SMR drive support. But you’re going to have to pay a cost for that, in the form of I/O amplification for one. You could potentially minimize that using heuristics that fit your application.
BJ: Then you might as well do it in the application at that point.
GA: Maybe. We need to explore more applications to gauge the amount of effort required to enable zone awareness.
BJ: In your paper you showed some real payoffs to making these changes. Could you talk about the ways that this block tax was hurting us before, and how big the gains were once you eliminated it?
GA: The paper has results on throughput and latency improvements that are tangible. These improvements don’t come from a different media; the same SSD hardware was used with different firmware implementing either a traditional FTL or the ZNS interface. The measured improvements are due to getting rid of on-device garbage collection. But I feel like that is only one part of the story. There’s also the monetary cost, but this remains to be determined when ZNS SSDs become widely available.
MB: It is roughly the same kind of work to build a normal SSD and a ZNS SSD, but one of the things that makes it different, in terms of cost, is that media overprovisioning doesn’t need to be there in the same way.
When you get an SSD today, it has this 7% overprovisioning, or 28% for enterprise drives. You pay for that media but you get nothing from it other than higher performance. ZNS gives you that capacity back. So you can use media that previously was not available to applications.
But one other thing that is really interesting about ZNS SSDs is that, when you fill up a traditional SSD, your performance degrades because the SSD’s write amplification increases. On a ZNS SSD, that doesn’t apply. Where things really start to take off is that, when you start filling up a non-ZNS drive above, let’s say, 60%, your write amplification increases, which impacts overall throughput, latency, and lifespan of the drive. With ZNS, not only do you get this 7 or 28% extra capacity, you also get capacity that you couldn’t use efficiently before, because if you did tap into it, you got lower performance, and it got worse and worse and worse as the drive filled. So today, when evaluating SSDs and using them in practice, people don’t fill them up. The biggest cost of the SSD is the media. The larger the SSD gets today, the more the media dominates the cost—more than 90% of the drive [cost] is the media. If you’re only using 60% of it, you’re paying roughly 40% extra for media that you don’t need. That’s also what ZNS does: now you can actually use all of your media while still having high performance.
So there are significant benefits to be had when fully integrating ZNS into the software stack.
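[Editor’s note: a back-of-the-envelope sketch of the capacity argument above, using the figures from the conversation; the drive size, the treatment of overprovisioning, and the 60% fill level are illustrative assumptions.]

#include <stdio.h>

int main(void)
{
    double raw_media_tb  = 16.0;  /* hypothetical raw NAND capacity of one drive */
    double overprovision = 0.28;  /* enterprise-class overprovisioning from the discussion */
    double usable_fill   = 0.60;  /* fraction users dare to fill before GC hurts performance */

    /* Advertised capacity hides the overprovisioned media; in practice only
       a fraction of that is filled to keep write amplification in check. */
    double conventional_tb = raw_media_tb / (1.0 + overprovision) * usable_fill;

    /* A ZNS drive exposes roughly all of the media, and it can be filled
       without an on-device garbage collection penalty. */
    double zns_tb = raw_media_tb;

    printf("conventional SSD, effectively usable: %.1f TB\n", conventional_tb); /* ~7.5 */
    printf("ZNS SSD, effectively usable:          %.1f TB\n", zns_tb);          /* 16.0 */
    return 0;
}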
Looking Ahead
BJ: Throughout this discussion, it has felt like we've talked around an important question without actually asking it. So let's get philosophical for a second: in your mind, who is ZNS for? I get the impression that the community has not yet reached a consensus on this, so I'm curious who you see as the likely adopters of ZNS.
MB: If you just need one drive, that’s great, but using a ZNS SSD in an application does require some engineering effort. This is why there is all this work being done to lower the barrier to use. But if you’re ordering thousands of SSDs, that’s when it really starts to pay off and you get a real gain for your effort.
This is obviously interesting for hyperscalers, all-flash array vendors, and large storage deployments that integrate it into their storage stacks. However, as ZNS becomes generally available, more and more use cases will be enabled out of the box, significantly reducing the barrier to entry and letting everyone easily get its benefits in their own setups as well.
GA: Conversations like this one are useful because they allow us to think outside our research bubble, where we’re obsessed with the future of ZNS and adding support for these devices in existing systems and applications. One of the biggest pushbacks when trying to publish our paper was that reviewers were not convinced of the applicability of ZNS. But something that is crystallizing for me through face-to-face discussions like this is that we need to do a better job of explaining that ZNS is about freedom of choice. With FTLs you had to accept that data placement and garbage collection would be handled by the SSD’s firmware. With ZNS you can have that logic implemented in the device driver, at the device mapper layer, at the file system layer, or even directly within your application if that makes sense for you.
Matias was talking about his graduate school dream of turning SSDs into white boxes, and that’s essentially what ZNS is achieving. If you’re uncomfortable with all of this control, then let the layers below handle it. F2FS can do that, as we’ve shown, and you can run your unmodified applications on F2FS. You could have a host-side FTL that implements every feature of FTL SSDs. If you’re one of those organizations that employ hundreds or thousands of engineers who are happy to hack away and optimize your custom storage stack, then you’ll be very excited about this. I expect we’ll realize that there’s more to gain—the predictable performance, the stable throughput, the cost reduction—making it worth abandoning the old way of doing things.
There are questions that we still need to answer, like if you ran different types of applications (beyond RocksDB, which we’ve explored) atop a ZNS-enabled F2FS versus F2FS running on an FTL, would it perform worse or would it perform better? So there is a lot to do for the research community there, because obviously the F2FS version we have now is just a proof-of-concept that we’ve put together. [laughter] And the FTLs, as we’ve mentioned, have had decades of work. ZNS is a really exciting opportunity because it’s not about leaving some applications out, but unlocking more options for optimizing the operation of modern storage devices.