Complexity of FreeBSD VFS using ZFS as an example. Part 1.

I spend a lot of time hacking on the ZFS port to FreeBSD and fixing various bugs. Quite often the bugs are specific to the port and not to the OpenZFS core. A good share of those bugs are caused by differences between the VFS model of Solaris (and its descendants such as illumos) and the VFS model of FreeBSD. I would like to talk about those differences.

But first a few words about VFS in general. VFS stands for “virtual file system”. It is an interface that all concrete filesystem drivers must implement so that higher level code could be agnostic of any implementation details. More strictly, VFS is a contract between an operating system and a filesystem driver.

In a wider sense VFS also includes the filesystem-independent code that provides higher level, more convenient interfaces to the consumers. For example, a filesystem must implement an interface for looking up an entry by name in a directory. VFS builds on that to provide a more convenient interface that performs a lookup using an absolute path, or a relative path and a starting directory. Additionally, VFS in the wider sense includes utility code that can be shared between different filesystem drivers.
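To make the split between the two layers concrete, here is a minimal sketch of a path walk built on top of a single-directory lookup. Everything in it is hypothetical: `toy_lookup` stands in for what a filesystem driver would provide, `toy_namei` for what the generic layer adds, and the in-memory `struct node` tree is only an illustration. It ignores symbolic links, locking and reference counting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy objects; these are not FreeBSD or Solaris types. */
struct node;

struct toy_dirent {
    const char  *name;
    struct node *target;
};

struct node {
    int                is_dir;
    size_t             nentries;
    struct toy_dirent *entries;   /* used only when is_dir != 0 */
};

/* Driver side: look up one name in one directory. */
static struct node *
toy_lookup(struct node *dir, const char *name, size_t len)
{
    if (dir == NULL || !dir->is_dir)
        return (NULL);
    for (size_t i = 0; i < dir->nentries; i++) {
        const char *n = dir->entries[i].name;
        if (strlen(n) == len && memcmp(n, name, len) == 0)
            return (dir->entries[i].target);
    }
    return (NULL);
}

/* Generic side: walk a whole path, one component at a time. */
static struct node *
toy_namei(struct node *root, struct node *cwd, const char *path)
{
    assert(path != NULL);
    /* An absolute path starts at the root, a relative one at cwd. */
    struct node *cur = (path[0] == '/') ? root : cwd;

    while (cur != NULL && *path != '\0') {
        while (*path == '/')          /* skip separators */
            path++;
        size_t len = strcspn(path, "/");
        if (len == 0)                 /* trailing slash */
            break;
        cur = toy_lookup(cur, path, len);
        path += len;
    }
    return (cur);                     /* NULL if any component is missing */
}
```

The walker never looks inside a directory itself; it only calls the per-directory lookup, which is the part each concrete filesystem would supply.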

VFS Overview

Common VFS models for UNIX and UNIX-like operating systems share some requirements on the structure of a filesystem. First, it is assumed that there are special filesystem objects called directories that provide mappings from names to other filesystem objects, which are called directory entries. All other filesystem objects contain data or provide other utilities. The directories form a directed rooted tree starting with a specially designated root directory. In other words, it is a connected, rooted, directed acyclic graph where each edge has a name associated with it. Non-directory objects may be reachable by multiple paths. Put differently, a non-directory object can be a directory entry in more than one directory, or it can appear as multiple entries with different names in a single directory. Conventionally those multiple paths to a single object are referred to as hard links. Additionally, there is a designated object type called a symbolic link that can contain a relative or an absolute path. When traversing such a symbolic link object a filesystem consumer may jump to the contained path, called the symbolic link destination. It is not required to do so, however. Symbolic links make it possible to create the appearance of arbitrary topologies, including loops or broken paths that lead nowhere.

A directory must always contain two special entries:

  • “.” (dot) refers to the directory itself
  • “..” (dot dot) refers to a parent directory of the directory or to itself for the root directory

Each filesystem object is customarily referred to as an inode, especially in the context of a filesystem driver implementation. VFS requires that each filesystem object must have a unique integer identifier referred to as an inode number.

At the VFS API layer the inodes are represented as vnodes where ‘v’ stands for virtual. In object oriented terms the vnodes can be thought of as interfaces or abstract base classes for the inodes of the concrete filesystems. The vnode interface has abstract methods known as vnode operations or VOPs that dispatch calls to concrete implementations.

Typically an OS kernel is implemented in C, so object oriented facilities have to be emulated. In particular, a one-to-one relation between a vnode and an inode is established via pointers rather than by using an is-a relationship. For example, here is how a method for creating a new directory looks in FreeBSD VFS:

     int
     VOP_MKDIR(struct vnode *dvp, struct vnode **vpp,
         struct componentname *cnp, struct vattr *vap);

dvp (“directory vnode pointer”) is a vnode that represents an existing directory; the method is dispatched to the implementation associated with this vnode. If the call is successful, then vpp (“vnode pointer to pointer”) points to a vnode representing the newly created directory. cnp defines the name of the new directory and vap its various attributes. The same method in Solaris VFS has a few additional parameters, but otherwise it is equivalent to the FreeBSD VFS one.
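The dispatch and the pointer-based vnode/inode pairing can be sketched in plain C as follows. This is a simplified illustration, not the FreeBSD implementation: the field names `v_op` and `v_data` are borrowed from FreeBSD's `struct vnode`, but `toy_vnode`, `TOY_VOP_MKDIR`, the `memfs` filesystem and the reduced argument list (a plain string instead of `componentname` and `vattr`) are all assumptions made for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_vnode;

/* Table of "virtual methods"; one instance per concrete filesystem. */
struct toy_vnodeops {
    int (*vop_mkdir)(struct toy_vnode *dvp, struct toy_vnode **vpp,
        const char *name);
};

struct toy_vnode {
    const struct toy_vnodeops *v_op;   /* dispatch table */
    void                      *v_data; /* filesystem-private inode */
};

/* Generic entry point: dispatch on the directory vnode. */
static int
TOY_VOP_MKDIR(struct toy_vnode *dvp, struct toy_vnode **vpp, const char *name)
{
    assert(dvp != NULL && vpp != NULL);
    if (dvp->v_op == NULL || dvp->v_op->vop_mkdir == NULL)
        return (EOPNOTSUPP);
    return (dvp->v_op->vop_mkdir(dvp, vpp, name));
}

/* A hypothetical concrete filesystem, "memfs". */
struct memfs_inode {
    long             ino;
    struct toy_vnode vnode;            /* inode -> vnode */
};

static int memfs_mkdir(struct toy_vnode *, struct toy_vnode **, const char *);
static const struct toy_vnodeops memfs_ops = { .vop_mkdir = memfs_mkdir };

static int
memfs_mkdir(struct toy_vnode *dvp, struct toy_vnode **vpp, const char *name)
{
    /* A single static slot keeps the sketch short; a real driver allocates. */
    static struct memfs_inode newdir;
    struct memfs_inode *parent = dvp->v_data;

    if (name == NULL || name[0] == '\0')
        return (EINVAL);
    newdir.ino = parent->ino + 1;
    newdir.vnode.v_op = &memfs_ops;
    newdir.vnode.v_data = &newdir;     /* vnode -> inode */
    *vpp = &newdir.vnode;
    return (0);
}
```

The caller only ever sees `struct toy_vnode`; the concrete filesystem recovers its own object through `v_data`, which is how the one-to-one relation replaces an is-a relationship.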

It would be wasteful or even plain impossible to have vnode objects in memory for every filesystem object that could potentially be accessed, so vnodes are created upon access and destroyed when they are no longer needed. Given that C does not provide any sort of smart pointers the vnode life cycle must be maintained explicitly. Since in modern operating systems multiple threads may concurrently access a filesystem, and potentially the same vnodes, the lifecycle must be controlled by a reference count. All VFS calls that produce a vnode such as lookups or new object creation return the vnode referenced. Once a caller is done using the vnode it must explicitly drop a reference. When the reference count goes to zero the concrete filesystem is notified about that and should take an appropriate action. In Solaris VFS model the concrete filesystem must free both its implementation specific object and the vnode. In FreeBSD VFS the filesystem must handle its private implementation object, but the vnode is handled by the VFS code.
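A minimal sketch of the single reference count model, roughly in the Solaris spirit where the filesystem frees both its private object and the vnode once the last reference is dropped. The names (`rc_vnode`, `rc_hold`, `rc_rele`, `toy_vget`, `toy_frees`) are hypothetical, and the counter is not atomic, so this shows only the life cycle and not a thread-safe implementation.

```c
#include <assert.h>
#include <stdlib.h>

struct rc_vnode {
    int    v_count;                          /* reference count */
    void (*v_inactive)(struct rc_vnode *);   /* run when count hits zero */
    void  *v_data;                           /* filesystem-private object */
};

static int toy_frees;   /* counts destroyed vnodes, for demonstration only */

/* Filesystem side: free the private object and the vnode itself. */
static void
toy_inactive(struct rc_vnode *vp)
{
    free(vp->v_data);
    free(vp);
    toy_frees++;
}

/* Any call that produces a vnode returns it already referenced. */
static struct rc_vnode *
toy_vget(void)
{
    struct rc_vnode *vp = calloc(1, sizeof(*vp));

    if (vp == NULL)
        return (NULL);
    vp->v_data = malloc(16);
    if (vp->v_data == NULL) {
        free(vp);
        return (NULL);
    }
    vp->v_inactive = toy_inactive;
    vp->v_count = 1;
    return (vp);
}

static void
rc_hold(struct rc_vnode *vp)
{
    vp->v_count++;
}

/* Drop a reference; the last drop notifies the filesystem. */
static void
rc_rele(struct rc_vnode *vp)
{
    assert(vp->v_count > 0);
    if (--vp->v_count == 0)
        vp->v_inactive(vp);   /* vp must not be touched after this */
}
```

In the FreeBSD model the equivalent of `toy_inactive` would release only `v_data`, leaving the vnode structure to the VFS code.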

In practice an application may perform multiple accesses to a file without having any persistent handle open for it. For example, the application may call access(2), stat(2), and similar system calls. Likewise, lookups by different applications may frequently traverse the same directories. As a result, it would be inefficient to destroy a vnode and its associated inode as soon as its use count reaches zero. All VFS implementations cache vnodes to avoid the expense of their frequent destruction and construction. Also, VFS implementations tend to cache path to vnode relationships to avoid the expense of looking up a directory entry via a call to a filesystem driver, VOP_LOOKUP.

Obviously, there can be different strategies for maintaining the caches. For example, a life time of a cache entry could be limited; or total size of the cache could be limited and any excess entries could be purged in a least recently used fashion or in a least frequently used fashion. And so on.

Solaris VFS combines the name cache and the vnode cache. The name cache maintains an extra reference on a vnode, so the vnode is not recycled as long as it is present in the name cache. The advantage of this cache unification is simplicity. The disadvantage is that the two caching modes are coupled. If a filesystem driver for whatever reason wanted all lookups to always go through it and did not use the name cache, then there would be no vnode caching for it either.

As soon as the vnode reference count goes to zero Solaris VFS invokes the VOP_INACTIVE method, which instructs the filesystem driver to free the vnode and all internal resources associated with it. Theoretically, the filesystem driver could have its own internal cache of vnodes, but that does not seem to happen in practice.

FreeBSD VFS maintains separate caches and as a result it has a more complex vnode lifecycle. First, FreeBSD VFS maintains two separate reference counts on a vnode. One is called a use count and is used to denote active uses of the vnode such as by a system call in progress. The other is called a hold count and it denotes “passive” uses of the vnode, which means that a user wants a guarantee that its vnode pointer stays valid (e.g. it would not point to freed memory), but the user is not going to perform any operations on the vnode. The hold count is used, for instance, by the FreeBSD VFS name cache, but there are other uses. vnode usage implies vnode hold, so every time the use count is increased or decreased, the same is done to the hold count. As a result the hold count is always greater than or equal to the use count. When the use count reaches zero FreeBSD VFS invokes the VOP_INACTIVE method, but it has a radically different meaning from VOP_INACTIVE in Solaris VFS. This is just a chance for the filesystem driver to perform some maintenance on a vnode, but the vnode must stay fully valid and thus its associated inode must stay valid. An example of the maintenance is removing a file that was unlinked from a filesystem namespace but was still open by one or more applications. A vnode with zero use count is considered to be in an inactive state. Conversely, a vnode with non-zero use count is said to be in an active state.

When the hold count reaches zero the vnode is not immediately freed, but is transitioned to a so called free state. In that state the vnode stays fully valid, but is subject to being freed at any time unless it is used again. The free vnodes are placed on a so called free list, which is in essence the vnode cache.

FreeBSD VFS has configurable targets for a total number of vnodes and vnodes in the free state. When the targets are exceeded the free vnodes get reclaimed. This is done by invoking VOP_RECLAIM method. The filesystem driver must free all its internal resources associated with a reclaimed vnode. The reclaimed vnode is marked with a special DOOMED flag. That flag is an indication that the vnode is invalid in the sense that it is not associated with any real filesystem object. Any operations on such a vnode return an error or in some cases lead to a system crash. Thus, we have another vnode state that can be called doomed or less dramatically reclaimed.

While being reclaimed the vnode must be held (its hold count greater than zero) at least by the code that initiates reclamation. Once all holds are released, the vnode that is DOOMED and that has zero hold count is really destroyed.
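The states described so far can be summarized in a small model. The function names loosely follow FreeBSD's vref/vrele and vhold/vdrop, but the structure, the boolean flags and the state enum are assumptions made for illustration; there is no locking, no free list and no real memory management here. Skipping the inactive hook for an already doomed vnode is a simplification of this sketch, not a statement about FreeBSD.

```c
#include <assert.h>
#include <stdbool.h>

enum fb_state { FB_ACTIVE, FB_INACTIVE, FB_FREE, FB_DOOMED, FB_DESTROYED };

struct fb_vnode {
    int  usecount;        /* active uses */
    int  holdcnt;         /* active + passive uses */
    bool doomed;          /* reclaimed: no longer backed by a real object */
    bool destroyed;       /* stands in for actually freeing the memory */
    int  inactive_calls;  /* how many times the inactive hook ran */
};

static enum fb_state
fb_classify(const struct fb_vnode *vp)
{
    assert(vp->holdcnt >= vp->usecount);   /* use implies hold */
    if (vp->destroyed)
        return (FB_DESTROYED);
    if (vp->doomed)
        return (FB_DOOMED);
    if (vp->usecount > 0)
        return (FB_ACTIVE);
    return (vp->holdcnt > 0 ? FB_INACTIVE : FB_FREE);
}

static void
fb_vhold(struct fb_vnode *vp)
{
    vp->holdcnt++;
}

/* Dropping the last hold destroys the vnode only if it is doomed. */
static void
fb_vdrop(struct fb_vnode *vp)
{
    assert(vp->holdcnt > 0);
    if (--vp->holdcnt == 0 && vp->doomed)
        vp->destroyed = true;
}

static void
fb_vref(struct fb_vnode *vp)
{
    vp->usecount++;
    fb_vhold(vp);
}

/* Dropping the last use gives the filesystem its maintenance hook. */
static void
fb_vrele(struct fb_vnode *vp)
{
    assert(vp->usecount > 0);
    if (--vp->usecount == 0 && !vp->doomed)
        vp->inactive_calls++;
    fb_vdrop(vp);
}

/* The reclaiming code must itself hold the vnode. */
static void
fb_reclaim(struct fb_vnode *vp)
{
    assert(vp->holdcnt > 0);
    vp->doomed = true;
}
```

A zero-initialized `struct fb_vnode` starts in the free state; taking a use makes it active, dropping the last use while a hold remains makes it inactive, and a reclaim followed by the last drop ends in the destroyed state.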

In FreeBSD VFS the hold count does not guarantee that the vnode remains valid. If the total vnode count exceeds the target and there are not enough free vnodes to meet the target, then inactive vnodes (zero use count, non zero hold count) can be reclaimed as well. As described above, the reclaimed vnode will stay around as long as there are any holds on it. That guarantees that dereferencing a vnode pointer is safe, but does not guarantee safety of any operations on the vnode. A vnode holder must check vnode state (often implicitly) before actually using it. In this sense a hold should be considered as a weak reference.
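The weak reference idea can be sketched as a holder that must try to upgrade its hold to a use before operating on the vnode, and must cope with failure. All names here (`weak_vnode`, `hold_to_use`, `cache_entry`, `cache_use`) are hypothetical, and the sketch leaves out the locking that would be needed to make the check and the upgrade atomic.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct weak_vnode {
    int  usecount;
    int  holdcnt;
    bool doomed;
};

/* Try to turn a hold (weak reference) into a use (strong reference). */
static int
hold_to_use(struct weak_vnode *vp)
{
    assert(vp->holdcnt > 0);   /* the pointer is safe to dereference */
    if (vp->doomed)
        return (ENOENT);       /* the object behind the vnode is gone */
    vp->usecount++;
    vp->holdcnt++;             /* use implies hold */
    return (0);
}

/* A cache entry keeps only a hold on its vnode. */
struct cache_entry {
    struct weak_vnode *vp;
};

/* Return a used vnode, or NULL after discarding a stale entry. */
static struct weak_vnode *
cache_use(struct cache_entry *ce)
{
    if (ce->vp == NULL)
        return (NULL);
    if (hold_to_use(ce->vp) != 0) {
        ce->vp->holdcnt--;     /* give up the stale hold */
        ce->vp = NULL;
        return (NULL);
    }
    return (ce->vp);
}
```

This mirrors how weak references behave in languages that have them: holding the pointer is always safe, but using the object requires a check that can fail.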

The complexity of FreeBSD vnode lifecycle management obviously needs a justification. Continuing the last sentence of the previous paragraph it would be tempting to declare the FreeBSD use count and the Solaris reference count to be a strong vnode reference. Not quite so…

And now it is time to introduce the first real world complexity that any VFS (that supports the feature) must deal with: forced unmounting. In some situations it is desirable to unmount a filesystem even though it is still in active use. This means that there are active vnodes with a non-zero use / reference count that are going to end up in an explicit or implicit doomed state because the actual filesystem objects will no longer be accessible.

In FreeBSD this is handled by “forcefully” reclaiming the active vnodes and transitioning them to the explicit doomed state. Solaris VFS does not have a state like that on the VFS level, so this state must be implemented in each concrete filesystem. Since the vnode will still appear as a valid vnode its inode must be kept around. Typically there would be a data member in the inode that would be marked with a special value to denote that the inode is not actually valid. Then every VOP of every concrete filesystem must check its special “doomed” tag and abort operation in an appropriate manner. Only when all references on the vnode are dropped will it and its inode be actually destroyed.
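The per-filesystem approach can be sketched like this. The filesystem `myfs`, its `i_unmounted` tag, the `myfs_getsize` operation and the choice of EIO as the error are all assumptions for illustration; the point is only that every operation of every filesystem has to repeat the same check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical filesystem-private inode with its own "doomed" tag. */
struct myfs_inode {
    int  i_unmounted;   /* set once the filesystem is forcibly unmounted */
    long i_size;
};

struct s_vnode {
    int   v_count;      /* still referenced, so it cannot be freed yet */
    void *v_data;       /* points at the myfs_inode */
};

/* Every operation must start with the same check. */
static int
myfs_getsize(struct s_vnode *vp, long *sizep)
{
    struct myfs_inode *ip = vp->v_data;

    assert(sizep != NULL);
    if (ip->i_unmounted)
        return (EIO);   /* vnode looks valid, but the object is gone */
    *sizep = ip->i_size;
    return (0);
}

/* Forced unmount: tag every inode that is still referenced. */
static void
myfs_force_unmount(struct myfs_inode **inodes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        inodes[i]->i_unmounted = 1;
}
```

In the FreeBSD model this check lives in the generic layer instead, because a forcefully reclaimed vnode is marked doomed by VFS itself.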

So, this is the first example of FreeBSD VFS trying to handle a common problem by itself and thus gaining complexity, whereas Solaris VFS stays simpler at the cost of deferring complexity to the concrete filesystems.

There are trade-offs in both approaches. Generalization reduces code duplication and maintenance, but increases the complexity of the generalized code and reduces flexibility. Leaving the problem to each concrete filesystem increases code duplication and the effort required for developing a new filesystem, but it also allows for greater flexibility of each filesystem implementation.


Part 2 of this article is coming next month.


  • Robert M. Koretsky

    Andriy, this is an excellent top-down view of ZFS! I was wondering why inodes were still referred to when using ZFS, and I see the clear distinction between a Virtual File System and its internal representations and a concrete system below it in your illustration. So I take it that inodes are still used in the chain of actually implementing the ZFS transactions. I’m still trying to get my head around that idea. Thanks for providing this post!
    Sincerely,
    Robert M. Koretsky
    UNIX: The Textbook 3rd edition co-author