Copy-on-Write Version Support for VFS under LinuxStephan Müller und Sven Widmer
The goal of project VVFS is to provide versioning support for regular files in the VFS layer of the Linux kernel. A new file version is created if a file is opened with the write flag. Aside from kernel mode changes, two user tools have been developed.
This paper briefly describes different appendages in version management and identifies related implementations. Subsequently a short introduction into the Linux file system architecture is given followed by a detailed description of changes made to the Linux kernel. Finally the performance of the implementation is evaluated and future work is identified.
- Related Work
- Linux File System Architecture
- Conclusion and Future Work
Whenever it comes to the modification of files, it might be desirable to recover old versions or to trace the process of modification. This can be archived by powerful tools like CVS and Subversion running within the userspace or within the file system itself.
Within the file system category the following subtypes can be distinguished:
Explicit - With the help of special tools
different file versions are maintained in cooperation with the
user. E.g. 3DFS provides the commands
checkoutfor version management meanwhile CFS implements a client/server approach in which a new version is created if the users uploads a file to the server.
- Snapshot/Checkpoint - Generally used by backup systems this method creates an image of the file system representing its state at a given point of time. Later, snapshots can be used to restore all data up to this moment. Examples implementing this versioning type are AFS and Plan9.
- Copy-On-Write (COW) - Instead of writing modifications to the original file a copy of this file is created. All changes are applied to this new version. File systems using this type are e.g. Files-11 (OpenVMS) and the Elephant file system.
The advantage of the COW approach implemented in the file system is that every program can make use of it. No special interface is needed because a normal write access creates a new version. This method is also transparent for the user as he has not to initiate the process of creating a new version. In contrast to the snapshot mechanism not all files have to be restored - in fact one particular version can be accessed again.
The concepts used in this paper have their origins in the Files-11 file system family of the operating system VMS. The following section describes some basic features of the Files-11 file system semantic which served as a template for project VVFS.
In VMS a file name consists of a name, a type identifier and a
version number. The version is separated by a semicolon (e.g.
fobar.txt;42). VMS supports up to 32767 different file
versions. This number can be limited by a directory property.
VMS provides the Distributed Command Language (DCL) for user
interaction. A directory for instance is created with the DCL
CREATE/DIRECTORY. If the additional qualifier
/VERSION_LIMIT is specified, the directory is created
with this limit otherwise the version limit is inherited from the
parent directory. Additionally this property can be altered with
SET FILE/VERSION_LIMIT later. If the version
limit is set to
-0 versioning of files is disabled.
If the user creates a new file VMS will assign the version number
1. Before changes of a file are written to the storage
media the largest version number is determined and incremented by
one. Afterwards, a new file with this version number is created and
filled with all data from the previous version and the changes made
in the meantime. This behavior corresponds to the COW semantic
discussed in Chapter 1.
When the user wants to access a file and he specifies no particularly version number the latest version is opened. If a intermediate version is modified and saved it will become the newest version with the largest number.
During the modification process of a file the version number is always incremented. If the adjusted version limit is exceeded the oldest version will be deleted. This leads to large version numbers and a unused lower domain.
For maintenance reasons the DCL command
PURGE has been
provided. With the help of this tool the user can remove older file
versions. Additionally every call of
PURGE reassigns the
remaining versions beginning with
In order to allow a clear classification of the changes made to the Linux kernel this chapter outlines the Linux file system architecture.
Figure 3.1 shows the basic components responsible for the file management in the Linux OS. The upper region displays the user mode and the lower area the kernel mode. At the very top of the picture applications use the standard library libc to access files. The library maps these request to system calls in order to enter the kernel mode.
The gateway for all file related operations is the Virtual File System (VFS). It provides an interface between the system libraries and the specific file system implementations like Ext3, ReiserFS and NFS. Thereby VFS serves not only as a abstraction layer, but in fact it provides a basic implementation of a file system which can be used and extended by different implementations. Additionally the VFS implements optimization methods like the directory entry cache which buffers recently accessed directory structures in order to reduce the latency of a path resolution.
Starting from features present in VMS, this chapter identifies different options of a copy-on-write based file system implementation for the Linux kernel.
Like in VMS it should be possible to specify the maximum number of versions allowed for a file on a per directory basis. This implies two tasks. Where should this meta information be stored and through what interface it can be altered?
4.1.1 Storage of DirMaxV
There are different ways to store DirMaxV as a directory specific
property. One option is to maintain the number in a separate file.
This file should be hidden to the user. This implies changes to the
readdir which displays all files of a directory.
Additionally a unique file name must be chosen and naming conflicts
with other files have to be prevented.
Another possible solution is to store DirMaxV in the inode of each directory. For compatibility reasons this step is only an option if there is reserved memory within the inode struct. So this approach depends on the specific file system implementation.
Because both options have drawbacks and a generic meta data
management is not available the decision of the storage location of
DirMaxV is left to the underlying file system. Therefor, the VFS
inode operation struct is extended by the two function pointers
setmaxversions. Every file
system which wants to support versioning has to implement these
For Ext2 this has been done exemplary. The Ext2 inode struct defines
four reserved bytes for the Linux OS. Two of them have been used for
storing DirMaxV. The variable is called
The other two bytes remain reserved (
4.1.2 Interface for the management of DirMaxV
The maximum number of file versions must be adjustable by the user. Therefor, an interface between user and kernel space is needed.
A possible solution is to introduce two new system calls for setting
and getting this value. Every system call corresponds to a unique
number. This association is maintained in the header file
include/linux/syscalls.h. The disadvantage of extending the
list of system calls is that this affects the hole kernel although
the changes are just file system related. Furthermore as long as
this modification is not included in the mainstream kernel the new
system calls can collide with other projects also introducing new
A better solution is to use the system call
ioctl which was
primarily introduced for device operations. It takes at least two
parameters, a file descriptor on which the operation is invoked and
the requested operation itself. Optional Parameters can be used to
pass data to kernel space and receive results.
The VVFS project introduces the
operations require a third parameter of the type
takes the desired value if DirMaxV is set. Respectively it returns
the current value of DirMaxV if it is requested.
Within the kernel space the function vfs_ioctl was extended. It
makes use of the previously mentioned function pointers
setmaxversion. In the case of
Ext2 the functions
ext2_get_max_versions have been registered to this
Another point is the security issue. Of cause only authorized users should be allowed to read and write DirMaxV. The normal approach would be to check if the file was opened with write permissions before granting change requests. In this case the file is a directory and directories can only be opened with read access because they are only modified by altering the files they contain.
Our solution incluces a function
conjunction with the inode of the directory and the access mask
MAY_WRITE this functions checks whether the user is capable
of performing modifications.
If different versions of a file should be maintained the question is how to distinguish between these versions and how they are presented to the user.
Like VMS this project stores the version number in the file name itself. The advantage of this approach is that no meta data or hidden files are needed. All versions are visible and accessible via the file name. The drawback is that a unique character has to be chosen which separates the version number from the base name.
VMS uses the semicolon as separator. In UNIX-like operating systems
this character is used for separating different commands entered
successively in one line. Therefor, it can not be applied. Instead
the mesh character (
#) was chosen. The kernel code uses the
VERSION_SEPARATOR when referring to the separator. It
is defined in the header file
fs.h which is also accessible
for user space code.
The task of separating the version number from the file name is
performed by the function
Starting from the last character of file name this function scans
the string backwards for the first occurrence of the version
separator. The string left of the place of finding will be
interpreted as the base name meanwhile the the string on the right
side is converted to an integer.
The following table lists valid and invalid input strings for this function.
For convenience reasons it should be possible to open the latest
file without specifying any version number. An easy way to support
this feature is to use a link that always points to the latest
version. Project VVFS uses soft links for this purpose meanwhile
hard links could also fulfill this task. Figure 4.1 shows three
versions of the file
foobar and a soft link pointing to the
In order to keep this link functional it has to be updated by the file system if the latest version was deleted or a new version was created. This seems to be a simple task. You could think that decrementing or incrementing the version number would fully suffice. But if the second largest version has been deleted in the meantime or the version which was altered was not the latest one this strategy would fail.
This problem is solved by the function
scans the whole directory for files with the same base name. For
this task it uses the function
readdir together with the
version_on_file_found. For each directory entry the
handler is invoked. Instead of just adding the file name to a
version_on_file_found separates the version number
from the base name using the previously mentioned function
version_separate. If the name matches the target name it
compares the version number with previously found version numbers
which are stored in the struct
name_and_version. In this
way the largest, second largest and smallest number as well as the
over all found versions are stored. The last two values are needed
to check whether the version limit (DirMaxV) is exceeded. In this
case the oldest version must be removed.
Figure 4.2 gives an overview of all functions involved. In this diagram every frame represents a function call. If further functions are invoked inside the current one they appear indented. Sequential calls are aligned at the same column. Functions which originate from the vanilla kernel are colored green whereas new introduced functions are colored yellow.
The idea behind the copy-on-write approach is that whenever changes are written to a file a new version is created without modifying the current one. In order to enable the Linux kernel to support such a behavior it is clear that some file system related system calls have to be modified.
The first candidate which seems suitable is
system call is invoked on a opened file descriptor together with a
buffer containing the bytes to be written. The drawback of this
approach is that
sys_write might be invoked several times
for one write back process if the amount of data exceeds the buffer
size. This would lead to many different versions and therefor this
approach has been discarded.
The second thought was to use the VMS approach which creates a new
version if a file is closed. In Linux this is done by the system
sys_close. The disadvantage of this strategy is that a
temporary file is needed to absorb all intermediate changes in order
to prevent the original version from being altered. This might be
appliable for an operating system supporting such a feature from the
very beginning. But in Linux the user mode applications assume that
they work on a single and complete file.
A workaround for this problem is to modify the system call
sys_open as well. It could create the temporary file which
would later be renamed to the new version by
this scenario the renaming of the temporary file can be save if the
open call itself is creating the new version. This thought leads to
the finally chosen approach which passes the task of creating a new
sys_open. If it is invoked in combination with the
write flag a file descriptor of a new version is returned instead of
the original addressed one.
Figure 4.3 shows the main functions of the modified system call. At
first an available file descriptor is determined by
get_unused_fd. Afterwards the kernel specific
representation of a file in terms of a
file struct is
filp_open. Finally the
file struct and
the descriptor are tight up and stored process specific by
filp_open all the major work is done. Initially the
open_namei is called which uses the function
lookup_hash to resolve a
nameidata structure for
the given file name. Then
may_open is invoked. Aside from
checking permission it calls the function
the flushing of the file is requested. This is the first place which
had to be modified since with versioning switched on the original
must not be modified. On this account the flag
may_open is invoked. In order to still
fulfill the request of an empty file in this case no content is
copied from the original file to the new version.
By default the second function called by
dentry_open. Its task is to return the important pointer to
the previously mentioned
file struct. This function was
version_dentry_open which contains major parts
of the implementation of VVFS.
version_latestfile is invoked which uses
version_concatenate to determine
the file name for the new version. Afterwards a new directory entry
is created by
called twice to get a file pointer for the old and the new version.
Subsequently these pointers are passed to
which copies the data (if the truncate flag was not specified).
In closing, the version link is updated and the oldest version is
deleted if necessary. For the assembling of the name of the oldest
version_delete can make use of the
name_and_version struct which was filled by
Many tools and applications are using files or buffers to store temporary changes. One observable behavior is to work on a temporary file and renaming it later on to the original addressed file. In this case the creation of new versions is avoided and the previous file is overwritten with new data.
Another scenario, which can be observed in combination with the
VIM, is the usage of buffers. At first the file is
opened in read mode and copied into a buffer. All changes are made
to this buffer. When the file is saved the original file is renamed
for backup reasons. Then a new file with the original name is opened
in write mode and the buffer is copied to this file. In this case
all information about the previous version is lost during the rename
In order to respond to these issues a copy-on-rename mechanism has been implemented.
Figure 4.4 displays parts of the system call
which is responsible for the rename process. Given two file names
sys_rename uses the function
determine the corresponding directory entries. With the help of
these entries it is possible to decide whether the directory, each
file name belongs to, supports versioning or not.
If the source directory has this feature the rename call must be
prevented from unlinking the source file. This task is handled by
the new function
version_src_rename. If only the
destination supports versioning a new version must be determined for
the target file name. If neither the source nor the destination
directory has enabled version support the unmodified
vfs_rename is used.
version_src_rename initially opens the source
file. If the destination directory has no version support the target
file name can be opened as well. Otherwise a new version number must
be determined. This task is similar to the
procedure with the exception that no content from a previous version
is copied. Because of this similarity the function
version_dentry_open can be used again. If two opened file
pointers exist the content of the source can be copied to the
version_dst_rename copies no data. Instead
only a new version is determined by the known function
version_latest_file. Then the work can be transfered to
vfs_rename. Only the version link has to be updated
In order to enable the user to control and maintain the versioning file system two user mode tools have been developed.
4.6.1 Manage the maximum number of version --
The purpose of
dirmaxv is to read and set the maximum
number of file versions allowed for a directory. It may be called
the following way:
dirmaxv [-s number] <directory>
It is mandatory to specify the directory
dirmaxv is invoked
on. If the optional parameter
-s together with a number is
passed this value will be set for the new version limit. Otherwise
the current value of DirMaxV is returned.
The implementation of this tool is rather simple because all it does
is to call the three standard library functions
close. The first one returns a file
descriptor for the specified directory. Then the system call
sys_ioctl is invoked together with the parameter
FISETMAXVERSIONS depending on
whether DirMaxV should be read or set. Finally the file descriptor
is released by the
4.6.2 Cleaning up with
While working with a versioning file system new files are created
and older ones are deleted. This leads to increasing and fragmented
version numbers. In order to clean up a directory the command
purge (known from VMS) has been implemented.
purge [[-k keep versions] | [-d delete versions]]
Purge is invoked on a file name or more precisely on a base name of
a file. It fulfills two task. First, it deletes all unnecessary
versions. Second, it resets the version numbers of all remaining
versions beginning with
0 and the oldest file. All gaps are
purge is called with no further parameters all versions
except the latest one are deleted. The parameter
specifies how much versions should remain. With the help of
-d a particular version or a range of versions can be
marked for removal.
The implementation uses a three phase approach. In the fist step all
version numbers of a file are collected in a double linked list.
Every new item is inserted in order of its value. Then this list is
iterated downwards or upwards depending on whether
versions should be kept or
d versions should be deleted. In
this step all unwanted versions are deleted. The last step resets
the version numbers. This is done by iterating the list upwards and
incrementing a counter variable.
Figure 4.5 shows the execution of purge on the three versions of the
foobar. After the call the latest two versions remain
with new assigned version numbers.
This chapter describes the performance evaluation of the versioning
extensions. The focus of the investigation lies in the time needed
by the system call
open to create a new file version. This
mainly depends on the number of files within the current directory
and the file size.
All measurements have been executed on a Intel Pentium 4 CPU with 3.06GHz and 512 KB Level 2 Cache. Furthermore the test machine was equipped with 512 MB RAM and a 80 GB SATA hard disk. The kernel used was build from the 2.6.12 vanilla sources with and without the VVFS patch.
All unneeded applications and especially daemons like Cron and SSH had been shut down. For every test case 24 recurrences have been made. The first and second result as well as the two worst ones have been discarded. So the final result was computed from the average of the remaining 20 values.
Whenever a new file version is created the whole directory is scanned for files with the same base name in order to determine the latest version number. The first test addresses that problem and evaluates the performance development in relation to the number a files present in a directory.
Both diagrams are representing the same data with the difference that the first one (5.1a) uses a logarithmic scaled axis for the latency values.
The magenta line in 5.1a shows the latency measured with the vanilla
kernel without any modifications. The average value lies slightly
below one microsecond (around 0.8). This value is constant because
the test can make use of the directory cache. The blue line
represents the VVFS extended kernel with disabled versioning for the
current directory (
DirMaxV=0). Hereby 1.2 microsecond is
the average value. The difference between both lines results from
the time needed to determine the value of
The green and the red graph are showing the behavior of a VVFS patched kernel with enabled versioning. In addition to the green graph the red one represents the case that with the creation of the new version the version limit is exceeded. Therefor the oldest version has to be deleted. Figure 5.1b shows that both graphs are growing linear. If the directory contains more than 10,000 entries the latency exceeds three milliseconds.
The reason for such a bad scaling factor is that the processing of each entry involves two string copies, one string compare, searching within a string and the transformation of a string to an integer.
Nevertheless the red graph is growing just slightly more than the green one. That is because with the determination of the latest version the oldest one is collected as well.
With the creation of a new version the content of the current
version must be copied. Therefore the latency of the
system call is growing with an increasing file size.
Figure 5.2 compares the time needed for a user mode copy process
(red graph) with the
open call in the VVFS kernel without
DirMaxV (green graph). With a increasing number
of copied blocks the VVFS graph gains a performance advantage. The
reason for this behavior is the increasing number of context
switches needed for the user mode test.
The performance evaluations have shown that it is very expensive to determine the latest file by searching the directory for all available versions. With an increasing number of directory entries the latency is growing linear. On the other hand this tests have proven that coping data in kernel space is slightly faster than doing the job out of the user mode.
Nevertheless the performance issue has to be addressed otherwise the
execution of the system call
open with the write flag will
bother the kernel thread for a while. On Non-SMP machines
CONFIG_SMP=n) with disabled kernel preemption
CONFIG_PREEMPT=n) the whole system will block during the
This chapter briefly outlines future work and summerize the current project state.
The state of project VVFS is pre alpha and can be considered as a proof of concept implementation. The following itemization addresses major topics of future development:
- Thread safety - The current implementation can not be considered as thread safe.
- Performance - Right now the whole directory has to be iterated in order to determine the smallest and largest version number. Future implementations should make use of meta data structures.
- Incremental versions - Instead of creating a plain copy whenever a new version is constructed only the differences between two versions should be stored.
The virtual file system of the Linux kernel has been extended by a
copy-on-write versioning mechanism based on the model of VMS. The
main changes include the introduction of the
FISETMAXVERSION as well as
major modifications of system calls like
sys_unlink and their subroutines.
A new file version is created if
sys_open is invoked
together with the write flag. The latest version is represented by a
soft link. If the specified version limit is exceeded by the new
version the oldest one is deleted automatically. Aside from kernel
modifications two user mode tools have been developed to support the
user dealing with this versioning file system.
VVFS was proved and tested with some applications like VIM and The Gimp.
You can download the kernel patch from SourceForge.net .
A complete description of the project (Semesterarbeit) is available in german.
- [3DFS] D.G. Korn and E. Krell. The 3-D File System. In Proc. of the USENIX Summer Conference, pp. 147-156, 1989.
- [CFS] D. K. Gifford, R. M. Needham and M. D. Schroeder. The Cedar File System. Communication of the ACM, pp. 288-298, 1988.
- [ElephantFS] D. S. Santry, M. J. Feeley, N. C. Hutchinson, A. C.Veitch, R.W. Carton and J. Ofir. Deciding When to Forget in the Elephant File System. Proceedings of the 17th ACM Symposium on Operating Systems Principles. ACM Press. pp. 110-123, 1999