Copy-on-Write Version Support for VFS under Linux

Stephan Müller und Sven Widmer

Abstract

The goal of project VVFS is to provide versioning support for regular files in the VFS layer of the Linux kernel. A new file version is created if a file is opened with the write flag. Aside from kernel mode changes, two user tools have been developed.

This paper briefly describes different appendages in version management and identifies related implementations. Subsequently a short introduction into the Linux file system architecture is given followed by a detailed description of changes made to the Linux kernel. Finally the performance of the implementation is evaluated and future work is identified.

Table of Contents

  1. Introduction
  2. Related Work
  3. Linux File System Architecture
  4. Implementation
    1. DirMaxV
    2. Version number
    3. Representation of the latest version
    4. Copy-On-Write (COW)
    5. Copy-On-Rename
    6. User mode tools
  5. Performance
    1. Environment and test setup
    2. Number of file entries
    3. File size
    4. Conclusion
  6. Conclusion and Future Work
    1. Future Work
    2. Conclusion
  7. Download
  8. Literatur

Introduction

Whenever it comes to the modification of files, it might be desirable to recover old versions or to trace the process of modification. This can be archived by powerful tools like CVS and Subversion running within the userspace or within the file system itself.

Within the file system category the following subtypes can be distinguished:

  • Explicit - With the help of special tools different file versions are maintained in cooperation with the user. E.g. 3DFS provides the commands checkin and checkout for version management meanwhile CFS implements a client/server approach in which a new version is created if the users uploads a file to the server.
  • Snapshot/Checkpoint - Generally used by backup systems this method creates an image of the file system representing its state at a given point of time. Later, snapshots can be used to restore all data up to this moment. Examples implementing this versioning type are AFS and Plan9.
  • Copy-On-Write (COW) - Instead of writing modifications to the original file a copy of this file is created. All changes are applied to this new version. File systems using this type are e.g. Files-11 (OpenVMS) and the Elephant file system.

The advantage of the COW approach implemented in the file system is that every program can make use of it. No special interface is needed because a normal write access creates a new version. This method is also transparent for the user as he has not to initiate the process of creating a new version. In contrast to the snapshot mechanism not all files have to be restored - in fact one particular version can be accessed again.

top

Related Work

The concepts used in this paper have their origins in the Files-11 file system family of the operating system VMS. The following section describes some basic features of the Files-11 file system semantic which served as a template for project VVFS.

In VMS a file name consists of a name, a type identifier and a version number. The version is separated by a semicolon (e.g. fobar.txt;42). VMS supports up to 32767 different file versions. This number can be limited by a directory property.

VMS provides the Distributed Command Language (DCL) for user interaction. A directory for instance is created with the DCL command CREATE/DIRECTORY. If the additional qualifier /VERSION_LIMIT is specified, the directory is created with this limit otherwise the version limit is inherited from the parent directory. Additionally this property can be altered with the command SET FILE/VERSION_LIMIT later. If the version limit is set to -0 versioning of files is disabled.

If the user creates a new file VMS will assign the version number 1. Before changes of a file are written to the storage media the largest version number is determined and incremented by one. Afterwards, a new file with this version number is created and filled with all data from the previous version and the changes made in the meantime. This behavior corresponds to the COW semantic discussed in Chapter 1.

When the user wants to access a file and he specifies no particularly version number the latest version is opened. If a intermediate version is modified and saved it will become the newest version with the largest number.

During the modification process of a file the version number is always incremented. If the adjusted version limit is exceeded the oldest version will be deleted. This leads to large version numbers and a unused lower domain.

For maintenance reasons the DCL command PURGE has been provided. With the help of this tool the user can remove older file versions. Additionally every call of PURGE reassigns the remaining versions beginning with 1.

top

3. Linux File System Architecture

In order to allow a clear classification of the changes made to the Linux kernel this chapter outlines the Linux file system architecture.

VFS interface
Figure 3.1 - VFS interface

Figure 3.1 shows the basic components responsible for the file management in the Linux OS. The upper region displays the user mode and the lower area the kernel mode. At the very top of the picture applications use the standard library libc to access files. The library maps these request to system calls in order to enter the kernel mode.

The gateway for all file related operations is the Virtual File System (VFS). It provides an interface between the system libraries and the specific file system implementations like Ext3, ReiserFS and NFS. Thereby VFS serves not only as a abstraction layer, but in fact it provides a basic implementation of a file system which can be used and extended by different implementations. Additionally the VFS implements optimization methods like the directory entry cache which buffers recently accessed directory structures in order to reduce the latency of a path resolution.

top

4. Implementation

Starting from features present in VMS, this chapter identifies different options of a copy-on-write based file system implementation for the Linux kernel.

4.1 DirMaxV

Like in VMS it should be possible to specify the maximum number of versions allowed for a file on a per directory basis. This implies two tasks. Where should this meta information be stored and through what interface it can be altered?

4.1.1 Storage of DirMaxV

There are different ways to store DirMaxV as a directory specific property. One option is to maintain the number in a separate file. This file should be hidden to the user. This implies changes to the function readdir which displays all files of a directory. Additionally a unique file name must be chosen and naming conflicts with other files have to be prevented.

Another possible solution is to store DirMaxV in the inode of each directory. For compatibility reasons this step is only an option if there is reserved memory within the inode struct. So this approach depends on the specific file system implementation.

Because both options have drawbacks and a generic meta data management is not available the decision of the storage location of DirMaxV is left to the underlying file system. Therefor, the VFS inode operation struct is extended by the two function pointers getmaxversions and setmaxversions. Every file system which wants to support versioning has to implement these functions.

For Ext2 this has been done exemplary. The Ext2 inode struct defines four reserved bytes for the Linux OS. Two of them have been used for storing DirMaxV. The variable is called l_i_max_versions. The other two bytes remain reserved (l_i_reserved2).

4.1.2 Interface for the management of DirMaxV

The maximum number of file versions must be adjustable by the user. Therefor, an interface between user and kernel space is needed.

A possible solution is to introduce two new system calls for setting and getting this value. Every system call corresponds to a unique number. This association is maintained in the header file include/linux/syscalls.h. The disadvantage of extending the list of system calls is that this affects the hole kernel although the changes are just file system related. Furthermore as long as this modification is not included in the mainstream kernel the new system calls can collide with other projects also introducing new system calls.

A better solution is to use the system call ioctl which was primarily introduced for device operations. It takes at least two parameters, a file descriptor on which the operation is invoked and the requested operation itself. Optional Parameters can be used to pass data to kernel space and receive results.

The VVFS project introduces the ioctl constants FIGETMAXVERSIONS and FISETMAXVERSIONS. Both operations require a third parameter of the type int which takes the desired value if DirMaxV is set. Respectively it returns the current value of DirMaxV if it is requested.

Within the kernel space the function vfs_ioctl was extended. It makes use of the previously mentioned function pointers getmaxversions and setmaxversion. In the case of Ext2 the functions ext2_set_max_versions and ext2_get_max_versions have been registered to this pointers.

Another point is the security issue. Of cause only authorized users should be allowed to read and write DirMaxV. The normal approach would be to check if the file was opened with write permissions before granting change requests. In this case the file is a directory and directories can only be opened with read access because they are only modified by altering the files they contain.

Our solution incluces a function generic_permission. In conjunction with the inode of the directory and the access mask MAY_WRITE this functions checks whether the user is capable of performing modifications.

4.2 Version number

If different versions of a file should be maintained the question is how to distinguish between these versions and how they are presented to the user.

Like VMS this project stores the version number in the file name itself. The advantage of this approach is that no meta data or hidden files are needed. All versions are visible and accessible via the file name. The drawback is that a unique character has to be chosen which separates the version number from the base name.

VMS uses the semicolon as separator. In UNIX-like operating systems this character is used for separating different commands entered successively in one line. Therefor, it can not be applied. Instead the mesh character (#) was chosen. The kernel code uses the macro VERSION_SEPARATOR when referring to the separator. It is defined in the header file fs.h which is also accessible for user space code.

The task of separating the version number from the file name is performed by the function separate_name_and_version. Starting from the last character of file name this function scans the string backwards for the first occurrence of the version separator. The string left of the place of finding will be interpreted as the base name meanwhile the the string on the right side is converted to an integer.

The following table lists valid and invalid input strings for this function.

valid invalid
foo#1 foo
foo#23 foo#
foo#bar#42 foo#1a

4.3 Representation of the latest version

For convenience reasons it should be possible to open the latest file without specifying any version number. An easy way to support this feature is to use a link that always points to the latest version. Project VVFS uses soft links for this purpose meanwhile hard links could also fulfill this task. Figure 4.1 shows three versions of the file foobar and a soft link pointing to the latest version foobar#2.

Representation of the latetest file version
Figure 4.1 - Representation of the latest file version
Removal of the latest version
Figure 4.2 - Removal of the latest version

In order to keep this link functional it has to be updated by the file system if the latest version was deleted or a new version was created. This seems to be a simple task. You could think that decrementing or incrementing the version number would fully suffice. But if the second largest version has been deleted in the meantime or the version which was altered was not the latest one this strategy would fail.

This problem is solved by the function version_find. It scans the whole directory for files with the same base name. For this task it uses the function readdir together with the handler version_on_file_found. For each directory entry the handler is invoked. Instead of just adding the file name to a buffer, version_on_file_found separates the version number from the base name using the previously mentioned function version_separate. If the name matches the target name it compares the version number with previously found version numbers which are stored in the struct name_and_version. In this way the largest, second largest and smallest number as well as the over all found versions are stored. The last two values are needed to check whether the version limit (DirMaxV) is exceeded. In this case the oldest version must be removed.

Figure 4.2 gives an overview of all functions involved. In this diagram every frame represents a function call. If further functions are invoked inside the current one they appear indented. Sequential calls are aligned at the same column. Functions which originate from the vanilla kernel are colored green whereas new introduced functions are colored yellow.

4.4 Copy-On-Write (COW)

The idea behind the copy-on-write approach is that whenever changes are written to a file a new version is created without modifying the current one. In order to enable the Linux kernel to support such a behavior it is clear that some file system related system calls have to be modified.

The first candidate which seems suitable is sys_write. This system call is invoked on a opened file descriptor together with a buffer containing the bytes to be written. The drawback of this approach is that sys_write might be invoked several times for one write back process if the amount of data exceeds the buffer size. This would lead to many different versions and therefor this approach has been discarded.

The second thought was to use the VMS approach which creates a new version if a file is closed. In Linux this is done by the system call sys_close. The disadvantage of this strategy is that a temporary file is needed to absorb all intermediate changes in order to prevent the original version from being altered. This might be appliable for an operating system supporting such a feature from the very beginning. But in Linux the user mode applications assume that they work on a single and complete file.

A workaround for this problem is to modify the system call sys_open as well. It could create the temporary file which would later be renamed to the new version by sys_close. In this scenario the renaming of the temporary file can be save if the open call itself is creating the new version. This thought leads to the finally chosen approach which passes the task of creating a new file to sys_open. If it is invoked in combination with the write flag a file descriptor of a new version is returned instead of the original addressed one.

The modified system call sys_open
Figure 4.3 - The modified system call sys_open

Figure 4.3 shows the main functions of the modified system call. At first an available file descriptor is determined by get_unused_fd. Afterwards the kernel specific representation of a file in terms of a file struct is returned by filp_open. Finally the file struct and the descriptor are tight up and stored process specific by fd_install.

In filp_open all the major work is done. Initially the function open_namei is called which uses the function lookup_hash to resolve a nameidata structure for the given file name. Then may_open is invoked. Aside from checking permission it calls the function do_truncate if the flushing of the file is requested. This is the first place which had to be modified since with versioning switched on the original must not be modified. On this account the flag O_TRUNC is removed before may_open is invoked. In order to still fulfill the request of an empty file in this case no content is copied from the original file to the new version.

By default the second function called by filp_open is dentry_open. Its task is to return the important pointer to the previously mentioned file struct. This function was replaced by version_dentry_open which contains major parts of the implementation of VVFS.

At first version_latestfile is invoked which uses version_find and version_concatenate to determine the file name for the new version. Afterwards a new directory entry is created by open_namei. Then dentry_open is called twice to get a file pointer for the old and the new version. Subsequently these pointers are passed to version_copy which copies the data (if the truncate flag was not specified).

In closing, the version link is updated and the oldest version is deleted if necessary. For the assembling of the name of the oldest version version_delete can make use of the name_and_version struct which was filled by version_find previously.

4.5 Copy-On-Rename

Many tools and applications are using files or buffers to store temporary changes. One observable behavior is to work on a temporary file and renaming it later on to the original addressed file. In this case the creation of new versions is avoided and the previous file is overwritten with new data.

Another scenario, which can be observed in combination with the editor VIM, is the usage of buffers. At first the file is opened in read mode and copied into a buffer. All changes are made to this buffer. When the file is saved the original file is renamed for backup reasons. Then a new file with the original name is opened in write mode and the buffer is copied to this file. In this case all information about the previous version is lost during the rename process.

In order to respond to these issues a copy-on-rename mechanism has been implemented.

Copy-On-Rename (overview)
4.4a - Copy-On-Rename (overview)
Copy-On-Rename (detail)
4.4b - Copy-On-Rename (detail)

Figure 4.4 displays parts of the system call sys_rename which is responsible for the rename process. Given two file names sys_rename uses the function do_rename to determine the corresponding directory entries. With the help of these entries it is possible to decide whether the directory, each file name belongs to, supports versioning or not.

If the source directory has this feature the rename call must be prevented from unlinking the source file. This task is handled by the new function version_src_rename. If only the destination supports versioning a new version must be determined for the target file name. If neither the source nor the destination directory has enabled version support the unmodified vfs_rename is used.

The function version_src_rename initially opens the source file. If the destination directory has no version support the target file name can be opened as well. Otherwise a new version number must be determined. This task is similar to the sys_open procedure with the exception that no content from a previous version is copied. Because of this similarity the function version_dentry_open can be used again. If two opened file pointers exist the content of the source can be copied to the destination by version_copy.

In contradiction version_dst_rename copies no data. Instead only a new version is determined by the known function version_latest_file. Then the work can be transfered to vfs_rename. Only the version link has to be updated afterwards.

4.6 User mode tools

In order to enable the user to control and maintain the versioning file system two user mode tools have been developed.

4.6.1 Manage the maximum number of version -- dirmaxv

The purpose of dirmaxv is to read and set the maximum number of file versions allowed for a directory. It may be called the following way:

dirmaxv [-s number] <directory>

It is mandatory to specify the directory dirmaxv is invoked on. If the optional parameter -s together with a number is passed this value will be set for the new version limit. Otherwise the current value of DirMaxV is returned.

The implementation of this tool is rather simple because all it does is to call the three standard library functions open, ioctl and close. The first one returns a file descriptor for the specified directory. Then the system call sys_ioctl is invoked together with the parameter FIGETMAXVERSIONS and FISETMAXVERSIONS depending on whether DirMaxV should be read or set. Finally the file descriptor is released by the close routine.

4.6.2 Cleaning up with purge

While working with a versioning file system new files are created and older ones are deleted. This leads to increasing and fragmented version numbers. In order to clean up a directory the command purge (known from VMS) has been implemented.

purge [[-k keep versions] | [-d delete versions]] <file>

Purge is invoked on a file name or more precisely on a base name of a file. It fulfills two task. First, it deletes all unnecessary versions. Second, it resets the version numbers of all remaining versions beginning with 0 and the oldest file. All gaps are removed.

If purge is called with no further parameters all versions except the latest one are deleted. The parameter -k specifies how much versions should remain. With the help of -d a particular version or a range of versions can be marked for removal.

The implementation uses a three phase approach. In the fist step all version numbers of a file are collected in a double linked list. Every new item is inserted in order of its value. Then this list is iterated downwards or upwards depending on whether k versions should be kept or d versions should be deleted. In this step all unwanted versions are deleted. The last step resets the version numbers. This is done by iterating the list upwards and incrementing a counter variable.

Example of a purge call
Figure 4.5 - Example of a purge call

Figure 4.5 shows the execution of purge on the three versions of the file foobar. After the call the latest two versions remain with new assigned version numbers.

top

5. Performance

This chapter describes the performance evaluation of the versioning extensions. The focus of the investigation lies in the time needed by the system call open to create a new file version. This mainly depends on the number of files within the current directory and the file size.

5.1 Environment and test setup

All measurements have been executed on a Intel Pentium 4 CPU with 3.06GHz and 512 KB Level 2 Cache. Furthermore the test machine was equipped with 512 MB RAM and a 80 GB SATA hard disk. The kernel used was build from the 2.6.12 vanilla sources with and without the VVFS patch.

All unneeded applications and especially daemons like Cron and SSH had been shut down. For every test case 24 recurrences have been made. The first and second result as well as the two worst ones have been discarded. So the final result was computed from the average of the remaining 20 values.

5.2 Number of file entries

Whenever a new file version is created the whole directory is scanned for files with the same base name in order to determine the latest version number. The first test addresses that problem and evaluates the performance development in relation to the number a files present in a directory.

Latency in dependency of the number of directory entries (linear)
Figure 5.1a - Latency in dependency of the number of directory entries (linear)
Latency in dependency of the number of directory entries (logarithmic)
Figure 5.1b - Latency in dependency of the number of directory entries (logarithmic)

Both diagrams are representing the same data with the difference that the first one (5.1a) uses a logarithmic scaled axis for the latency values.

The magenta line in 5.1a shows the latency measured with the vanilla kernel without any modifications. The average value lies slightly below one microsecond (around 0.8). This value is constant because the test can make use of the directory cache. The blue line represents the VVFS extended kernel with disabled versioning for the current directory (DirMaxV=0). Hereby 1.2 microsecond is the average value. The difference between both lines results from the time needed to determine the value of DirMaxV through the function getmaxversions.

The green and the red graph are showing the behavior of a VVFS patched kernel with enabled versioning. In addition to the green graph the red one represents the case that with the creation of the new version the version limit is exceeded. Therefor the oldest version has to be deleted. Figure 5.1b shows that both graphs are growing linear. If the directory contains more than 10,000 entries the latency exceeds three milliseconds.

The reason for such a bad scaling factor is that the processing of each entry involves two string copies, one string compare, searching within a string and the transformation of a string to an integer.

Nevertheless the red graph is growing just slightly more than the green one. That is because with the determination of the latest version the oldest one is collected as well.

5.3 File size

With the creation of a new version the content of the current version must be copied. Therefore the latency of the open system call is growing with an increasing file size.

Latency in dependency of the file size
Figure 5.2 - Latency in dependency of the file size

Figure 5.2 compares the time needed for a user mode copy process (red graph) with the open call in the VVFS kernel without exceeding DirMaxV (green graph). With a increasing number of copied blocks the VVFS graph gains a performance advantage. The reason for this behavior is the increasing number of context switches needed for the user mode test.

5.4 Conclusion

The performance evaluations have shown that it is very expensive to determine the latest file by searching the directory for all available versions. With an increasing number of directory entries the latency is growing linear. On the other hand this tests have proven that coping data in kernel space is slightly faster than doing the job out of the user mode.

Nevertheless the performance issue has to be addressed otherwise the execution of the system call open with the write flag will bother the kernel thread for a while. On Non-SMP machines (CONFIG_SMP=n) with disabled kernel preemption (CONFIG_PREEMPT=n) the whole system will block during the system call.

top

6. Conclusion and Future Work

This chapter briefly outlines future work and summerize the current project state.

6.1 Future Work

The state of project VVFS is pre alpha and can be considered as a proof of concept implementation. The following itemization addresses major topics of future development:

  • Thread safety - The current implementation can not be considered as thread safe.
  • Performance - Right now the whole directory has to be iterated in order to determine the smallest and largest version number. Future implementations should make use of meta data structures.
  • Incremental versions - Instead of creating a plain copy whenever a new version is constructed only the differences between two versions should be stored.

6.2 Conclusion

The virtual file system of the Linux kernel has been extended by a copy-on-write versioning mechanism based on the model of VMS. The main changes include the introduction of the ioctl commands FIGETMAXVERSION and FISETMAXVERSION as well as major modifications of system calls like sys_open, sys_rename, sys_unlink and their subroutines.

A new file version is created if sys_open is invoked together with the write flag. The latest version is represented by a soft link. If the specified version limit is exceeded by the new version the oldest one is deleted automatically. Aside from kernel modifications two user mode tools have been developed to support the user dealing with this versioning file system.

VVFS was proved and tested with some applications like VIM and The Gimp.

7. Download

You can download the kernel patch from SourceForge.net .

A complete description of the project (Semesterarbeit) is available in german.

8. Literatur

  • [3DFS] D.G. Korn and E. Krell. The 3-D File System. In Proc. of the USENIX Summer Conference, pp. 147-156, 1989.
  • [CFS] D. K. Gifford, R. M. Needham and M. D. Schroeder. The Cedar File System. Communication of the ACM, pp. 288-298, 1988.
  • [ElephantFS] D. S. Santry, M. J. Feeley, N. C. Hutchinson, A. C.Veitch, R.W. Carton and J. Ofir. Deciding When to Forget in the Elephant File System. Proceedings of the 17th ACM Symposium on Operating Systems Principles. ACM Press. pp. 110-123, 1999