File Staging

This document gives a rough outline of a portable file staging system that works with stubs.

Every computer user will, at some point, run out of disk space. Running out of space is particularly annoying if you have files that you never use, and another system with enough space left that you are not using for some reason, e.g. because it is a different platform.

Staging is very simple: you put unused files on removable media or on another system, and replace each file with a stub. The stub is small, occupying just 1 block of space, but acts as a placeholder for the original file. The user sees that the file is present in the directory using regular 'ls'. All the usual directory management commands (cp, mv, rm, ln, chmod, chown) work as expected. The only thing the user cannot do is use the data contained in the file, as the data has been stored elsewhere, on removable media or on a different system. If the user wishes to use the file, she should issue a command to get the file from secondary storage. (There are systems that can do this automagically for the end user. This requires special hooks in the operating system, specifically in the open system call.)

fput: Put the file in secondary storage, and replace the existing file with a stub.
fget: Get the file from secondary storage, and replace the stub with its original.

Staging files off may require root privileges, e.g. if the file will be moved to tape. The file is likely to be moved to some place where its original ownership has no meaning. Generally, all staged-off data is owned by root or a datamgr user. The original ownership, as well as the file times and mode, are kept on the stub. If the owner or mode of the stub is changed, everything still works as expected, because the stub represents the original file. When the original is staged back using fget, the owner, mode, and times of the stub are applied to the restored original.
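The two commands can be sketched in a few lines of shell. This is only an illustration under several assumptions: STAGE_DIR is a hypothetical local directory standing in for tape or a remote system, the file ID is simply the cksum CRC and size glued together, and the stub holds nothing but a "File ID:" line. Both functions overwrite the file in place, so the i-node (and with it the owner, mode, and hard links) stays as it was, and touch -r puts the file times back.

```sh
#!/bin/sh
# Hypothetical staging area; stands in for tape or a remote system.
STAGE_DIR=${STAGE_DIR:-/var/spool/stage}

# fput FILE: copy FILE to the staging area, then turn FILE into a stub.
fput() (
    file=$1
    sum=$(cksum < "$file") || exit 1
    id=$(printf '%s' "$sum" | tr ' ' '-')       # "<crc>-<size>", an assumed ID scheme
    ref=$(mktemp) || exit 1
    trap 'rm -f "$ref"' EXIT
    touch -r "$file" "$ref"                     # remember the file times
    cp "$file" "$STAGE_DIR/$id" || exit 1
    cmp -s "$file" "$STAGE_DIR/$id" || exit 1   # never stub an unverified copy
    # Overwriting in place keeps the i-node, so owner and mode are untouched.
    printf 'File ID: %s\n' "$id" > "$file" || exit 1
    touch -r "$ref" "$file"                     # put the original times back
)

# fget STUB: replace STUB with its staged data, keeping the stub's metadata.
fget() (
    stub=$1
    id=$(sed -n 's/^File ID: //p' "$stub")
    [ -n "$id" ] && [ -f "$STAGE_DIR/$id" ] || exit 1
    ref=$(mktemp) || exit 1
    trap 'rm -f "$ref"' EXIT
    touch -r "$stub" "$ref"                     # the stub carries the file times
    cat "$STAGE_DIR/$id" > "$stub" || exit 1
    touch -r "$ref" "$stub"
)
```

The in-place overwrite is not atomic, and the ID read from the stub is trusted blindly; a real system would want to handle both.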

Because root privileges may be involved, many staging systems work with a daemon that runs with the appropriate privileges. The fput and fget commands merely send requests to the daemon, which handles them.

Files can be staged off to anywhere you like: another disk, tape, a WORM drive, or another system using FTP or rcp. In any case, the copying program should do checksumming to ensure that the data is transferred properly. When copying to rather unreliable media, such as floppy or tape, it would be wise to keep multiple copies. Many staging systems maintain a database to register where the copies of the data are, but it is not impossible to do this without a database. Maintaining a separate database does not really reduce the risk of data loss (!); it just does the bookkeeping, which can be convenient at times (e.g. when doing tape merges).
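A checksummed copy to several destinations might look like the sketch below. The function name copy_verified and the rule that at least two good copies must exist are assumptions made for the example; the destinations are ordinary paths here, where a real system would talk to tape drives or remote hosts.

```sh
#!/bin/sh
# copy_verified SRC DEST...: copy SRC to every DEST and print each DEST
# whose checksum matches the source. Succeeds only if at least two
# verified copies exist (an assumed policy, not a fixed rule).
copy_verified() (
    src=$1
    shift
    want=$(cksum < "$src") || exit 1
    good=0
    for dest in "$@"; do
        if cp "$src" "$dest" 2>/dev/null &&
           [ "$(cksum < "$dest")" = "$want" ]; then
            printf '%s\n' "$dest"
            good=$((good + 1))
        fi
    done
    [ "$good" -ge 2 ]
)
```

Note that reading a copy back right after writing it may be served from the operating system's cache rather than from the medium itself, so on real removable media the verification pass probably deserves more care than this.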

If a user wishes to remove her file, she issues the rm command, as usual. If she wants her file back, the stub can be restored from backup.

Over time, old backups expire, and old copies of staged data have to be cleaned up as well. To do so, the staging system runs an auditing session in which it checks whether the corresponding stubs still exist. This can only be done easily if stubs can quickly be identified as such, which in practice is only possible when the stub is something special, like:
- a file with a special extension, such as ".mig" or ".stub"
- a (dead) symbolic link, pointing to "MIGRATED: <database id>"
- an i-node containing additional information, as with XFS or NC1FS. This is non-portable.
- a regular file with an ID stored inside it (this would be very slow, since every candidate file has to be opened and read)
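For the symbolic link variant, an auditing pass could be sketched as follows. The function names live_ids and orphans are made up for the example, and it assumes that each staged copy is stored in a directory under a file name equal to its database ID. It also assumes the readlink utility is available and that no file name contains a newline.

```sh
#!/bin/sh
# live_ids ROOT: print the database ID of every symlink stub below ROOT.
live_ids() {
    find "$1" -type l | while IFS= read -r link; do
        target=$(readlink "$link")
        case $target in
            'MIGRATED: '*) printf '%s\n' "${target#MIGRATED: }" ;;
        esac
    done
}

# orphans ROOT STAGE_DIR: print the IDs of staged copies in STAGE_DIR
# that no stub below ROOT refers to any more.
orphans() (
    ids=$(live_ids "$1")
    for copy in "$2"/*; do
        [ -e "$copy" ] || continue          # empty directory: nothing to do
        id=${copy##*/}
        printf '%s\n' "$ids" | grep -qxF -- "$id" || printf '%s\n' "$id"
    done
)
```

The output of orphans is only a list of candidates. Whether a copy may really be removed still depends on whether a backup that contains its stub has expired.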

The most elegant option is the symbolic link. Be aware that you should NEVER run a command on your system that cleans up dead symbolic links, as this would ruin your staged files. The symbolic link approach requires you to keep a database on the system, containing some metadata such as the file ID, the location of the staged data, and a checksum.

If you decide to use files instead of symbolic links for stubs, I suggest you use a format similar to this:

#! /bin/sh

/bin/cat <<EOF
#####################################################################

This file has been staged off to secondary storage.
Do NOT edit or modify the stub file. Instead, use the
'fget' command to get the original file from secondary storage.

#####################################################################
EOF
exit 255

File ID: 456b9afd
Size: 6914354
Staged-to: tape1 tape2
Checksum: dcaba5c92d8fd8f60fdc2a722b6d9a7a

It might not be a bad idea to write the stub in binary format rather than ASCII, because users will then be less tempted to edit the stub.

The checksum should cover not only the original file, but also the stub content, to prevent tampering with the stub. It is wise to include a secret string in the checksum as well, to keep users from creating fake stubs.
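One way to picture this is a "seal" line appended to the stub, computed over all other stub lines plus a secret that only the staging system can read. The "Seal:" field and the function names stub_seal and stub_check are hypothetical, and the sketch assumes the sha256sum utility from GNU coreutils. Simply hashing the stub content followed by the secret is the most basic construction; a proper keyed digest such as HMAC would be the stronger choice.

```sh
#!/bin/sh
# stub_seal SECRET_FILE: read stub content on stdin and print a hex
# digest over that content followed by the secret.
stub_seal() {
    { cat; cat "$1"; } | sha256sum | cut -d' ' -f1
}

# stub_check SECRET_FILE STUB: succeed only if the "Seal:" line in STUB
# matches a seal recomputed over all other lines of the stub.
stub_check() (
    want=$(sed -n 's/^Seal: //p' "$2")
    have=$(grep -v '^Seal: ' "$2" | stub_seal "$1")
    [ -n "$want" ] && [ "$want" = "$have" ]
)
```

Because the "Checksum:" line of the original file is part of the sealed content, changing either the stub fields or the recorded checksum makes stub_check fail. The secret is read from a file rather than passed as an argument, so it does not show up in the process list.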

If staged data is stored remotely, you should think about security as well. The data transfers should at least be password-protected. Because it is not very user-friendly to have your users type in a password at every fget or fput, using credentials might be a very good option. Credentials are randomly generated temporary passwords that are kept on both the client and the server. A credential is generated after successful authentication, and expires after a given amount of time (4 to 8 hours would be a good choice). There should also be a command that lets the user destroy her credentials if she wishes.
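The client side of such a credential scheme could be sketched like this. Everything named here is an assumption for the sake of the example: the file location CRED_FILE, the lifetime CRED_TTL (8 hours), and the functions cred_new, cred_valid, and cred_destroy. The sketch leaves out the authentication step and the server, which would have to store the same token and expiry time. It relies on /dev/urandom and on date +%s, both widely available.

```sh
#!/bin/sh
# Hypothetical location and lifetime of the client-side credential.
CRED_FILE=${CRED_FILE:-$HOME/.stage_credential}
CRED_TTL=${CRED_TTL:-28800}     # 8 hours, in seconds

# cred_new: call after successful authentication. Stores a random
# token together with its expiry time, readable by the owner only.
cred_new() (
    umask 077
    token=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n') || exit 1
    expires=$(( $(date +%s) + CRED_TTL ))
    printf '%s %s\n' "$expires" "$token" > "$CRED_FILE"
)

# cred_valid: succeed if a credential exists and has not expired yet.
cred_valid() (
    [ -r "$CRED_FILE" ] || exit 1
    read -r expires token < "$CRED_FILE" || exit 1
    [ -n "$token" ] && [ "$(date +%s)" -lt "$expires" ]
)

# cred_destroy: let the user throw her credential away at any time.
cred_destroy() {
    rm -f "$CRED_FILE"
}
```

With something like this in place, fput and fget would call cred_valid first and only ask for a password when it fails.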

If you are enthusiastic and are going to employ file staging on your systems, be sure to test the system well. You will be storing lots of valuable data in the system over the years to come. Also be sure to inform your users well. It is important that they know what is happening to their files, and how they can access their data if they need it.


If you really must, you can contact the author at walter at heiho dot net