This is an unpublished internal group memo by Alan Bawden discussing the specification for the Time Capsule File System in further detail.
The Time Capsule File System
============================

Introduction
============

The goal is to put files in a format so that they may be preserved indefinitely. To achieve this goal, the problems to be addressed are actually very similar to the problems faced in designing the Internet:

In the Internet domain, the fundamental problem is how to communicate information from place to place in a heterogeneous network environment. To solve this problem, the Internet Protocol (IP) specifies some simple abstractions (IP packets, IP addresses, etc.) that can be supported on almost any network hardware between hosts running very different operating systems.

In designing the Time Capsule File System (TCFS), the problem is how to communicate information through -time-, rather than through space. This is an even more heterogeneous environment than IP must cope with, as we can only -guess- what kinds of equipment the people in the future will be using. To solve this problem, TCFS specifies some simple abstractions that we believe will be easy to support far into the future.

The basic abstraction is the Time Capsule File (TCF). A TCF is a sequence of bytes that contains a file bundled together with all of the other information relevant to that file, such as the file's name, creation date, author, the machine it was "dumped" from, when it was dumped, etc. TCFs also contain their own length and a checksum so that they make minimal demands on the storage media they may have to inhabit during their journey into the future.

Design Goals:

  * easy to recognize the parts of the format
  * extensible when we wish to add new fields

TCFS resembles Unix "ar" and "tar" format -- except without the bugs. You concatenate TCFs together to form time capsules.
Each TCF is a self-contained entity -- if a collection of TCFs is created at the same time to form an archive of a file system hierarchy, each individual TCF contains all the information necessary to reconstruct the context it was taken from.

We only assume that file systems of the future will be capable of storing sequences of 8-bit bytes. We are biased towards systems that store characters in those bytes using the ASCII encoding, but nothing depends on the survival of ASCII. We are also biased towards English. (Future digital historians will probably have to be familiar with ASCII and English, even if they don't use them themselves.)

If you prepare a large time capsule (by writing a tape, for example), you should include the most recent copy of this document. Including some relevant source code (which?) would be good too. Rosetta stones for ASCII and English should also be designed and included.

It is recommended that TCFs -not- be subjected to any form of data compression without taking steps to ensure that the decompression algorithm will remain known. (Will future digital historians know the format used by Unix's `compress' program?)

What happens if the concept of a "file" stops being meaningful?

Fields
======

Each TCF is made up of a sequence of fields. Each field contains a name and an associated value. Some fields specify some attribute of the enclosed file. (For example, the "Written" field records the date the file's contents were last modified, and the "Data" field contains the file's actual contents.) Other fields specify attributes of the TCF itself. (For example, the "TCF-Length" field records the length of the TCF, and the "TCF-User" field records the name of the person who created the TCF from the file.)

Comparison of field names, and other tokens called for in this document, is case insensitive. (Of course things like file names from case sensitive file systems must remain case sensitive.)
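The case-insensitivity rule for field names can be sketched as follows. This is only an illustration, assuming field names are held as ASCII byte strings; the function names are hypothetical and are not part of the TCF specification. Field values (file names, data) are deliberately left untouched.

```python
def canonical_field_name(name: bytes) -> bytes:
    """Fold ASCII letters to lower case for comparison.

    bytes.lower() changes only the ASCII letters A-Z, so any
    non-ASCII bytes pass through unchanged.
    """
    return name.lower()


def same_field_name(a: bytes, b: bytes) -> bool:
    """Compare two field names without regard to case."""
    return canonical_field_name(a) == canonical_field_name(b)
```

A reader might keep its table of fields keyed by canonical_field_name(name) while storing each value exactly as it was read.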
In the grammar that follows, all characters are explicitly taken to be ASCII. This is different from the normal grammars you read that describe programming languages or mail headers (to name two examples). In those grammars, an "A" can be an ASCII "A" or an EBCDIC "A", depending on the environment. But for the purposes of the Time Capsule File System, "A" always means ASCII "A" (101 octal). A program that runs on IBM mainframes that makes TCFs has to use ASCII for all the field names. Of course the "Name" field and the "Data" field can still -contain- pathnames and data in EBCDIC if that is appropriate.

Note that an effort has been made to keep TCFs readable text files. If someone without a clue about TCF format reads a TCF into a text editor, she should stand a pretty good chance of recovering the original data by just reading the text and guessing what everything means. In particular, if an ASCII text file is made into a TCF, then the resulting TCF will also be an ASCII text file.

There are two forms in which a field can be written.

In the first form a field consists of a field name, a colon (":"), the value of the field, and a terminating line character (-either- ^J or ^M). Note that in this case the field value cannot contain either a ^J or a ^M.

In the second form a field consists of a field name, a semicolon (";"), a sequence of decimal digits that specify the length of the field's value, a colon (":"), and finally the value of the field. Note that in this case, the field value can contain arbitrary 8-bit data, since the length is explicitly specified. In a typical TCF, only the data field is written in this form.

Some initial whitespace characters are allowed before the field name.

A complete TCF starts with a recognizable string, followed by a sequence of fields. There is no restriction on the order of the fields, except that the TCF-Length field must be the first one.

Field Syntax
------------

<Alpha>    := "A" | "B" | ... | "Z" | "a" | "b" | ...
              | "z"
<Digit>    := "0" | "1" | ... | "9"
<Alnum>    := <Alpha> | <Digit>
<Space>    := " " | ^I | ^J | ^K | ^L | ^M
<LineChar> := ^J | ^M
<NameChar> := <Alnum> | "-"
<Name>     := <NameChar>+
<Value>    := ":" (<Char> - <LineChar>)* <LineChar>
              | ";" <Digit>+ ":" <Char>^n
<Field>    := <Space>* <Name> <Value>
<TCF>      := "[ ----------------------- "
              "Begin Time Capsule File"
              " ----------------------- ]"
              <Field>*

Individual Fields
-----------------

Note that not all of these fields need be present. This catalog is as long as it is only so that we may all agree what a field means when it is encountered.

TCF-Length:<Integer>
    Length of the entire capsule, starting from the "[" in the recognition string. This field must be present and must come first.

TCF-Checksum:<Chars>
    Chosen so that the entire capsule (as measured by the TCF-Length field) has a 32-bit CRC of 0. This field must be present. Typically, this field will be last.

TCF-Date:<Date>
    The date that this TCF was made. The meanings of some other fields are necessarily relative to this date. See Date Syntax below. This field must be present.

TCF-Host:<Hostname>
    The host where this TCF was made. <Hostname> is relative to TCF-Date (see Host Names below). Other fields are relative to TCF-Host. This field must be present.

TCF-User:<Username>
    The user who made this TCF. <Username> is relative to TCF-Date and TCF-Host (see User Names below).

TCF-Type:<Type>
    What type of TCF this is. The type "Archive" means that this TCF contains a file captured from some file system in order to preserve it. "Archive" is the only type there is, for now. Someday we might put things in TCFs other than files we want to preserve. Programs that -read- TCFs must check the TCF-Type field to be certain that they properly understand the rest of the TCF. This field must be present.

Capture-Date:<Date>
    The date that the contents of this TCF were captured from their original location.
    For example, if this TCF was recovered from a dump tape, this will be the date the tape was written, while TCF-Date will be the date that the TCF was made from the tape. Defaults to TCF-Date.

Capture-Host:<Hostname>
    The host that the contents of this TCF were captured from. In the case of a dump tape, this will be the machine that wrote the tape, while TCF-Host will be the machine that read the tape and made the TCF. Defaults to TCF-Host. <Hostname> is relative to Capture-Date (see Host Names below).

Capture-User:<Username>
    The user that captured the contents of this TCF. In the case of a dump tape, this will be the hacker who wrote the tape, while TCF-User will be the hacker that read the tape and made the TCF. Defaults to TCF-User. <Username> is relative to Capture-Date and Capture-Host (see User Names below).

Archive-Title:<Title>
    If this TCF is part of a logically related set of files, the entire collection may have been given a descriptive title.

System:<Sysname>
    ITS, Multics, Tops-20, Lisp-Machine, Unix, VMS, MacOS, DOS, etc.

    TCF-Date relative, since the authority who assigns system names may change over time. Currently that authority is the same as the authority who assigns host names (see Host Names below).

    The meanings of some fields are system dependent. In particular, the System field determines the algorithm for recovering the original data from the rest of the fields in the capsule. Typically this involves interpreting the contents of the Data field.

    When the TCF-Type is Archive, this field must be present.

Tape-Info:<Info>
    If the contents of this TCF are also contained on a dump tape, this field describes that tape. The format of <Info> is system dependent, since different systems have different conventions for labeling tapes. For an ITS tape <Info> contains:

      <Name>, <Type>, <Tape#>, <Reel#>, <File#>

    Note that the naming scheme for labeling tapes is probably relative to the Capture-Host, and perhaps to the Capture-Date as well.

Tape-Location:<Text>
    Contains information about the physical location of the dump tape in the Tape-Info field, at the time the file was copied into this TCF. For example, an ITS tape stored off-site at InStar might have a Tape-Location field of "InStar MCG06465".

Root:<Pathname>
    If this TCF is part of a sub-hierarchy of a file system that was captured all at once, this is the pathname of the root of that sub-hierarchy. This must not be a relative pathname. By computing the -difference- between this pathname and the contents of the Name field, an isomorphic copy of the captured sub-hierarchy can be reconstructed later if need be. Defaults to the root of the file system on the Capture-Host.

Name:<Pathname>
    The name that the data was given at the source. The interpretation of this field is system dependent. The pathname is given in the native pathname syntax for the given system (using native quoting conventions if needed). This may -not- be a relative pathname. (See the Root field.) When the TCF-Type is Archive, this field must be present.

Created:<Date>
    Creation of the file itself. (The Mac file system has this.) If not present, taken to be the same as the Written date. See Date Syntax below.

Written:<Date>
    Creation of the -contents-. (Unix mtime, ITS cdate. Virtually every file system in the world has this.) This is the date you -really- want. See Date Syntax below. When the TCF-Type is Archive, this field must be present.

Changed:<Date>
    The last time -any- aspect of the file changed. (Unix ctime.) If not present, taken to be the same as the Written date. See Date Syntax below.

Accessed:<Date>
    The last time anybody looked at the file. (Unix atime, ITS rdate.) If not present, taken to be the same as the Changed date. See Date Syntax below.

Author:<Username>
    The person primarily responsible for the contents of the file. (Unix owner, ITS author.) See User Names below.

Reader:<Username>
    The person who last -read- this file. Tenex had this.

Group:<Groupname>
    Unix group.
    Hopefully as a name, not a number.

Mode:<Octal>
    Unix mode -- as an octal number.

Reference-Count:<Number>
    The reference count of this file or directory. (Unix has this.) Defaults to 1.

Pack:<Number>
    ITS disk pack. (Yes, ITS pack numbers were customarily written in decimal.)

Account:<String>
    Tops-20 account.

Kept-Versions:<Number>
    Tops-20 and LMFS generation retention count.

Expunge-Interval:<Seconds>
    For Tops-20 and LMFS directories. The number of seconds between automatic expungings.

Expunged:<Date>
    For Tops-20 and LMFS directories. The time when the last expunge took place. See Date Syntax below.

Flags:<Flags>
    <Flag>, <Flag>, <Flag>, ...

    Where <Flag> is one of:

      Deleted      For file systems supporting soft deletion.
      Dont-Delete  It was an error to delete this file.
      Dont-Dump    This file was not supposed to be dumped.
      Dont-Reap    This file was to be considered valuable (ITS "$" flag).
      Not-Dumped   This file had not been dumped (ITS "!" flag).
      Offline      This file had been moved to archival storage.
      Temporary    This file was marked as temporary.

Comments:<Text>
    Additional information of any form. May appear more than once.

Type:<Filetype>
    Directory, File, Link, Hard-Link, ... Defaults to File.

Byte-Size:<Number>
    The size of the bytes that make up the file (in bits). Defaults to 8.

Block-Size:<Number>
    The size of the blocks that make up the file (in bits).

Length:<Number>
    The length of the original file measured in bytes.

Blocks:<Number>
    The length of the original file measured in blocks. If this field is present, then Block-Size must also be present.

Data:<Data>
    The actual contents, encoded somehow as a sequence of bytes. The algorithm for converting this back into its original form is system dependent (and may require data from other fields), but an effort should be made to keep the data in a form where files that were considered text at the source are encoded here as readable (or close to readable) ASCII.

Link-To:<Pathname>
    If this is a link rather than an ordinary file, this is the pathname that is the target of the link. This might be a relative or an absolute pathname. This is used for both hard and soft links. In the case of a hard link, this is the name under which a previous TCF of this file was created (in the same collection).

Directory-Listing:<Listing>
    If this is a directory rather than an ordinary file, this field contains an ASCII listing of the contents of the directory in some readable fashion. The Data field may or may not also be present. The Data field, if present, contains some system-specific (perhaps binary) representation of the directory, while the Directory-Listing field contains something future historians can read without having to interpret the Data field.

Host names
----------

The naming scheme for hosts may change over time, and so host names are always relative to some date. The current (in 1992) rules are:

  1. Hosts are named using fully qualified domain names.

  2. Arpanet hosts from the days before the domain name system are named using their Arpanet host names.

  3. For machines that don't have names assigned by either of the rules above, names will be assigned by a single naming authority. Currently that authority remains with the author of this document (Alan Bawden). In the future, naming authority may be further delegated.

It is quite likely that the concept of a host will cease to be meaningful someday fairly soon. We'll cross that bridge when we come to it.

User Names
----------

The naming scheme for users may also change over time, and so user names are also date relative. Currently (1992) there is no user naming system in place that is independent of host naming, so for now a user name is always relative to some host name.

Some systems assign a number (e.g. Unix's "uid") to each user. Such numbers are allowed to appear as user names in TCFs, but this should be avoided unless it is absolutely necessary.
In fifty years it will be a lot harder to figure out who "5098" was than who "alan" was.

Date Syntax
-----------

<Date> := <Day> " " <Time> | <Day>
<Day>  := <DD> " " <Mon> " " <YYYY>
<DD>   := "1" | "2" | ... | "31"
<Mon>  := "Jan" | "Feb" | ... | "Dec"
<YYYY> := "1900" | "1901" | ...
<Time> := <HH> ":" <MM> ":" <SS> " " <Zone>
<HH>   := "00" | "01" | ... | "23"
<MM>   := "00" | "01" | ... | "59"
<SS>   := "00" | "01" | ... | "59" | "60"
<Zone> := ( "+" | "-" ) <HH> <MM> | "Local"

This is actually more likely to be portable into the future than something like number-of-seconds-since-1-Jan-1900.

Note that timezones are not named, but are always given as the offset from GMT. (Where "-0500" is EST, "-0400" is EDT, "-0800" is PST, "-0700" is PDT, "+0000" is GMT, etc.) A time zone of "Local" indicates that the time zone was unknown -- some files in some file systems may only be marked with wall clock time.

The BNF doesn't capture all the constraints. The <Day> must make sense for the <Mon> and <YYYY>. <SS> can be "60" if there is a leap second, or "59" can be skipped if there is an anti leap second. The <YYYY> cannot be abbreviated (as in "92" instead of "1992").

This format has been chosen to be similar to RFC822 date format, but it is not identical.

Checksum
--------

Each TCF is formed so that when considered as a polynomial over the integers mod two, the entire TCF is congruent to 0 modulo x^32 + x^7 + x^3 + x^2 + 1. This is done in big-endian order. That is, the very first byte in the TCF is taken to be the highest order 8 terms in the polynomial, and the very last byte is the lowest order 8 terms. The lowest order bit of the last byte gives the coefficient for the x^0 term.

(This is a fairly standard 32-bit CRC that any textbook on error detecting and correcting codes will explain in better detail than I can do here.)
(This particular 32-bit CRC was chosen not because of its error correcting properties (which are not needed for this application), but because it is particularly fast to compute one byte at a time.)
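
The congruence described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes plain polynomial remainder with no initial register value, no bit reflection, and no final XOR, since the text describes nothing beyond the congruence itself. The names POLY_LOW_TERMS, tcf_remainder, and is_congruent_to_zero are hypothetical. How the TCF-Checksum value is encoded inside the capsule is not covered here.

```python
# Terms of the modulus below x^32: x^7 + x^3 + x^2 + 1 (hex 8D).
POLY_LOW_TERMS = 0x8D


def tcf_remainder(data: bytes) -> int:
    """Remainder of data, read as a big-endian polynomial over GF(2),
    modulo x^32 + x^7 + x^3 + x^2 + 1.

    The first byte holds the highest order 8 terms; the low bit of the
    last byte is the coefficient of x^0.
    """
    reg = 0
    for byte in data:
        # Shift in the next 8 coefficients.
        reg = (reg << 8) | byte
        # The 8 coefficients pushed past x^31 must be reduced.
        top = reg >> 32
        reg &= 0xFFFFFFFF
        # x^32 is congruent to x^7 + x^3 + x^2 + 1, so top * x^32
        # reduces to top * (x^7 + x^3 + x^2 + 1).  That product has
        # degree at most 14, so one step per byte is enough.
        reg ^= top ^ (top << 2) ^ (top << 3) ^ (top << 7)
    return reg


def is_congruent_to_zero(data: bytes) -> bool:
    """True if the byte sequence satisfies the checksum condition."""
    return tcf_remainder(data) == 0
```

Because every term of the modulus below x^32 fits in the low byte, each input byte is folded in with a few shifts and XORs and no lookup table, which is one way to read the remark about computing the CRC one byte at a time. As a check on the arithmetic only: if r is the remainder of a message followed by four zero bytes, then the message followed by r, written as four big-endian bytes, has remainder 0.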