TCFS Format Specification

This is an unpublished internal group memo by Alan Bawden discussing the specification for the Time Capsule File System in further detail.
                       The Time Capsule File System
                       ============================


Introduction
============

The goal is to put files in a format so that they may be preserved
indefinitely.  To achieve this goal, the problems to be addressed are
actually very similar to the problems faced in designing the Internet:

In the Internet domain, the fundamental problem is how to communicate
information from place to place in a heterogeneous network environment.  To
solve this problem, the Internet Protocol (IP) specifies some simple
abstractions (IP packets, IP addresses, etc.) that can be supported on
almost any network hardware between hosts running very different operating
systems.

In designing the Time Capsule File System (TCFS), the problem is how to
communicate information through -time-, rather than through space.  This is
an even more heterogeneous environment than IP must cope with, as we can
only -guess- what kinds of equipment the people in the future will be
using.  To solve this problem, TCFS specifies some simple abstractions that
we believe will be easy to support far into the future.

The basic abstraction is the Time Capsule File (TCF).  A TCF is a sequence
of bytes that contains a file bundled together with all of the other
information relevant to that file, such as the file's name, creation date,
author, the machine it was "dumped" from, when it was dumped, etc.  TCFs
also contain their own length and a checksum so that they make minimal
demands on the storage media they may have to inhabit during their journey
into the future.

Design Goals: * easy to recognize the parts of the format
              * extensible when we wish to add new fields

TCFS resembles Unix "ar" and "tar" format -- except without the bugs.

You concatenate TCFs together to form time capsules.

Each TCF is a self-contained entity -- if a collection of TCFs is created
all at the same time to form an archive of a file system hierarchy,
each individual TCF contains all the information necessary to reconstruct
the context it was taken from.

We only assume that file systems of the future will be capable of storing
sequences of 8-bit bytes.  We are biased towards systems that store
characters in those bytes using the ASCII encoding, but nothing depends on
the survival of ASCII.  We are also biased towards English.  (Future
digital historians will probably have to be familiar with ASCII and
English, even if they don't use them themselves.)  

If you prepare a large time capsule (by writing a tape, for example), you
should include the most recent copy of this document.  Including some
relevant source code (which?)  would be good too.  Rosetta stones for ASCII
and English should also be designed and included.

It is recommended that TCFs -not- be subjected to any form of
data compression without taking steps to ensure that the decompression
algorithm will remain known.  (Will future digital historians know the
format used by Unix's `compress' program?)

What happens if the concept of a "file" stops being meaningful?

Fields
======

Each TCF is made up of a sequence of fields.  Each field contains a name
and an associated value.  Some fields specify some attribute of the
enclosed file.  (For example, the "Written" field records the date the
file's contents were last modified, and the "Data" field contains the
file's actual contents.)  Other fields specify attributes of the TCF
itself.  (For example, the "TCF-Length" field records the length of the
TCF, and the "TCF-User" field records the name of the person who created
the TCF from the file.)

Comparison of field names, and other tokens called for in this document, is
case insensitive.  (Of course things like file names from case sensitive
file systems must remain case sensitive.)

In the grammar that follows, all characters are explicitly taken to be
ASCII.  This is different from the normal grammars you read that describe
programming languages or mail headers (to name two examples).  In those
grammars, an "A" can be an ASCII "A" or an EBCDIC "A", depending on the
environment.  But for the purposes of the Time Capsule File System, "A"
always means ASCII "A" (101 octal).  A program that runs on IBM mainframes
that makes TCFs has to use ASCII for all the field names.  Of course the
"Name" field and the "Data" field can still -contain- pathnames and data in
EBCDIC if that is appropriate.

Note that an effort has been made to keep TCFs readable text files.  If
someone without a clue about TCF format reads a TCF into a text editor, she
should stand a pretty good chance of recovering the original data by just
reading the text and guessing what everything means.  In particular, if an
ASCII text file is made into a TCF, then the resulting TCF will also be an
ASCII text file.

There are two forms in which a field can be written.  In the first form a
field consists of a field name, a colon (":"), the value of the field,
and a terminating line character (-either- ^J or ^M).  Note that in this
case the field value cannot contain either a ^J or a ^M.

In the second form a field consists of a field name, a semicolon (";"), a
sequence of decimal digits that specify the length of the field's value, a
colon (":"), and finally the value of the field.  Note that in this case,
the field value can contain arbitrary 8-bit data, since the length is
explicitly specified.  In a typical TCF, only the data field is written in
this form.

Some initial whitespace characters are allowed before the field name.
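
The two field forms described above can be read with a very small parser.
What follows is only a sketch in Python, not part of the specification: the
name parse_fields is made up for illustration, and lower-casing the field
names is just one way to get the case-insensitive comparison called for
earlier.  It takes everything after the ":" in the first form as the value,
without trimming.

```python
import re

LINE_CHARS = b"\n\r"                  # ^J and ^M
SPACE_CHARS = b" \t\n\x0b\x0c\r"      # space, ^I, ^J, ^K, ^L, ^M
NAME_RE = re.compile(rb"[A-Za-z0-9-]+")


def parse_fields(buf, pos=0):
    """Yield (name, value) pairs from the bytes in buf, starting at pos.

    Hypothetical helper: names come back lower-cased, values as raw bytes.
    """
    n = len(buf)
    while True:
        # Initial whitespace is allowed before a field name.
        while pos < n and buf[pos] in SPACE_CHARS:
            pos += 1
        if pos >= n:
            return
        m = NAME_RE.match(buf, pos)
        if m is None:
            raise ValueError(f"expected a field name at offset {pos}")
        name = m.group().decode("ascii").lower()
        pos = m.end()
        sep = buf[pos:pos + 1]
        if sep == b":":
            # First form: the value runs up to the next ^J or ^M.
            start = end = pos + 1
            while end < n and buf[end] not in LINE_CHARS:
                end += 1
            if end >= n:
                raise ValueError(f"field {name!r} has no terminating line character")
            yield name, buf[start:end]
            pos = end + 1
        elif sep == b";":
            # Second form: decimal length, a colon, then that many raw bytes.
            colon = buf.find(b":", pos + 1)
            digits = buf[pos + 1:colon]
            if colon < 0 or not digits.isdigit():
                raise ValueError(f"field {name!r} has a malformed length")
            start = colon + 1
            length = int(digits)
            if start + length > n:
                raise ValueError(f"field {name!r} is truncated")
            yield name, buf[start:start + length]
            pos = start + length
        else:
            raise ValueError(f"expected ':' or ';' after field name {name!r}")
```

Because the second form carries its own length, the parser never inspects
the bytes of such a value, so a Data field may contain colons, line
characters, or anything else.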

A complete TCF starts with a recognizable string, followed by a sequence of
fields.  There is no restriction on the order of the fields, except that
the TCF-Length field must be the first one.

Field Syntax
------------

  <Alpha> := "A" | "B" | ... | "Z" | "a" | "b" | ... | "z"
  <Digit> := "0" | "1" | ... | "9"
  <Alnum> := <Alpha> | <Digit>
  <Space> := " " | ^I | ^J | ^K | ^L | ^M
  <LineChar> := ^J | ^M
  <NameChar> := <Alnum> | "-"
  <Name> := <NameChar>+
  <Value> := ":" (<Char> - <LineChar>)* <LineChar>
           | ";" <Digit>+ ":" <Char>^n
  <Field> := <Space>* <Name> <Value>
  <TCF> := "[  -----------------------  "
           "Begin Time Capsule File"
           "  -----------------------  ]"
           <Field>*
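
Since TCF-Length must come first and (as the field catalog below says)
counts the whole capsule starting from the "[", a program that writes TCFs
has to find a number whose own digits are included in the count.  The
Python sketch below shows one way to do that by simple iteration.  The
names encode_field and build_tcf are hypothetical, the newline written
after the recognition string relies on the leading <Space>* of the first
field, and the sketch deliberately leaves out TCF-Checksum and the other
required fields.

```python
RECOGNITION = (b"[  -----------------------  "
               b"Begin Time Capsule File"
               b"  -----------------------  ]")


def encode_field(name, value):
    """Encode one field, choosing the counted form only when it is needed."""
    head = name.encode("ascii")
    if b"\n" in value or b"\r" in value:
        # Second form: the value may hold arbitrary 8-bit data.
        return head + b";" + str(len(value)).encode("ascii") + b":" + value
    # First form: value followed by a terminating line character.
    return head + b":" + value + b"\n"


def build_tcf(fields):
    """Return recognition string + TCF-Length + the given (name, value) fields.

    Illustration only: a real capsule also needs TCF-Checksum, TCF-Date, etc.
    """
    body = b"".join(encode_field(name, value) for name, value in fields)
    # Everything except the digits of the length itself.
    base = len(RECOGNITION) + 1 + len(b"TCF-Length:") + 1 + len(body)
    total = base
    while True:
        # The digit count can only grow, so this settles after a few rounds.
        candidate = base + len(str(total))
        if candidate == total:
            break
        total = candidate
    length_field = b"TCF-Length:" + str(total).encode("ascii") + b"\n"
    return RECOGNITION + b"\n" + length_field + body
```

A fixed-width, zero-padded length would avoid the iteration, but the
grammar does not require one, so the sketch does not assume it.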

Individual Fields
-----------------

Note that not all of these fields need be present.  This catalog is as long
as it is only so that we may all agree what a field means when it is
encountered.


TCF-Length:<Integer>    Length of the entire capsule, starting from the "["
                        in the recognition string.

                        This field must be present and must come first.

TCF-Checksum:<Chars>    Chosen so that the entire capsule (as measured by
                        the TCF-Length field) has a 32-bit CRC of 0.

                        This field must be present.  Typically, this field
                        will be last.

TCF-Date:<Date>         The date that this TCF was made.
                        The meanings of some other fields are necessarily
                        relative to this date.  See Date Syntax below.

                        This field must be present.

TCF-Host:<Hostname>     The host where this TCF was made.
                        <Hostname> is relative to TCF-Date (see Host Names
                        below).  Other fields are relative to TCF-Host.

                        This field must be present.

TCF-User:<Username>     The user who made this TCF.
                        <Username> is relative to TCF-Date and TCF-Host (see
                        User Names below).

TCF-Type:<Type>         What type of TCF this is.

                        The type "Archive" means that this TCF contains a
                        file captured from some file system in order to
                        preserve it.  "Archive" is the only type there is,
                        for now.  Someday we might put things in TCFs other
                        than files we want to preserve.

                        Programs that -read- TCFs must check the TCF-Type
                        field to be certain that they properly understand
                        the rest of the TCF.

                        This field must be present.

Capture-Date:<Date>     The date that the contents of this TCF were captured
                        from their original location.  For example, if this
                        TCF was recovered from a dump tape, this will be
                        the date the tape was written, while TCF-Date will
                        be the date that the TCF was made from the tape.

                        Defaults to TCF-Date.

Capture-Host:<Hostname> The host that the contents of this TCF were captured
                        from.  In the case of a dump tape, this will be the
                        machine that wrote the tape, while TCF-Host will be
                        the machine that read the tape and made the TCF.

                        Defaults to TCF-Host.  <Hostname> is relative to
                        Capture-Date (see Host Names below).

Capture-User:<Username> The user that captured the contents of this TCF.
                        In the case of a dump tape, this will be the hacker
                        who wrote the tape, while TCF-User will be the
                        hacker that read the tape and made the TCF.

                        Defaults to TCF-User.  <Username> is relative to
                        Capture-Date and Capture-Host (see User Names below).

Archive-Title:<Title>   If this TCF is part of a logically related set of
                        files, the entire collection may have been given a
                        descriptive title.

System:<Sysname>        ITS, Multics, Tops-20, Lisp-Machine, Unix, VMS,
                        MacOS, DOS, etc.

                        TCF-Date relative, since the authority who assigns
                        system names may change over time.  Currently that
                        authority is the same as the authority who assigns
                        host names (see Host Names below).

                        The meaning of some fields is system dependent.
                        In particular, the System field determines the
                        algorithm for recovering the original data from the
                        rest of the fields in the capsule.  Typically this
                        involves interpreting the contents of the Data
                        field.

                        When the TCF-Type is Archive, this field must be
                        present.

Tape-Info:<Info>        If the contents of this TCF are also contained on a
                        dump tape, this field describes that tape.  The
                        format of <Info> is system dependent, since
                        different systems have different conventions for
                        labeling tapes.

                        For an ITS tape <Info> contains:

                          <Name>, <Type>, <Tape#>, <Reel#>, <File#>

                        Note that the naming scheme for labeling tapes is
                        probably relative to the Capture-Host, and perhaps
                        to the Capture-Date as well.

Tape-Location:<Text>    Contains information about the physical location of
                        the dump tape in the Tape-Info field, at the time
                        the file was copied into this TCF.  For example,
                        an ITS tape stored off-site at InStar might have a
                        Tape-Location field of "InStar MCG06465".

Root:<Pathname>         If this TCF is part of a sub-hierarchy of a file
                        system that was captured all at once, this is the
                        pathname of the root of that sub-hierarchy.  This
                        must not be a relative pathname.  By computing the
                        -difference- between this pathname and the contents
                        of the Name field, an isomorphic copy of the
                        captured sub-hierarchy can be reconstructed later
                        if need be.

                        Defaults to the root of the file system on the
                        Capture-Host.

Name:<Pathname>         The name that the data was given at the source.

                        The interpretation of this field is system
                        dependent.  The pathname is given in the native
                        pathname syntax for the given system (using native
                        quoting conventions if needed).

                        This may -not- be a relative pathname.  (See the
                        Root field.)

                        When the TCF-Type is Archive, this field must be
                        present.

Created:<Date>          Creation of the file itself.  (The Mac file system
                        has this.)  If not present, taken to be the same as
                        the Written date.  See Date Syntax below.

Written:<Date>          Creation of the -contents-.  (Unix mtime, ITS
                        cdate.  Virtually every file system in the world
                        has this.)  This is the date you -really- want.
                        See Date Syntax below.

                        When the TCF-Type is Archive, this field must be
                        present.

Changed:<Date>          The last time -any- aspect of the file changed.
                        (Unix ctime.)  If not present, taken to be the same
                        as the Written date.  See Date Syntax below.

Accessed:<Date>         The last time anybody looked at the file.  (Unix
                        atime, ITS rdate.)  If not present, taken to be the
                        same as the Changed date.  See Date Syntax below.

Author:<Username>       The person primarily responsible for the contents
                        of the file.  (Unix owner, ITS author.)  See User
                        Names below.

Reader:<Username>       The person who last -read- this file.  Tenex had this.

Group:<Groupname>       Unix group.  Hopefully as a name, not a number.

Mode:<Octal>            Unix mode -- as an octal number.

Reference-Count:<Number>
                        The reference count of this file or directory.
                        (Unix has this.)  Defaults to 1.

Pack:<Number>           ITS disk pack.  (Yes, ITS pack numbers were
                        customarily written in decimal.)

Account:<String>        Tops-20 account.

Kept-Versions:<Number>  Tops-20 and LMFS generation retention count.
                        
Expunge-Interval:<Seconds>
                        For Tops-20 and LMFS directories.  The number of
                        seconds between automatic expungings.

Expunged:<Date>         For Tops-20 and LMFS directories.  The time when
                        the last expunge took place.  See Date Syntax below.

Flags:<Flags>           <Flag>, <Flag>, <Flag>, ...
                        Where <Flag> is one of:

                        Deleted         For file systems supporting 
                                        soft deletion.

                        Dont-Delete     It was an error to delete this
                                        file.

                        Dont-Dump       This file was not supposed to be
                                        dumped.

                        Dont-Reap       This file was to be considered
                                        valuable (ITS "$" flag).

                        Not-Dumped      This file had not been dumped (ITS
                                        "!" flag).

                        Offline         This file had been moved to
                                        archival storage.

                        Temporary       This file was marked as temporary.

Comments:<Text>         Additional information of any form.  May appear
                        more than once.

Type:<Filetype>         Directory, File, Link, Hard-Link, ...
                        Defaults to File.

Byte-Size:<Number>      The size of the bytes that make up the file (in bits).
                        Defaults to 8.

Block-Size:<Number>     The size of the blocks that make up the file (in bits).

Length:<Number>         The length of the original file measured in bytes.

Blocks:<Number>         The length of the original file measured in blocks.
                        If this field is present, then Block-Size must also
                        be present.

Data:<Data>             The actual contents, encoded somehow as a
                        sequence of bytes.

                        The algorithm for converting this back into its
                        original form is system dependent (and may require
                        data from other fields), but an effort should be
                        made to keep the data in a form where files that
                        were considered text at the source are encoded here
                        as readable (or close to readable) ASCII.

Link-To:<Pathname>      If this is a link rather than an ordinary file,
                        this is the pathname that is the target of the
                        link.  This might be a relative or an absolute
                        pathname.

                        This is used for both hard and soft links.  In the
                        case of a hard link, this is the name under which a
                        previous TCF of this file was created (in the same
                        collection).

Directory-Listing:<Listing>
                        If this is a directory rather than an ordinary
                        file, this field contains an ASCII listing of the
                        contents of the directory in some readable fashion.

                        The Data field may or may not also be present.  The
                        Data field, if present, contains some system-
                        specific (perhaps binary) representation of the
                        directory, while the Directory-Listing field
                        contains something future historians can read
                        without having to interpret the Data field.
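
The Root entry in the catalog above speaks of computing the -difference-
between Root and Name in order to rebuild a captured sub-hierarchy
somewhere else.  For a capsule whose System is Unix, that might look like
the Python sketch below.  The function name rebase and the new_root
argument are made up for this example, and other systems would need their
own pathname handling, since Name is written in native syntax.

```python
from pathlib import PurePosixPath


def rebase(root, name, new_root):
    """Map the pathname `name`, captured under `root`, to a place under `new_root`.

    Assumes Unix pathname syntax.  Raises ValueError if `name` does not
    lie under `root`, which would mean the two fields disagree.
    """
    difference = PurePosixPath(name).relative_to(PurePosixPath(root))
    return PurePosixPath(new_root) / difference
```

Applying the same new_root to every TCF in a collection gives a copy that
is isomorphic to the original sub-hierarchy.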

Host names
----------

The naming scheme for hosts may change over time, and so host names are
always relative to some date.  The current (in 1992) rules are:

1.  Hosts are named using fully qualified domain names.

2.  Arpanet hosts from the days before the domain name system are named
    using their Arpanet host names.

3.  For machines that don't have names assigned by either of the rules
    above, names will be assigned by a single naming authority.  Currently
    that authority remains with the author of this document (Alan Bawden).

In the future, naming authority may be further delegated.

It is quite likely that the concept of a host will cease to be meaningful
someday fairly soon.  We'll cross that bridge when we come to it.

User Names
----------

The naming scheme for users may also change over time, and so user names
are also date relative.  

Currently (1992) there is no user naming system in place that is
independent of host naming, so for now a user name is always relative to
some host name.

Some systems assign a number (e.g. Unix's "uid") to each user.  Such
numbers are allowed to appear as user names in TCFs, but this should be
avoided unless it is absolutely necessary.  In fifty years it will be a lot
harder to figure out who "5098" was than who "alan" was.

Date Syntax
-----------

  <Date> := <Day> " " <Time> | <Day>

  <Day> := <DD> " " <Mon> " " <YYYY>
  <DD> := "1" | "2" | ... | "31"
  <Mon> := "Jan" | "Feb" | ... | "Dec"
  <YYYY> := "1900" | "1901" | ...

  <Time> := <HH> ":" <MM> ":" <SS> " " <Zone>
  <HH> := "00" | "01" | ... | "23"
  <MM> := "00" | "01" | ... | "59"
  <SS> := "00" | "01" | ... | "59" | "60"
  <Zone> := ( "+" | "-" ) <HH> <MM> | "Local"

This is actually more likely to be portable into the future than something
like number-of-seconds-since-1-Jan-1900.  Note that timezones are not
named, but are always given as the offset from GMT.  (Where "-0500" is EST,
"-0400" is EDT, "-0800" is PST, "-0700" is PDT, "+0000" is GMT, etc.)  A
time zone of "Local" indicates that the time zone was unknown -- some files
in some file systems may only be marked with wall clock time.

The BNF doesn't capture all the constraints.  The <DD> must make sense for
the <Mon> and <YYYY>.  <SS> can be "60" only if there is a leap second, and
"59" can be skipped if there is an anti leap second.

The <YYYY> cannot be abbreviated (as in "92" instead of "1992").

This format has been chosen to be similar to RFC822 date format, but it is
not identical.
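
To make the date grammar concrete, here is a sketch of a reader for it in
Python.  It is an illustration under some assumptions, not a reference
implementation: the name parse_tcf_date is made up, a day-only date comes
back as a date object, a "Local" time comes back without a time zone, and
a leap second is folded back to second 59 because Python's datetime cannot
represent second 60.  Python also stops at the year 9999, which the
grammar does not.

```python
import re
from datetime import date, datetime, timedelta, timezone

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

DATE_RE = re.compile(
    r"(?P<dd>[1-9]|[12][0-9]|3[01]) (?P<mon>[a-z]{3}) (?P<yyyy>[0-9]{4})"
    r"(?: (?P<hh>[01][0-9]|2[0-3]):(?P<mm>[0-5][0-9]):(?P<ss>[0-5][0-9]|60)"
    r" (?P<zone>[+-](?:[01][0-9]|2[0-3])[0-5][0-9]|local))?",
    re.IGNORECASE,   # tokens compare without regard to case
)


def parse_tcf_date(text):
    """Parse a <Date>; returns a date, or a datetime when a <Time> is given."""
    m = DATE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a TCF date: {text!r}")
    year = int(m["yyyy"])
    if year < 1900:
        raise ValueError("the grammar starts at 1900")
    month = MONTHS.index(m["mon"].lower()) + 1     # ValueError if unknown
    day = date(year, month, int(m["dd"]))          # rejects e.g. 31 Feb
    if m["hh"] is None:
        return day
    second = min(int(m["ss"]), 59)                 # lossy: leap second folded
    zone = m["zone"].lower()
    if zone == "local":
        tz = None                                  # wall clock time, zone unknown
    else:
        offset = timedelta(hours=int(zone[1:3]), minutes=int(zone[3:5]))
        tz = timezone(-offset if zone[0] == "-" else offset)
    return datetime(year, month, day.day,
                    int(m["hh"]), int(m["mm"]), second, tzinfo=tz)
```

The regular expression enforces only what the BNF says; the check that the
day makes sense for the month and year is left to the date constructor.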

Checksum
--------

Each TCF is formed so that when considered as a polynomial over the
integers mod two, the entire TCF is congruent to 0 modulo

  x^32 + x^7 + x^3 + x^2 + 1.

This is done in big-endian order.  That is, the very first byte in the TCF
is taken to be the highest order 8 terms in the polynomial, and the very
last byte is the lowest order 8 terms.  The lowest order bit of the last
byte gives the coefficient for the x^0 term.

(This is a fairly standard 32-bit CRC that any textbook on error detecting
and correcting codes will explain in better detail than I can do here.)

(This particular 32-bit CRC was chosen not because of its error correcting
properties (which are not needed for this application), but because it is
particularly fast to compute one byte at a time.)
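
The remainder described above can be computed as in the Python sketch
below.  The first function follows the definition one bit at a time.  The
second is a guess at what "fast to compute one byte at a time" might mean:
because every term of the polynomial below x^32 has degree 7 or less, the
eight bits shifted out of a 32-bit remainder can be folded back in with
four shifts and no lookup table.  The function names are made up, and the
seal function at the end rests on a further assumption that this memo does
not confirm, namely that TCF-Checksum is written last, in the counted
form, with a four-byte binary value.

```python
MASK32 = 0xFFFFFFFF
POLY = (1 << 32) | (1 << 7) | (1 << 3) | (1 << 2) | 1   # x^32 + x^7 + x^3 + x^2 + 1


def remainder_bitwise(data):
    """Remainder of `data`, read as a big-endian polynomial, modulo POLY."""
    rem = 0
    for byte in data:
        for shift in range(7, -1, -1):          # high-order bit first
            rem = (rem << 1) | ((byte >> shift) & 1)
            if rem >> 32:                       # an x^32 term appeared
                rem ^= POLY
    return rem


def remainder_bytewise(data):
    """Same remainder, taking one byte per step."""
    rem = 0
    for byte in data:
        top = rem >> 24                         # the 8 terms about to pass x^32
        rem = ((rem << 8) & MASK32) | byte
        # top * (x^7 + x^3 + x^2 + 1); the product has degree at most 14,
        # so it needs no further reduction.
        rem ^= top ^ (top << 2) ^ (top << 3) ^ (top << 7)
    return rem


def is_valid(tcf):
    """True if the capsule is congruent to 0, as the Checksum section requires."""
    return remainder_bytewise(tcf) == 0


def seal(prefix):
    """Append four bytes that make the whole congruent to 0.

    ASSUMPTION: `prefix` ends with b"TCF-Checksum;4:" and its TCF-Length
    already counts the four bytes added here.
    """
    tail = remainder_bytewise(prefix + b"\x00\x00\x00\x00")
    return prefix + tail.to_bytes(4, "big")
```

Note that no initial value or final inversion is applied, since the text
asks only that the capsule itself be a multiple of the polynomial.  A
checksum written in the first field form would have to avoid ^J and ^M in
its value, which this sketch does not attempt.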