Persistent Object Storage

This design is intended to meet a number of (possibly contradictory) requirements:

object storage
virtual memory
multiuser protection
protection against accidents in an open system
package distribution
name spaces
versions and infinite undo
replication
distributed computing
disconnected operation

See http://www.merlintec.com/lsi/persist.html for more details on the motivation for a persistent object store system and information on an earlier design.

Sources of inspiration (or at least confirmation of ideas developed independently) include the Amoeba OS, Coda file system, several modular Smalltalk efforts including those in Squeak, Us language, EROS, many persistent store projects such as PerDIS at SOR/INRIA, Xanadu, distributed database projects such as Bayou at Xerox Parc and so on.

The design is not finished - in particular there are some problems with the viewpoint system to allow subjective programming (needed for some of the above requirements) and these will be detailed below in red letters. [The text below will have to be rewritten to be compatible with the new design for efficient viewpoints described in http://groups.yahoo.com/group/self-interest/message/1230]

Context

Most hard disks contain three very different kinds of information all lumped together: things created on the local machine, things copied from other machines and multimedia stuff. Obviously, video and audio could also be separated into locally created and remotely created, but we won't do so here. In fact, we won't consider multimedia data at all from now on, including binary data such as pictures or even just text.

The problem with mixing up locally created data with that copied from other places is that there never seems to be enough disk space. But imagine that we separate them and store all possible versions of local data (as in a logging file system - nothing is ever overwritten) but treat remote information as a fixed size cache (since it can always be copied again if it is needed after being discarded to make space for something else). When will the disk fill up? Never, in practice, for hard disks only last a few years and then must be replaced. When they are, the smallest available model is always much larger than the old one, so a computer's disk space inevitably grows with time. And it grows at a faster pace than the local data generation rate.

It is nice to also grow the local cache size as time goes by, but it only needs to be large enough to contain the "working set" of the local machine and does not need to keep up with the accelerating growth of remote data (the number of remote machines grows every day, so this is not proportional to the local generation rate).

If local data is kept around forever once created, and remote data is being treated likewise somewhere else, then there is no need for any kind of global garbage collection system. This is a key aspect of this design.

Segments

A single logical object actually exists as a set of "pieces", and a group of these pieces (each from a different object) are stored together in a segment (or region, or cluster, or soup...) on disk. Each segment has a globally unique ID, and an object (piece of an object, really, but we will use this term from now on except where the difference is important) can directly point to objects in the same segment. The header of each piece can include a {segment ID, object number} pair to indicate another piece of the same object in a different segment (more on this below). For an object to point to another in a separate segment, it must have a local piece which may be empty.

The pieces that make up symbol objects (in Smalltalk terms) are not directly linked like described above, but are instead matched by their contents. The effect is that of a "virtual shared space" for symbols. In the first version, this content matching will be a simple string comparison but the idea is to move to something like UN's Universal Network Language (UNL) system so code created by programmers with different native spoken languages can work together.

Segments are highly compressed - each object is byte aligned but bit encoded. Pointers to other objects are indicated as small integers: 0 for the same object, 1 for the next object, -1 for the previous object and so on.

Although each object is (in theory) totally independent, they can be grouped into "clone families" to save space. This is done using "maps" which store common information such as slots names and constant values. Only the values of the data slots are stored in the objects themselves. If the shared information ever does need to be changed, the object gets a new map (other objects using the old map continue to use the old map). These maps are used as type information for the compilers to be able to do their optimizations, but they are a purely run time structure and don't exist in the segments. What does exist to help compress the segments is having an object be derived from another local object (this uses a separate pointer from the cross segment pointer described above and below). The second object is encoded normally, but the derived one just lists what differences it has relative to it. Normally they will share a map when brought into memory. When an object is cloned, only the pieces in the current viewpoint (see below) are created and they can be coded as derived from the prototype object in each segment. If a piece of the clone is needed in a new viewpoint, it can be created by deriving from the prototype in that segment as well (if the prototype object does have a piece in that segment).

On a conventional OS, segments are stored as files in a known location (alternatively, a set of them can also be bundled into a single executable file along with the boot code for more convenient distribution). The names of the files are not important and changing them has no effect on system operation. The system scans the directory when starting up and builds a file name to segment ID table. Whenever segments not in this table are needed, they are brought in over the network (first local "peers" are polled to see if they have a replica, then globally known servers are queried).

In Memory

Whenever an object from a segment is needed, the entire segment is loaded into memory in a "compressed storage" area. Then the individual objects are "swizzled" into main memory, being uncompressed in the process. Segments are strictly read only, so the fact that a changed object would be a different size when compressed (unswizzled) doesn't cause a problem since it is not written back in place but must be saved in a new segment.

As the segment is loaded, an object table is created for it. This object table stays in a fixed address as long as the segment is loaded and the address of an entry (in C terms, &ot7[12] for object 12 of segment 7) is considered to be the swizzled virtual address of that object. This allows objects which have not yet been swizzled (or have already been unswizzled) to have a known virtual address that can be used when expanding objects that point to it. Note how easy it is to convert pointers between the compressed representation (cp) and the virtual address (vp) for an object with a known virtual address (va):

     vp = va + (objectTableEntrySize x cp)

Problem: this adds an indirection to each object access, just like the typical "handle" scheme used on the Mac and Windows as well as early Smalltalks with object tables. This is a very bad thing in modern processors since their performance is mostly memory access limited. For the Tachyon processor this is not such a problem because the cache is virtually addressed (the indirection only happens on a cache miss and not on every object access). A more direct design would be better on most processors, however.

The header of the piece points to a piece of the same object in a different segment. We call the first the "derived object" and the latter the "base object". We say that the derived object patches the base object. Whenever the derived object is swizzled into main memory, the base object (and its base object, if it has one) must be too. Problem: in this design each piece of an object has its own virtual address when in memory. That makes testing for identity ('==') more complicated. This would be fatal in the case of symbols, given their role in message passing, so they receive special treatment when building the object table and during swizzling.

Viewpoints

The lookup algorithm for messages sends takes the "current viewpoint" into account. That is simply a pointer to one of the loaded segments. The first object in every segment is the special viewpoint object, so the virtual address &ot7[0] is the low level representation for that viewpoint. Like other object pieces, a viewpoint can also be derived from another viewpoint. This is fixed when the segment is created and can't be changed later.

Most message sends are done in the sender's context, which means that the message receiver's pointer is changed to indicate its piece in the current viewpoint's segment. If there is no such piece, then one in the current viewpoint's base viewpoint can be used (and if there is not one there either, the search may continue until we reach a viewpoint with no base. If the search fails, then the pointer remains as it was). Problem: a naive implementation of this would be terribly slow. The value of the current viewpoint can be changed explicitly during the execution of an expression:

        biasedView do: [some expression]

Sometimes we need the message send to be forced to happen always in a given viewpoint for a certain object. That can be done by a "sticky pointer" to that object. This could be encoded using some bit in the virtual address or, if sticky pointers are rarely used, by having a special proxy piece which points to the real one.

Two different things that you can do with a segment are to replicate it and to copy it. We will talk about replication below. When you copy a segment, you can set it to be derived from a different segment than what the original is derived from. Each of the pieces being copied may also have its derivation changed. Problem: how is this decided? This is a critical issue since changing how objects "look" (by changing their base piece) is the main motivation for creating copies of viewpoints (segments). This is related to the sticky pointer described above and to the relative/absolute references in spreadsheets and other systems. The copy has an entirely different ID from the original.

Replication and Versions

When segment is needed in more than one machine, a replica of it is created. The replica has the exact same bytes as the original, including the global ID. But the first instance of the segment is called the "master replica" and all the others are "slave replicas". A master replica can't be deleted, but slaves can be eliminated at any time. This implements the local/remote data separation described in the project context above. It is possible to convert some slave replica into a master at the same time the master becomes a slave, and this allows the old master to be deleted. Problem: how can this be done atomically?

All segments are read only (actually, append only as long as they have never been replicated). To change an object you must create a new segment derived from the current one and place a new piece of the object with the changed values. This new segment's ID is totally unrelated to the old one's, but a version system keeps track of their special relation. The new segment works very much a like a "write log" in a journalled file system.

The approach described above would have serious performance problems since any logical segment would be broken up into a larger number of physical versions and the objects would be broken up into dozens of little pieces, most patching repeatedly the same information. So an option exists to create a "compacting copy". The resulting segment looks the same as the original, but all pieces are merged into a single new one. We have essentially compressed the history (log) into just the current state. The system should work reasonably well if only compacted segments are replicated and the others are kept private.

A piece can only point to one other piece (a kind of single inheritance) and in the case of a non compacted segment that other piece is always the previous version of the object. This means that objects can have very different addresses in each new version. When a compacted segment is created, its pieces point to whatever the pieces in the previous compacted segment pointed to, short circuiting all the "temporary" non compacted versions. But this means we no longer have a way to know what an object's previous address was, and we will need this information when updating pointers from other segments. The solution is to add a reverse mapping table to the previous compacted version. We eliminate from it any information that is also available in the newly created version. Based on experience with systems such as CVS, the current version plus the set of all previous compacted version (compressed this way) won't take up much more space than two to three times what the current version would need by itself.

In a sense, there is no global garbage collection since any object exists forever once created. But an object that is not in active use tends to stay behind in an early version of a segment while the others move on. The compression described above will mean that an old version of a segment will have a mapping table, some patches to convert newer objects into the prior version and full copies of "presumed dead" objects. These old versions will only exist on one machine (eventually even moved to offline media) and so these objects are effectively removed from the internet. But the probability is not zero that there is a pointer to them lurking somewhere and that someday a computer will try to access them from the viewpoint of a newer version of the segment. In that case we need to migrate them, which will effect the mapping tables and data in the old segment.

It will also have an effect on the new segment - the old object must be added to it. But if this is a slave replica (or a slave replica has already been created), then the segment no longer accepts appends. If it did, the replicas around the world would no longer be identical. A very nasty problem would be if different machines "promoted" different sets of old objects by themselves - the same object could have different addresses in different replicas of the same version of a segment. So each computer creates its own local segment derived from the read only segment and uses that as the place to store promoted objects. Eventually, a replica of this "appendix" is sent to the machine with the original master replica. This machine uses this information collected from all over the world in order to know which objects to promote when creating a new compacted version. It also creates a mapping table for each appendix to the new version so each machine can migrate, thus bringing them all back in sync (at least as much as that is possible in a global system).

Conflict Resolution

It is possible that more than one computer is creating derived segments from local replicas of a single segment at the same time. When these computers are connected by a high speed network, it is worth the effort to try to coordinate these changes. But it is very often that this isn't the case, so versions end up forming a tree structure. But it is very desirable to be able to merge branches back into a single version. The system can automatically combine different viewpoints into a single one for certain combinations. It also provides hooks so that application specific code can handle a few other combinations. When automatic conflict resolution fails, then changes must be manually integrated by the user into a given "main branch".

Reflection

The "R" in Self/R is for "reflection", so it is obviously an important aspect of the system. In previous designs meta-objects were used, but the current idea is to have "viewpoint modifiers" instead. These look exactly like regular viewpoints described above to the programmer, but they are implemented differently. They have no associated disk segment and creating a new one derived from a regular viewpoint is a very lightweight operation. The pieces they give access to are actually stored as extensions of the regular pieces in viewpoint segments. Most objects don't have these extensions and instead have a standard one supplied by the viewpoint modifier.

Where mirrors would be used in Self:

   ((reflect: obj) at: 'color') value

we would have a viewpoint modifier in Self/R instead:

   here seeSlots do: [(obj at: 'color') value]

"here" is the name of the current viewpoint (like "self" is always the name of the current object) and in the example it received the 'seeSlots' message which caused it to create and return a new viewpoint derived from itself and which modifies objects to act as a set of slots.

There are several other reflective modifiers like the one above, but a very crucial one is 'seeParents'. Parent slots are not normally visible like in Self but are hidden inside this special viewpoint modifier. The lookup algorithm has optimized access to them, of course, but regular code must make a special effort to deal with them. There is no need to put a "*" at the end of parent names since they effectively live in their own namespace.

Meeting the Requirements

How well does this design address the requirements listed at the beginning? Let's check:

object storage - certainly the system allows objects to exist on disk while the computer is off.
virtual memory - the logical "image" formed by the sum of all segments is much larger than the computer's main memory. Only a small set of segments are actually loaded at any one time and only a subset of the objects they contain are expanded from the compressed form.
multiuser protection - though there is logically only a single image in the world, users can have their own particular viewpoints. By controlling the distribution of pointers to viewpoint objects, a very interesting "capability based security" system can be created. This also allows code to be executed in a "sandbox" so a user can be protected against damage by untrusted applications.
protection against accidents in an open system - it should be possible to change everything in the system, but this should not happen when it is not intended. A user should not accidently redefine the color "black" and suddenly be unable to read any text on the screen. Using the capability system mentioned above, dangerous changes will require explicitly requesting special viewpoints before they can be made.
package distribution - the logical image is the sum of all segments created in all the machines in the world. But only a small fraction of those are resident in the local disk. When a new segment is created and a program is developed in its context, all the bits needed to run it are either included directly in the segment or in segments pointed to (directly and indirectly) by it. To allows another user to use the application, all that must be done is to give that user the segment ID and to place a replica of the segment some place where that user can download it from.
name spaces - message sends to a single object can have different effects depending on the current viewpoint.
versions and infinite undo - the version system was explicitly mentioned above and it allows a read only peek at any point in the past. Undo is implemented by selectively copying objects from the past into the present state of the system. Note how if I look at yesterday's object in segment A, these will be pointing to yesterday's objects in segment B and not the current version (we interpret past pointers as sticky by default).
replication - this is a basic part of the design and is also described above.
distributed computing - the system would have to be modified to allow remote references, but this would be easy: just allow an empty piece of an object to serve as a proxy for the rest of it on another node. Some other small adaptations would be required, but there is a lot in common with this design.
disconnected operation - the part about merging different branches of object modifications is a key aspect, but another would be to have an agressive "hoarding" of remote segments so that the network will not be needed at all when working unconnected.