Background

When Smalltalk was chosen as the foundation for Merlin in 1984, its particular system of persistent storage was considered the major obstacle to making a commercial system. The combination of image/source/change files had a simplicity that made it well suited for the laboratory, but several flaws made it unreasonable to expect normal users to put up with it:

It wasted space on disk since each user needed a separate image file, which would grow ever larger as new applications were filed in, even if every other user already had a copy of that same application.
images aren't very hard to corrupt. A good developer will always have backups and be able to use the changes file to restore even the latest information. A normal user could easily lose everything without the help of some expert, specially if panic sets in.
PC users, back then, were used to simply inserting a floppy to instantly have a new application running in their systems. It didn't seem likely that they would be happy about going though complex installation proceedures (fileIn). All other systems have since evolved to place this burden on the users, so this would seem to no longer be a problem. But Merlin is supposed to go beyond other OSes, so it would still be nice to have a better solution for this.
Since each user had a separate image, the only way to share things was to fileOut in one system and fileIn in the other. If fileOut were extended to handle all objects in the image (as was done in systems like BOSS in ParcPlace Smalltalks) then it could even work, but it would be far too cumbersome.
Since users had separate images, they could only ruin their own things and not effect anyone else. But it would be nice to have some protection against accidents.

So the answer seemed to be to base Merlin on a distributed persistent object store.

Persistent Objects

File systems are second nature to computer experts and power users but are one of the largest obstacles for new users. To someone who is just learning about computers, it is not obvious that you have to load a file from disk to memory to work on it and save it back when you are done. In fact, the very ideas of memory and disk are often confused.

A virtual memory system can help the user interface present a collection of objects which can be directly manipulated by the user without the normal complications. You should be able to turn on the computer, change a text or drawing and turn it off again knowing that tomorrow it will be exactly as you left it, as normally happens in the real world.

The technology to make this work (in particular to make the object structure as fault tolerant as possible) is not very simple. And it can't always stay hidden from the user - if some objects must be moved to a floppy so they can be carried to another system, then obviously location transparency must go.

Programming

A major advantage of a persistent store is a huge simplification for the programmers. Objects can be directly created and used, rather than writing code to do so when the application starts or coming up with some weird file (the "poor man's persistent store"). Self's object model is very concrete and meshes very well with a transparent storage system.

With persistent objects, the programmer can make full use of all available tools to get the job done. He might make a picture with one or two bitEditors and then execute "Hero pict: Form fromUser" to make it a part of the application.

Implementation

Two different persistent storage systems have been designed for Merlin. They are very similar, but the first one uses "direct pointers" while the second depends on the indirection of an object table. Only this second one will be described here.

All information in the system is stored in a single, world-wide "object system". This is divided into a number of "regions" (or segments, or soups, or...), which introduces a two level addressing scheme. Objects in some region refer to each other by their local addresses, while objects in other regions are referred to by inter-region pointers (explained in more detail below).

Each region has a unique global ID. It is composed of the creator's public key plus a time stamp indicating when the region was created. The chances of two different regions having the same ID is essentially nil. Regions may be replicated to enhance performance and availability, which is why the ID gives no indication of where to find it. When Merlin runs on top of another OS, it stores regions as files with the .MRL extension in that system's native file format. These local files are scanned when Merlin starts up and a table associating global IDs to local file names is created (and the boot region is loaded in automatically). If a region whose ID is not found in this table needs to be loaded, then all Merlin systems in the local network are polled to see if they have a replica. If they don't, then a world-wide directory is consulted (much like the DNS system) and a replica is brought over the net. The sum of all regions in the world forms the equivalent to a single image shared by everyone, but since not all replicas might be equally up to date it will rarely (if ever) be totally consistent.

The creator of a region can issue "keys" to allow other users to have read and/or write access to the objects in that region. These keys identify the user, the region, the permissions, a time limit for which the key is valid and a hash value to prove that the key is authentic. Future implementations will use public key cryptography for this, but initially it will just be a one way hash of the key contents plus a secret number stored in the region itself.

The very first object in every region is the top level of a RegionObjectTable. For small regions, a single level table is enough. As stored on disk, objects point to others in the same region by including their object table index. Each object table entry also includes a generation number which is incremented whenever the entry is reused to detect obsolete pointers (which should never happen). To point to an object in another region, the local region must include a valid key to that region as a local object. The inter-region pointer has the index of this key plus the index of the object in the other's region RegionObjectTable. It is possible for the key on which a pointer depends to expire and render the pointer invalid. Before that happens, the system can request that the key be renewed. This request lists which objects in the remote region will be referred to using the new key. RegionObjectTable entries always keep track of the latest expiration date of any remote pointers to the object so it can perform a kind of very slow global garbage collection.

When an object in memory tries to send a message to an object in a "new" region, this region is either mapped into the local address space (with pointer swizzling) if no one in the local network has it opened with a key that has write permission. Otherwise the message is sent to that workstation via the network. Things are much more complicated when users in different networks update the same region at the same time. These changes propagate to all replicas of the region with time, eventually causing a conflict. Automatic conflict resolution is a topic for future research.

Regions are swizzled in one object at a time. First, the page table pages pointing to each object must be brought into memory (if they aren't there already) in a non movable position. For each page table page loaded, an additional area is allocated for containing the physical addresses in RAM (or a special marker for objects which haven't been loaded). Each pointer (intra or inter-region) is patched to point to the object's entry in this additional area, so access to loaded objects is always done via a simple indirection no matter how many levels the object table has. Every time access is attempted on an object that is not yet loaded, the object tables are followed to find the right pages on disk (or RAM, if that page was already loaded due to another object).

Given a memory (indirect) pointer to an object, its page table entry can easily be found. This allows us to find the object's region and disk address, which is needed during unswizzling. This process sometimes leads to the creation of new inter-region pointers (by the way, this creation of new inter-region pointers is the major unsolved problem in the "direct pointers design", but it doesn't cause any complications with object tables).

All changes to objects are saved in a "write log". That way, a crash right in the middle of a disk write is not catastrophic. This makes it possible, but not easy, to roll the system state back to the most recent consistent state before the crash. The write log plus the fact that the original region data remains intact makes it possible to recreate previous states in the current session, allowing a limited systemwide undo. When an user logs out, any regions that were changed are "commited" and garbage collected at the same time. If this local garbage collection frees and object that still has inter-region pointers to it, then the object is tagged for migration to the next region that "touches" it. This eventually collapses inter-region cycles, making the local garbage collector reclaim it when it becomes garbage.

RAM

All this "regions" stuff is for old objects. New objects exist only in memory - only by placing a reference to them in an already persistent object will they be moved to some region and get saved to disk. New objects are dealt with by an in-memory GC.

On each node in a network, there is one system execution context (called the "boot session") and one or more application execution contexts (called "user sessions"). All sessions on a node share a single 32 bit address space, but each one has different read/write permissions for segments or pages. Only the boot session can directly change the memory map and access the disk based regions. User sessions must request service from the boot session directly (by sending a message to a boot session object) or indirectly (by causing a page fault) in order to manage memory.

back to:

please send comments to jecel@lsi.usp.br (Jecel Mattos de Assumpcao Jr)