Database Engineering

Some time ago I was browsing the web and I came across a page that was actually a mid-term assignment on a university: write a database engine that understands the following commands:
  insert into table values [ ... ]
  delete from table ...
  select * from ...
  update table ...
I thought to myself, "I can do that", but doing it efficiently is not as easy as it may seem. This paper describes ideas on how to write your own high performance database engine.

Some people will ask 'Why?' (some always do), because you can interface with already existing database software, such as Oracle, DB2, PostgreSQL, and many others. While this is true, it does not seem efficient to me to install a heavy-weight database system just to catalogue your CD collection. Or, for example, to register users of a particular online service you are offering. In general, you would like to write an application that uses a database containing only a couple of hundred or perhaps thousand entries. My opinion is that writing dedicated code for this is way faster than using generalized heavy-weight existing database software. I shouldn't have said 'dedicated' here, because what we really want is something more flexible than that; preferably something that can be as flexible as the heavy-weight database systems, but still be small and fast and efficient. Besides, developing stuff by yourself is fun.

First steps: Choosing a database format
One of the first things to do is to define a database file format. We want the datbase format to be easy to understand and easy to port to a different architecture, and therefore we choose ASCII. ASCII is nice. ASCII is good. I like ASCII, because you can view or even edit your files using a regular text editor. Be aware though, that you cannot have any glyphs in ASCII text; no symbols or characters that are not in the ASCII standard. Suppose you were developing a database system for storing Chinese names, then ASCII would have been a very poor choice. The german Ringel-S and the french accent-circumflex are not easily stored in ASCII, although you could resort to using some kind of escape mechanism to describe these characters in ASCII (read on).

Now that we have chosen to use plain ASCII text, we still haven't chosen a database format. A database format is the layout of the database file. Choosing the right format is important, poor formats may even degrade database performance. An example of a statically defined format is:

  John Doe
  Oak Drive #16384
  65535 AA
  Sunnyvale, California (the golden state)
Everybody (well, nearly everybody) will recognize this as an address. You just know that the first field is the name of a person, and that the second field is the street plus the number.
Databases are also often formatted as pipe-separated values:
  John Doe|Oak Drive #16384|65535 AA|Sunnyvale, California (the golden state)
Even though this format is popular, it does not make it any easier to use and it does not make the data more compact by putting everything on one line.
Problem with this data format is that it is not flexible; it is not possible to swap the fields, and it is also not possible to add more fields. There is no meta-data, which is data describing what the data actually means. The meaning of the fields are defined in hard-coded structures in the computer program that generates this data, which knows exactly how to handle each individual field.
To improve flexibility, we should add some meta-data. A better format would be:
  Name: John Doe
  Street: Oak Drive #16384
  Postal Code: 65535 AA
  City: Sunnyvale
  State: California (the golden state)
Now that we have the defined the meaning of the fields, we can easily add more fields, and we can more easily search for specific fields (eg., suppose you'd want to know how many people live in Sunnyvale, then you would only be interested in the 'city' field).

A small problem arises when you want to add multi-line fields, fields that can contain multiple lines of text. No matter how you solve this problem, the real solution is to specify how it works in the file format definition. (If it is left as an undocumented feature, no one is able to use it).

  Curriculum: John Doe studied Computer Science at Carnegie Mellon University, where he
   got his Master's degree at the age of 17. Then he started his own company with which he
   became a filthy rich bastard. One day he went to Las Vegas and lost it all at the poker table.
   So he jumped off a cliff.
   .
   His hobbies were: surfing, fishing, and hacking governmental web sites.

In this example we've added a few things:
- we defined a new field named 'Curriculum'
- we defined that a multi-line value begins with a space
- we defined that an empty line in a multi-line value begins with a space and consist of only a single dot (which is because...)
- we defined that the end-of-record indicator is an empty line

This exact format is actually being used by the Debian GNU/Linux packaging database. Note how indentation is being used to format the file.
Nowadays, XML is gaining popularity. XML looks a lot like HTML (or in fact, SGML):

<record>
  <name>John Doe</name>
  <street>Oak Drive #16384</street>
  <postal code>65535 AA</postal code>
  <city>Sunnyvale</city>
  <state>California (the golden state)</state>
  <curriculum>John Doe studied Computer Science at Carnegie Mellon University, where he
got his Master's degree at the age of 17. Then he started his own company with which he
became a filthy rich bastard. One day he went to Las Vegas and lost it all at the poker table.
So he jumped off a cliff.

His hobbies were: surfing, fishing, and hacking governmental web sites.</curriculum>
</record>
As you can see, XML does not only have end-of-record markers, but also explicit end-of-field markers. In our previously defined format, the end-of-field markers were implicit (newline). While the use of explicit field markers is much clearer, there is also a lot more overhead. This is a bad thing, although now there is no special case for reading multi-line values. Other good things of XML are: the identation does not matter, you can read an XML database as a stream. XML also defines a way to represent special characters, that are not present in the ASCII table. This is done in exactly the same way as in SGML.

While XML looks very professional, I often still prefer the line-mode format because of its simplicity, and I really dislike the unnecessary overhead of XML. (How do you like to fit an old dinosaur in 48K of memory? Nowadays, people don't bother about such things anymore. Why isn't the end-of-field marker something simple like '</>' ?)
Anyway, at this point it should be clear that you have to make a choice. If you are productive, you can have your database system work with multiple formats -- all you really have to do is write routines that can read and write the records in the specified formats.

Need for speed
Database applications need to be able to do fast lookups. If we can't do fast lookups, then we're better off paying lots of money and using Oracle anyway. The art of being fast is the art of indexing and of data partitioning.
An example: The mailman (or postman, if you prefer) has a letter for number 37. He is currently at number 30. The mailman (or postman) will soon be at number 37 to deliver the mail. A computer program or a robot could easily do this as well, because the indexing is done properly.
Another example: The mailman is on Maple Drive. He has a letter for Oak Drive, but doesn't have a clue how to get there. This is an example of bad indexing.

The following example explains data partitioning: suppose you have an online service running and you want to register all users. You do this in a pretty silly but convenient way, by creating a file in the filesystem for each user. You use the following directory structure:

  /home/walter/mydb/ ...
  users/
  users/A/
  users/B/
  users/C/
  ...
  users/Z/
The data has been split up, or partitioned, into 26 pieces (A to Z). You index the data by its capital. The file for user John Doe can quickly be found by going into the users/J/ directory and looking for the file there, rather than searching from the top.
(As you may have noticed, the filesystem can be an extremely efficient database if its directory structures are well organized. Also note that the filesystem is a tree-like structure with N branches at each node).
Data partitioning is especially important in case you are developing a parallel database. Because data partitioning scales in a linear fashion, parallel databases can grow to massive proportions, that can outperform any other database system -- if the data has been partitioned properly.

Another way of getting more speed out of your database system is to do caching. The one rule you have to remember is: disk is slow, memory is fast. Suppose you wanted to request the record for John Doe ten times in a row, a dumb system would load it from disk ten times. A clever system would load it from disk only the first time, remember it and use the in-memory copy for the other nine times.
(Modern operating systems actually do this for you, but you will still notice that writing a user-level cache for this is better than relying on the OS to do it for you).
Mind that you will need to keep an in-memory structure in which you can quickly find the requested record, if present in memory at all.

If your database is small, you can even load the whole thing in memory at startup and never read records from disk again. This may seem attractive, but if your database ever grows, it might grow out of available system memory sooner than you think. You also might be loading a lot of data into core memory, while only a small percentage of all this data is actually being used by the end user. This would be wasteful.

A technique I prefer is keeping an in-memory index. In-memory indexes work like this: you keep an index containing the file positions of all the records of your database files in memory at all times. Unless your database becomes very large, it should be possible to keep the entire index in memory. The index can be generated at startup, and generating an index won't take long (unless you have an extremely large amount of records, at which point you should resort to using Oracle or DB2 or something else). In the in-memory index, you also keep the primary key of the record. If you want to be able to search the database by more fields than just the primary key, then you should keep another index, in the most expensive case keep an index for each field of a record. Now, if a user requests a record, you can quickly look it up in the index, and use the file offset specified in the index to load the record from disk. You will want to cache this record as well, in case the user requests the record again. In pseudo-C:

  typedef struct {
      char key[80]                /* key of the record */
      long offset                 /* offset of record in database file */
      record_t *cached            /* if not NULL, record is cached in memory */
  } index_t
This index is still consuming too much memory, because the index_t structure is too large. A better choice might have been char *key. A really good choice would have been unsigned long key; calculate a hash value for each primary key and keep that in the index.

The index itself has to be stored in a structure that allows us to do fast searches, good choices would be: a binary search tree, an AVL tree, or a hash table. I am leaving you a bit in the dark here on which one would be the best choice. "It depends." If you are doing lots of searches, an AVL tree would be a good choice. If you are doing lots of insertions, a regular binary tree would be a good choice. A hash table can be a good choice, if you handle collisions efficiently (eg. by making circular doubly linked lists of collisions in the hash table). The advantage of hash tables is that you are already partitioning the data (remember reading about data partition just yet?). Growing or resizing a hash table is something you should avoid. If your database is sorted and read-only, a linear array could make a very simple but efficient index (use bsearch()). Okay, if you really want me to make a choice, I would choose a large hash table (question rises: How large is large? With a hash table, you should ask yourself "How fast do I want records to be found?" and "What is an acceptable number of collisions to traverse before a record is found?") consisting of binary trees of collisions. Because you have to calculate a hash address for searching an item in a hash table, you might be faster putting everything in a binary tree. I am gambling here that a calculation can be faster than traversing a tree (as said, proper data partitioning can really speed things up). In practice, this would depend on how large your database is and how complex the hashing algorithm is. I do not recommend using a strong authentication hashing algorithm such as MD5 or SHA-1. Instead, use something simple like CRC-16 or even better: roll your own simple-but-effective hashing algorithm. For ASCII data, 'shift-left 6 and XOR' does the job just fine.

When you are caching records (or objects) in memory, mind that you also need a good algorithm to expire objects from the cache. Generally you would use an LRU algorithm (Least Recently Used element expires from the cache), implemented as a linked list. To handle this linked list efficiently, I suggest you use a circular doubly linked list, so that you can easily add elements to the tail and also remove elements from the head (or the other way around, whichever you prefer). Each list element should be a reference to an index element, that has the pointer to the cached record. The reason for choosing this structure is that this way you can do all the all things you want to be able to do without corrupting memory:
- searching records in a fast manner
- caching records
- expiring records from cache in a fast manner
- putting reused cached records at the end of the cache list (in a fast manner)

The piece of code that returns a database record from the database layer to the application layer should make a copy of the database record. This is to prevent memory corruption in case of cache expiration, and the database records won't be effected in case of a bug in the higher levels of the application code.
There should also be some routines to quickly convert to and from an ASCII representation of a record to an in-memory record, since you wouldn't want to store integers and dates as plain ASCII all the time (now would you?).

purge /reclaim
Deleting records from a database file is easy. Instead of truly removing the record, it should be marked for deletion. Later on, the space can be freed up by doing a purge action. Purging is not always easy. If the database records have a fixed size, you should keep a list of deleted records and then overwrite those records by new records (simply reuse the space). If the records are not of a fixed size, purging can be a pain, depending on the database file size and the amount of deleted records. An easy solution would be to take the database offline, copy all valid records to a new database file, omitting the deleted records, and then move the new database file over the old one. Finally, restart the database engine.
Taking the database offline is often a problem, especially in our modern 24-hour economy. If your database is accessible over the Internet, there could be someone out there, somewhere in the world, at any time of the day who would want to access your database. To accomplish a 24-hour-a-day service and still do purges, the database engine could do a live purge in a running database. The way to go is to start a new database, and build up a second in-memory index for this new database. The old database does not become 'frozen', instead both databases will be active while the purging action is running.
When a user requests a record, first search the new index. If the record is not found, search the old index. If the record is found, copy it to the new database. If the record was not found or was deleted in the old database, tell the user that the record was not found. Meanwhile, while there are no pending user requests, go on with copying records from the old database to the new database until all old valid records have been transferred to the new database. While the new database is not yet complete, all delete actions must be done on both databases, as both copies are active at the same time. When the transfer is complete, all the 'old' deletes will have been purged. The old database can be closed off, and the database engine can continue to run with just one active copy.
The same technique can be used to make backups (or rolling snapshots) of active databases. What they do is create an active copy of the database, and at the point where they are synchronized, one copy is closed off. It is like running a mirror, and shutting the mirror down so it can be backed up to tape or moved to some other long term storage. As you can imagine, if the database is large, it may take quite some time before a snapshot is completed. Also note that you always need twice the memory and twice the disk space to safely carry out a purge or snapshot action.

Journalling and rollback
Every computer user has once lost some data due a bad floppy disk, system crash, or power failure. These malfunctions are particularly bad in case a database update was running. If database corruption occurs, the application is usually likely to just crash, and the end user may not able to easily reproduce the valuable data that was in the database. Therefore, all modern database systems have extra safety mechanisms included to keep the database from becoming corrupted.

Journalling is a mechanism that ensures the structure of the database file. Suppose the database engine is doing an update of a record, and then there is a power failure. When the computer restarts and the database initializes itself again, it should be able to see that was in the middle of an update event, and then repair the half-written record. Note that 'repair' is a bit optimistic, in practice it would remove the half-broken entry from the database file. How does it know it was doing an update before, and how does it know that the operation did not finish? The answer is simple; before doing the write, it writes a journalling record to a journal, which is just another file on disk. An operation is not completed until an end-of-journal marker has been written in the journal. The closed journal can then be overwritten by a new journal. An example of a journal is:
- event: begin journal => about to write record #1024
- event: writing record #1024, which is 256 bytes in length
- event: write of record #1024 finished sucessfully => end journal

For this technique to work, all write operations must be done synchronously. If the operating system would buffer the writes, one could not possibly know for sure whether a write made it to disk or not. This slows database systems down, but it is also possible to do deferred writing or letting another process do the synchronous writes in the background, while the application keeps on running. Synchronous writes can be done by using the flag O_SYNC wth the open() system call.

Analogous to this, many database systems do transaction logging, which is basically keeping a journal of all transactions performed by the database system. With journalling, the data to be written itself is (usually) not logged, just the meta-data of the structures, but with transaction logging the data can be logged as well. If the data contained in the transaction is also being logged, one could also include undo operations. This is called rollback. By keeping a history of all transactions being done, it becomes possible to do a rollback to a certain point in time. You could even rollback to the point where you inserted the first record into the database.
If you ever restore the database file from an old backup tape, you can apply the logs to bring the database back up to date again.

Conclusion
Writing a small and simple database system is not all that hard. Writing a sophisticated and robust system is quite a lot of work, though. Now that I've shown you how to do it, I'd say, have fun programming your own database system. You can make things as hard on yourself as you want to, really, but you will find that once you have started working on something simple, you will want to have the more advanced features as well. It might be interesting to keep some live statistics on the performance of your database system, and doing some tuning to get the most out of it.
While it still may be smarter to use existing database software, I feel it is important to know exactly how this software works, and knowing how to build it just in case it didn't exist. Furthermore, it is interesting from a purely computer scientific point of view.


If you really must, you can contact the author at walter at heiho dot net