insert into table values [ ... ] delete from table ... select * from ... update table ...I thought to myself, "I can do that", but doing it efficiently is not as easy as it may seem. This paper describes ideas on how to write your own high performance database engine.
Some people will ask 'Why?' (some always do), because you can interface with already existing database software, such as Oracle, DB2, PostgreSQL, and many others. While this is true, it does not seem efficient to me to install a heavy-weight database system just to catalogue your CD collection. Or, for example, to register users of a particular online service you are offering. In general, you would like to write an application that uses a database containing only a couple of hundred or perhaps thousand entries. My opinion is that writing dedicated code for this is way faster than using generalized heavy-weight existing database software. I shouldn't have said 'dedicated' here, because what we really want is something more flexible than that; preferably something that can be as flexible as the heavy-weight database systems, but still be small and fast and efficient. Besides, developing stuff by yourself is fun.
First steps: Choosing a database format
One of the first things to do is to define a database file format. We want the datbase format
to be easy to understand and easy to port to a different architecture, and therefore we choose
ASCII. ASCII is nice. ASCII is good. I like ASCII, because you can view or even edit your files
using a regular text editor. Be aware though, that you cannot have any glyphs in ASCII text;
no symbols or characters that are not in the ASCII standard. Suppose you were developing a
database system for storing Chinese names, then ASCII would have been a very poor choice.
The german Ringel-S and the french accent-circumflex are not easily stored in ASCII, although
you could resort to using some kind of escape mechanism to describe these characters in
ASCII (read on).
Now that we have chosen to use plain ASCII text, we still haven't chosen a database format. A database format is the layout of the database file. Choosing the right format is important, poor formats may even degrade database performance. An example of a statically defined format is:
John Doe Oak Drive #16384 65535 AA Sunnyvale, California (the golden state)Everybody (well, nearly everybody) will recognize this as an address. You just know that the first field is the name of a person, and that the second field is the street plus the number.
John Doe|Oak Drive #16384|65535 AA|Sunnyvale, California (the golden state)Even though this format is popular, it does not make it any easier to use and it does not make the data more compact by putting everything on one line.
Name: John Doe Street: Oak Drive #16384 Postal Code: 65535 AA City: Sunnyvale State: California (the golden state)Now that we have the defined the meaning of the fields, we can easily add more fields, and we can more easily search for specific fields (eg., suppose you'd want to know how many people live in Sunnyvale, then you would only be interested in the 'city' field).
A small problem arises when you want to add multi-line fields, fields that can contain multiple lines of text. No matter how you solve this problem, the real solution is to specify how it works in the file format definition. (If it is left as an undocumented feature, no one is able to use it).
Curriculum: John Doe studied Computer Science at Carnegie Mellon University, where he got his Master's degree at the age of 17. Then he started his own company with which he became a filthy rich bastard. One day he went to Las Vegas and lost it all at the poker table. So he jumped off a cliff. . His hobbies were: surfing, fishing, and hacking governmental web sites.In this example we've added a few things:
This exact format is actually being used by the Debian GNU/Linux packaging database.
Note how indentation is being used to format the file.
Nowadays, XML is gaining popularity. XML looks a lot like HTML (or in fact, SGML):
<record> <name>John Doe</name> <street>Oak Drive #16384</street> <postal code>65535 AA</postal code> <city>Sunnyvale</city> <state>California (the golden state)</state> <curriculum>John Doe studied Computer Science at Carnegie Mellon University, where he got his Master's degree at the age of 17. Then he started his own company with which he became a filthy rich bastard. One day he went to Las Vegas and lost it all at the poker table. So he jumped off a cliff. His hobbies were: surfing, fishing, and hacking governmental web sites.</curriculum> </record>As you can see, XML does not only have end-of-record markers, but also explicit end-of-field markers. In our previously defined format, the end-of-field markers were implicit (newline). While the use of explicit field markers is much clearer, there is also a lot more overhead. This is a bad thing, although now there is no special case for reading multi-line values. Other good things of XML are: the identation does not matter, you can read an XML database as a stream. XML also defines a way to represent special characters, that are not present in the ASCII table. This is done in exactly the same way as in SGML.
While XML looks very professional, I often still prefer the line-mode format because of its
simplicity, and I really dislike the unnecessary overhead of XML. (How do you like to fit an
old dinosaur in 48K of memory? Nowadays, people don't bother about such things anymore.
Why isn't the end-of-field marker something simple like '</>' ?)
Anyway, at this point it should be clear that you have to make a choice. If you are
productive, you can have your database system work with multiple formats -- all you really
have to do is write routines that can read and write the records in the specified formats.
Need for speed
Database applications need to be able to do fast lookups. If we can't do fast lookups, then
we're better off paying lots of money and using Oracle anyway. The art of being fast is the
art of indexing and of data partitioning.
An example: The mailman (or postman, if you prefer) has a letter for number 37. He is currently
at number 30. The mailman (or postman) will soon be at number 37 to deliver the mail. A computer
program or a robot could easily do this as well, because the indexing is done properly.
Another example: The mailman is on Maple Drive. He has a letter for Oak Drive, but doesn't have
a clue how to get there. This is an example of bad indexing.
The following example explains data partitioning: suppose you have an online service running and you want to register all users. You do this in a pretty silly but convenient way, by creating a file in the filesystem for each user. You use the following directory structure:
/home/walter/mydb/ ... users/ users/A/ users/B/ users/C/ ... users/Z/The data has been split up, or partitioned, into 26 pieces (A to Z). You index the data by its capital. The file for user John Doe can quickly be found by going into the users/J/ directory and looking for the file there, rather than searching from the top.
Another way of getting more speed out of your database system is to do caching.
The one rule you have to remember is: disk is slow, memory is fast. Suppose you wanted to
request the record for John Doe ten times in a row, a dumb system would load it from disk
ten times. A clever system would load it from disk only the first time, remember it and use
the in-memory copy for the other nine times.
(Modern operating systems actually do this for you, but you will still notice that writing
a user-level cache for this is better than relying on the OS to do it for you).
Mind that you will need to keep an in-memory structure in which you can quickly find the
requested record, if present in memory at all.
If your database is small, you can even load the whole thing in memory at startup and never read records from disk again. This may seem attractive, but if your database ever grows, it might grow out of available system memory sooner than you think. You also might be loading a lot of data into core memory, while only a small percentage of all this data is actually being used by the end user. This would be wasteful.
A technique I prefer is keeping an in-memory index. In-memory indexes work like this: you keep an index containing the file positions of all the records of your database files in memory at all times. Unless your database becomes very large, it should be possible to keep the entire index in memory. The index can be generated at startup, and generating an index won't take long (unless you have an extremely large amount of records, at which point you should resort to using Oracle or DB2 or something else). In the in-memory index, you also keep the primary key of the record. If you want to be able to search the database by more fields than just the primary key, then you should keep another index, in the most expensive case keep an index for each field of a record. Now, if a user requests a record, you can quickly look it up in the index, and use the file offset specified in the index to load the record from disk. You will want to cache this record as well, in case the user requests the record again. In pseudo-C:
typedef struct {
char key[80] /* key of the record */
long offset /* offset of record in database file */
record_t *cached /* if not NULL, record is cached in memory */
} index_t
This index is still consuming too much memory, because the index_t structure is
too large. A better choice might have been char *key. A really good choice would
have been unsigned long key; calculate a hash value for each primary key and keep
that in the index.
The index itself has to be stored in a structure that allows us to do fast searches, good choices would be: a binary search tree, an AVL tree, or a hash table. I am leaving you a bit in the dark here on which one would be the best choice. "It depends." If you are doing lots of searches, an AVL tree would be a good choice. If you are doing lots of insertions, a regular binary tree would be a good choice. A hash table can be a good choice, if you handle collisions efficiently (eg. by making circular doubly linked lists of collisions in the hash table). The advantage of hash tables is that you are already partitioning the data (remember reading about data partition just yet?). Growing or resizing a hash table is something you should avoid. If your database is sorted and read-only, a linear array could make a very simple but efficient index (use bsearch()). Okay, if you really want me to make a choice, I would choose a large hash table (question rises: How large is large? With a hash table, you should ask yourself "How fast do I want records to be found?" and "What is an acceptable number of collisions to traverse before a record is found?") consisting of binary trees of collisions. Because you have to calculate a hash address for searching an item in a hash table, you might be faster putting everything in a binary tree. I am gambling here that a calculation can be faster than traversing a tree (as said, proper data partitioning can really speed things up). In practice, this would depend on how large your database is and how complex the hashing algorithm is. I do not recommend using a strong authentication hashing algorithm such as MD5 or SHA-1. Instead, use something simple like CRC-16 or even better: roll your own simple-but-effective hashing algorithm. For ASCII data, 'shift-left 6 and XOR' does the job just fine.
When you are caching records (or objects) in memory, mind that you also need a good
algorithm to expire objects from the cache. Generally you would use an LRU algorithm
(Least Recently Used element expires from the cache), implemented as a linked list. To handle
this linked list efficiently, I suggest you use a circular doubly linked list, so that you
can easily add elements to the tail and also remove elements from the head (or the other way
around, whichever you prefer). Each list element should be a reference to an index element,
that has the pointer to the cached record. The reason for choosing this structure is that
this way you can do all the all things you want to be able to do without corrupting memory:
- searching records in a fast manner
- caching records
- expiring records from cache in a fast manner
- putting reused cached records at the end of the cache list (in a fast manner)
The piece of code that returns a database record from the database layer to the application
layer should make a copy of the database record. This is to prevent memory corruption in
case of cache expiration, and the database records won't be effected in case of a bug in
the higher levels of the application code.
There should also be some routines to quickly convert to and from an ASCII representation of a
record to an in-memory record, since you wouldn't want to store integers and dates as plain ASCII
all the time (now would you?).
purge /reclaim
Deleting records from a database file is easy. Instead of truly removing the record, it should
be marked for deletion. Later on, the space can be freed up by doing a purge action. Purging
is not always easy. If the database records have a fixed size, you should keep a list of deleted
records and then overwrite those records by new records (simply reuse the space).
If the records are not of a fixed size, purging can be a pain, depending on the database file size
and the amount of deleted records. An easy solution would be to take the database offline,
copy all valid records to a new database file, omitting the deleted records, and then move the new
database file over the old one. Finally, restart the database engine.
Taking the database offline is often a problem, especially in our modern 24-hour economy.
If your database is accessible over the Internet, there could be someone out there, somewhere in
the world, at any time of the day who would want to access your database. To accomplish a
24-hour-a-day service and still do purges, the database engine could do a live purge in a
running database. The way to go is to start a new database, and build up a second in-memory
index for this new database. The old database does not become 'frozen', instead both databases
will be active while the purging action is running.
When a user requests a record, first search the new index. If the record is not found,
search the old index. If the record is found, copy it to the new database. If the record was
not found or was deleted in the old database, tell the user that the record was not found.
Meanwhile, while there are no pending user requests, go on with copying records from the old
database to the new database until all old valid records have been transferred to the new database.
While the new database is not yet complete, all delete actions must be done on both databases,
as both copies are active at the same time. When the transfer is complete, all the 'old' deletes
will have been purged. The old database can be closed off, and the database engine can continue
to run with just one active copy.
The same technique can be used to make backups (or rolling snapshots) of active
databases. What they do is create an active copy of the database, and at the point where they
are synchronized, one copy is closed off. It is like running a mirror, and shutting the mirror
down so it can be backed up to tape or moved to some other long term storage. As you can imagine,
if the database is large, it may take quite some time before a snapshot is completed. Also note
that you always need twice the memory and twice the disk space to safely carry out a purge
or snapshot action.
Journalling and rollback
Every computer user has once lost some data due a bad floppy disk, system crash, or power
failure. These malfunctions are particularly bad in case a database update was running.
If database corruption occurs, the application is usually likely to just crash,
and the end user may not able to easily reproduce the valuable data that was in the database.
Therefore, all modern database systems have extra safety mechanisms included to keep the
database from becoming corrupted.
Journalling is a mechanism that ensures the structure of the database file. Suppose
the database engine is doing an update of a record, and then there is a power failure.
When the computer restarts and the database initializes itself again, it should be able to
see that was in the middle of an update event, and then repair the half-written record.
Note that 'repair' is a bit optimistic, in practice it would remove the half-broken entry from
the database file. How does it know it was doing an update before, and how does it know that
the operation did not finish? The answer is simple; before doing the write, it writes
a journalling record to a journal, which is just another file on disk. An operation is not
completed until an end-of-journal marker has been written in the journal. The closed journal
can then be overwritten by a new journal. An example of a journal is:
- event: begin journal => about to write record #1024
- event: writing record #1024, which is 256 bytes in length
- event: write of record #1024 finished sucessfully => end journal
For this technique to work, all write operations must be done synchronously. If the operating system would buffer the writes, one could not possibly know for sure whether a write made it to disk or not. This slows database systems down, but it is also possible to do deferred writing or letting another process do the synchronous writes in the background, while the application keeps on running. Synchronous writes can be done by using the flag O_SYNC wth the open() system call.
Analogous to this, many database systems do transaction logging, which is basically
keeping a journal of all transactions performed by the database system. With journalling,
the data to be written itself is (usually) not logged, just the meta-data of the structures,
but with transaction logging the data can be logged as well. If the data contained in the
transaction is also being logged, one could also include undo operations. This is
called rollback. By keeping a history of all transactions being done, it becomes
possible to do a rollback to a certain point in time. You could even rollback to the point
where you inserted the first record into the database.
If you ever restore the database file from an old backup tape, you can apply the logs to bring
the database back up to date again.
Conclusion
Writing a small and simple database system is not all that hard. Writing a sophisticated and
robust system is quite a lot of work, though. Now that I've shown you how to do it, I'd say,
have fun programming your own database system. You can make things as hard on yourself as you
want to, really, but you will find that once you have started working on something simple,
you will want to have the more advanced features as well. It might be interesting to keep some
live statistics on the performance of your database system, and doing some tuning to get the most
out of it.
While it still may be smarter to use existing database software, I feel it is important to
know exactly how this software works, and knowing how to build it just in case it didn't
exist. Furthermore, it is interesting from a purely computer scientific point of view.