Garbage In, Garbage Out

The saying "garbage in, garbage out" refers to the principle that when you put garbage into any given algorithm, it will produce a meaningless result. For example, by throwing a couple of random numbers into Einstein's formula's of general relativity, you will not get any useful results. This is the chaos principle. It also applies to badly written programs that do not check their input properly, and then produce garbage output based on the garbage input. Anyway, this rant is not about this at all; it is about garbage collection (again).
Not so long ago, I wrote a rant about eliminating memory leaks. The model works, except for a special case that I will mention later on in this rant. I was not completely satisfied because I kept thinking about the reference counters... and it opens up a whole can of worms if you want to do it properly.

Suppose you have a structure that you want to put on two different lists, or even better, on a list and also into a dictionary (hash table, hashed by name). The implementation is such that you actually put references to the allocated structure on the list and into the dictionary. This is nice, since the structure contents have not been copied, they will change in both the list and the dictionary, once you change one of the structure's fields. That's what pointers do for us.

Now, when we wish to free up the memory taken by the list, we can either choose to leave all the pointers to the items in the list alone, or to free them as well. Leaving them alone will probably cause memory leakage, freeing them up poses two problems:

  1. how do we know if we really should free this structure, as it may also be referenced by some other variable, like the dictionary
  2. perhaps the structure consisted of pointers to other allocated memory as well, so just calling free() would still cause a memory leak
The first problem can be solved by using a reference counter. When you add the structure to the list, increment a counter in this structure, saying that it is being referenced by a variable. When you add the structure to a dictionary, increment the reference counter. When you take it out, decrement the counter. The counter acts like a semaphore, once it reaches zero, it is not part of any list or dictionary or whatever.
If you are thinking "when the count drops to zero, you can free up the structure", you are jumping to conclusions and I must urge you to a halt now. Read on.

The second problem is rather important. The only way to solve this problem seems to introduce a destructor function. The destructor is a function that deinitializes the structure and frees up any resources that have been allocated for it. As you may have guessed, we have to get into object oriented programming here. If you are using C++ or another object oriented language, the problem may be less big... but somehow I'm feeling I'm getting deeper into the swamps already.

Now all our structures must begin with:

struct something_t {
    int ref_count;
    void (*destructor)(void *);

    ...
};
But suppose we do not want to put one of our structures in the list, but just a string? That is no longer an option, if we want to do such a thing then we have to declare a string object that has a reference count and a destructor. In we want to insert integer values into the list, we should turn them into integer objects.

When the reference count drops to zero, it does not really imply that we can just destroy the object, it merely means that it is not part of any list. However, you could set a convention that when it is not part of any list, you've obviously lost track of it, and it is time to clean it up. This is how kernels treat their resources.

Hold that thought. We should draw a line here. This solution seems workable to me, apart from the awkwardness of turning regular integers into objects and all the typecasting mess that is going to come from this.


Now I'm going to rant on.

As said, a kernel will clean up his objects when the reference count has dropped to zero. However, in a more general kind of program you will probably have some (local or global) pointer variable still pointing at the structure, and you may still want to use it, and therefore it would be dangerous to trigger such a destructive event automatically. On the other hand, if we do not do it automatically, we are likely to forget to clean it up later, and have a memory leak anyhow.
In a perfect world, the program would "know" you still wanted to use it, and still have a positive reference count. How could it possibly know? I will explain by giving an explanation (i.e, explaining).

Observe the following code fragment:

void func(void) {
object_t *a, *b;

    a = new_object();         a->ref_count++;
    b = a;                    b->ref_count++;

                              b->ref_count--;
                              a->ref_count--;      /* => ref_count is zero */
                              a->destructor(a);
}
On the left we see a C source code of a function that locally allocates a new object. On the right we see what the reference count should do. When we instantiate the object, its reference count is zero, since the space it occupies is not being referenced by anything or anyone. When a is assigned the object, the object is for the first time referenced, and the count should be incremented. Later, b is assigned to a, and the reference count is incremented. When the subroutine ends, both a and b go out of scope, and therefore they can no longer point at the allocated object. Its reference count drops to zero, the destructor is called, and we have succesfully collected the garbage.

There are a couple of questions that need be answered. Like, how do we know that a and b go out of scope? They could have been global variables just the same. This little problem could be solved by declaring a and b as local with a macro definition that adds them to a list of local variables. This list would need to be cleaned up at the end of the subroutine.
An object oriented language would automagically call the destructor when the variable goes out of scope, if a was declared as an object instance rather than a pointer to a dynamically allocated object.
Another question is 'what if we decided to pass a as return value?'. We'd have to add a special case for return values... the reference count would be decremented, but the destructor would not be called if a variable was actually assigned to the returned value in the parent routine. This means that the garbage collector would be called after the subroutine had returned rather than right before returning.
A statement like malloc(256); is not an assignment, how would you detect the leak and clean up the allocated memory? Ehrm, malloc() should be wrapped by a function like void *Malloc(int n) { return malloc(n); }, and now it becomes possible to detect it.
What about a simple statement like i = 6;? Well, since everything is all objects now, it should read ip = new_integer(6);, and the instance that represents the value 6 has a reference count of 1.
When doing the assignment of b = a;, you have to increment the reference count. You are bound to forget to do the increment some time... this should be done automatically. The only way to achieve this in standard C is to either use a macro definition so that it will read let(b, a); or to write an interpreter that does it for us. Both solutions are impractical.
A colleague alerted me that in object oriented languages, it is possible to overload the assignment operator. Using this feature, it becomes possible to do proper reference counting in a more or less transparent way. What is particularly interesting is that when you assign one object to another, you can just have the pointer point at it and increment the reference count. Now, when you change one instance, you should copy it into a new instance (with reference count 1) and make the change there. This is a copy-on-write action, and it is elegant because you do not waste memory on keeping double copies of objects with equal contents.
Mind that even if you are doing the housekeeping cleanly, this does not do the actual collection of garbage for you.

As you may have noticed, doing proper garbage collecting can (almost) only be done by an interpreting runtime system, that keeps track of all of your variables and notices when you are referencing objects.
If we want to have a workable way of fighting memory leaks, we can do a best effort using reference counts. Mind that this kind of code demands that you debug and test drive it to the limit (and beyond), because when you forget to increment or decrement the count just once, the whole thing blows.
Or I guess we could just embrace Python, Perl, PHP, or Java as favorite programming language, which already do garbage collecting for you.

*sighs*


If you really must, you can contact the author at walter at heiho dot net