The saying "garbage in, garbage out" refers to the principle that when
you put garbage into any given algorithm, it will produce a meaningless
result. For example, by throwing a couple of random numbers into Einstein's
formula's of general relativity, you will not get any useful results.
This is the chaos principle. It also applies to badly written programs
that do not check their input properly, and then produce garbage output
based on the garbage input. Anyway, this rant is not about this at all;
it is about garbage collection (again).
Not so long ago, I wrote a rant about eliminating memory leaks. The model
works, except for a special case that I will mention later on in this rant.
I was not completely satisfied because I kept thinking about the reference
counters... and it opens up a whole can of worms if you want to do it
properly.
Suppose you have a structure that you want to put on two different lists, or even better, on a list and also into a dictionary (hash table, hashed by name). The implementation is such that you actually put references to the allocated structure on the list and into the dictionary. This is nice, since the structure contents have not been copied, they will change in both the list and the dictionary, once you change one of the structure's fields. That's what pointers do for us.
Now, when we wish to free up the memory taken by the list, we can either choose to leave all the pointers to the items in the list alone, or to free them as well. Leaving them alone will probably cause memory leakage, freeing them up poses two problems:
The second problem is rather important. The only way to solve this problem seems to introduce a destructor function. The destructor is a function that deinitializes the structure and frees up any resources that have been allocated for it. As you may have guessed, we have to get into object oriented programming here. If you are using C++ or another object oriented language, the problem may be less big... but somehow I'm feeling I'm getting deeper into the swamps already.
Now all our structures must begin with:
struct something_t {
int ref_count;
void (*destructor)(void *);
...
};
But suppose we do not want to put one of our structures in the list,
but just a string? That is no longer an option, if we want to do such
a thing then we have to declare a string object that has a reference
count and a destructor. In we want to insert integer values into the
list, we should turn them into integer objects.
When the reference count drops to zero, it does not really imply that we can just destroy the object, it merely means that it is not part of any list. However, you could set a convention that when it is not part of any list, you've obviously lost track of it, and it is time to clean it up. This is how kernels treat their resources.
Hold that thought. We should draw a line here. This solution seems
workable to me, apart from the awkwardness of turning regular integers
into objects and all the typecasting mess that is going to come from this.
Now I'm going to rant on.
As said, a kernel will clean up his objects when the reference count has
dropped to zero. However, in a more general kind of program you will
probably have some (local or global) pointer variable still pointing
at the structure, and you may still want to use it, and therefore it
would be dangerous to trigger such a destructive event automatically.
On the other hand, if we do not do it automatically, we are likely to
forget to clean it up later, and have a memory leak anyhow.
In a perfect world, the program would "know" you still wanted to use it,
and still have a positive reference count. How could it possibly know?
I will explain by giving an explanation (i.e, explaining).
Observe the following code fragment:
void func(void) {
object_t *a, *b;
a = new_object(); a->ref_count++;
b = a; b->ref_count++;
b->ref_count--;
a->ref_count--; /* => ref_count is zero */
a->destructor(a);
}
On the left we see a C source code of a function that locally
allocates a new object. On the right we see what the reference count
should do. When we instantiate the object, its reference count is zero,
since the space it occupies is not being referenced by anything or
anyone. When a is assigned the object, the object is for
the first time referenced, and the count should be incremented.
Later, b is assigned to a, and the reference count
is incremented. When the subroutine ends, both a and b
go out of scope, and therefore they can no longer point at the allocated
object. Its reference count drops to zero, the destructor is called,
and we have succesfully collected the garbage.
There are a couple of questions that need be answered. Like, how do we
know that a and b go out of scope? They could have
been global variables just the same. This little problem could be solved
by declaring a and b as local with a macro definition
that adds them to a list of local variables. This list would need to be
cleaned up at the end of the subroutine.
An object oriented language would automagically call the destructor when
the variable goes out of scope, if a was declared as an
object instance rather than a pointer to a dynamically allocated object.
Another question is 'what if we decided to pass a as return value?'.
We'd have to add a special case for return values... the reference count
would be decremented, but the destructor would not be called if a variable
was actually assigned to the returned value in the parent routine. This
means that the garbage collector would be called after the subroutine
had returned rather than right before returning.
A statement like malloc(256); is not an assignment, how would you
detect the leak and clean up the allocated memory? Ehrm, malloc()
should be wrapped by a function like void *Malloc(int n) {
return malloc(n); }, and now it becomes possible to detect it.
What about a simple statement like i = 6;? Well, since everything
is all objects now, it should read ip = new_integer(6);, and the
instance that represents the value 6 has a reference count of 1.
When doing the assignment of b = a;, you have to increment the
reference count. You are bound to forget to do the increment some time...
this should be done automatically. The only way to achieve this in standard
C is to either use a macro definition so that it will read let(b, a);
or to write an interpreter that does it for us. Both solutions are
impractical.
A colleague alerted me that in object oriented languages,
it is possible to overload the assignment operator. Using this feature,
it becomes possible to do proper reference counting in a more or less
transparent way. What is particularly interesting is that when you assign
one object to another, you can just have the pointer point at it and increment
the reference count. Now, when you change one instance, you should
copy it into a new instance (with reference count 1) and make the change there.
This is a copy-on-write action, and it is elegant because you do not waste
memory on keeping double copies of objects with equal contents.
Mind that even if you are doing the housekeeping cleanly, this does not do
the actual collection of garbage for you.
As you may have noticed, doing proper garbage collecting can (almost) only
be done by an interpreting runtime system, that keeps track of all
of your variables and notices when you are referencing objects.
If we want to have a workable way of fighting memory leaks, we can
do a best effort using reference counts. Mind that this kind of code
demands that you debug and test drive it to the limit (and beyond),
because when you forget to increment or decrement the count just once,
the whole thing blows.
Or I guess we could just embrace Python, Perl, PHP, or Java as favorite
programming language, which already do garbage collecting for you.
*sighs*