This is a white paper about CBBX, which stands for Clustered BBX. CBBX is a networked version of BBX that runs on a (possibly large) number of nodes in a cluster. Note that the term cluster here simply means 'a collection of systems'; it has nothing to do with Beowulf clusters or the like.
I first had the idea of making a clustered BBS three to four years ago: if a server
was too far away from the client, then the server had to move closer to the client.
However, closer to one user may be further away from another.
Also, there is a limit on server capacity. Attempts to drive the system over the limit will
certainly not do any good.
These two problems can be solved at once: there should be multiple servers (preferably with
automatic load balancing). The multiple servers should be synchronized at all times, and
therefore they should communicate with each other via a server-to-server protocol.
Several times I came up with a design in which one could add as many servers as one liked
(at run-time), but of course it was never decently implemented. When I took a course on
high availability services and clustered filesystems, the idea of a clustered BBS popped up again.
BBX will provide some networking functions, to be used by the BBS main task, which is run
by the BBX system. BBX itself does not do anything with the cluster; all this should be taken
care of by either the BBS main task or a separate task that manages cluster services.
The name CBBX was chosen only because you need to run BBX on each node, but in fact the
cluster is kept alive by the BBS code, not by the BBX runtime system. In this paper I
nevertheless use the names BBX and BBS, or CBBX and CBBX BBS, interchangeably.
If this confuses you, never mind, just read on.
In CBBX, all data is replicated on each node. This is done for speed; a node always has the
data ready for the client. This also goes for user files, so a user may log in to the system
on any node he likes. In older designs, there were user database nodes or the system would
search for a user file, but that design is not fast.
So, a user logs in on any node (or BBS server) he likes, most likely the closest server
available, because this server gives him fast access. Next, the user presses 'w' for the
who list, and sees all users logged in all over the network. He sees all users because all
data (including the data resident in the memory of CBBX) is replicated among all nodes.
When the user talks to another user, CBBX checks if it's a local or a remote user. If it's
a remote user, this CBBX server will push the message to the other CBBX servers.
If the user enters a message, the message is pushed to the other CBBX servers.
If the user removes a message or changes his configuration or creates a new room or
whatever, this will be pushed over the network to the other CBBX servers.
So, a lot of communication will be going on. However, not every server needs to be connected
to every other server. In other words, the N servers need not form a full mesh.
A server simply pushes received messages on to its neighbours. A server should have more
than one neighbour, in case a link fails or a neighbouring server dies.
As said, the servers communicate with each other via a messaging system. In fact, these
messages are just database transactions. Transactions are actions like add a message,
remove a message, modify a configuration. Transactions are happening in parallel on all
nodes at all times. Therefore it is not possible to use message numbers in the same way as
they are used on a standalone BBS; instead a time stamp should be used and it should be
possible to insert new messages in between if the time stamp is earlier. As a consequence,
time should be synchronized more or less between the nodes. There is no need for NTP
precision (although that is always very nice, of course), but the nodes should give roughly the
same BBS time (no need to sync system clocks!). CBBX should take care of this by time
stamping its transactions.
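
To illustrate the time stamp ordering, here is a minimal sketch in Python. The names Message and Room are made up for this example and are not part of BBX; the tie-breaking on node id and counter is also my own assumption, added so that two messages with the same time stamp end up in the same order on every node.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class Message:
    # Hypothetical record; ordering compares timestamp first, then the
    # node id and counter as tie-breakers (an assumption of this sketch).
    timestamp: float
    node_id: int
    counter: int
    text: str = field(compare=False)


class Room:
    """Keeps messages sorted by BBS time instead of by message number."""

    def __init__(self):
        self.messages = []

    def insert(self, msg):
        # insort places a late-arriving message in between existing ones
        # when its time stamp is earlier than theirs.
        bisect.insort(self.messages, msg)


room = Room()
room.insert(Message(100.0, 1, 1, "first"))
room.insert(Message(105.0, 2, 1, "third"))
room.insert(Message(102.0, 3, 1, "second, but arrived last"))
ordered_texts = [m.text for m in room.messages]
```

The message that arrived last over the network still ends up in the middle of the room, because only its time stamp decides its position.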
The transactions have a unique identifier, which consists of a counter and the node id.
The node id is assigned when a node joins the cluster; each node will know what other nodes
are available.
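
A transaction identifier of this kind could look roughly as follows. This is only a sketch; the class names TransactionId and Node and the method next_transaction_id are hypothetical, and the counter is assumed to be a simple per-node counter that starts at 1.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class TransactionId:
    # The pair (node_id, counter) is unique cluster-wide as long as
    # node ids are unique and each node never reuses a counter value.
    node_id: int
    counter: int


class Node:
    def __init__(self, node_id):
        self.node_id = node_id            # assigned when joining the cluster
        self._counter = itertools.count(1)  # assumed: per-node, starts at 1

    def next_transaction_id(self):
        return TransactionId(self.node_id, next(self._counter))


node_a = Node(1)
node_b = Node(2)
ids = [
    node_a.next_transaction_id(),
    node_a.next_transaction_id(),
    node_b.next_transaction_id(),
]
```

Both nodes hand out counter value 1, yet the identifiers differ because the node id is part of the identifier.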
When a node goes down, the neighbours cannot communicate with it any more. In such a case, the
database of the downed node is no longer synchronized with the database of the others.
When the downed node comes up again and wants to join the cluster, it has to synchronize
the database first. This is done by applying the transaction journals of a neighbour node.
If it has been ages since the node went down, and the transaction journals have been lost
(or rotated), it will have to do a full sync.
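
The choice between replaying a journal and doing a full sync can be sketched like this. Note the simplifications, which are my assumptions and not part of the design: the journal uses a single running sequence number rather than (counter, node id) pairs, the database is a plain dictionary, and the names Node, rotate_journal and resync are invented for the example.

```python
class Node:
    def __init__(self):
        self.database = {}   # key -> value
        self.journal = []    # list of (seq, key, value), oldest first
        self.last_seq = 0    # sequence number of the last applied transaction

    def apply(self, key, value):
        self.last_seq += 1
        self.database[key] = value
        self.journal.append((self.last_seq, key, value))

    def rotate_journal(self, keep):
        # Keep only the newest `keep` journal entries.
        self.journal = self.journal[-keep:] if keep > 0 else []


def resync(rejoining, neighbour):
    """Bring `rejoining` up to date; returns which method was used."""
    if neighbour.journal:
        oldest = neighbour.journal[0][0]
    else:
        oldest = neighbour.last_seq + 1
    if oldest > rejoining.last_seq + 1:
        # The journal no longer reaches back far enough: full sync.
        rejoining.database = dict(neighbour.database)
        rejoining.journal = list(neighbour.journal)
        rejoining.last_seq = neighbour.last_seq
        return "full"
    # Replay only the transactions that were missed.
    for seq, key, value in neighbour.journal:
        if seq > rejoining.last_seq:
            rejoining.database[key] = value
            rejoining.journal.append((seq, key, value))
            rejoining.last_seq = seq
    return "journal"


up = Node()
down = Node()
up.apply("room1/msg1", "hello")
first = resync(down, up)          # short outage: journal replay is enough
for i in range(2, 5):
    up.apply(f"room1/msg{i}", "more")
up.rotate_journal(keep=1)         # old journal entries are gone
second = resync(down, up)         # long outage: falls back to full sync
```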
The nodes should communicate amongst each other about the status of the nodes in the cluster.
Since the nodes in the network have an N to M relation, it will often happen
that a node receives the same transaction more than once. The node should remember the ID
of the last received transaction from each originating node, and if it already received
that transaction, discard it and not propagate it over the cluster.
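
A minimal sketch of pushing transactions to neighbours while discarding duplicates is given below. It assumes that transactions from one originating node arrive in counter order, so that remembering the highest counter per origin is enough. The names Node, receive and link are hypothetical, and the direct method calls stand in for real network links.

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbours = []
        self.last_seen = {}   # origin node id -> highest counter received
        self.applied = []     # transactions applied to the local database

    def receive(self, origin, counter, payload, sender=None):
        # Already received this transaction via another path: discard it
        # and do not propagate it any further.
        if counter <= self.last_seen.get(origin, 0):
            return False
        self.last_seen[origin] = counter
        self.applied.append((origin, counter, payload))
        # Push the transaction on to all neighbours except the sender.
        for neighbour in self.neighbours:
            if neighbour is not sender:
                neighbour.receive(origin, counter, payload, sender=self)
        return True


def link(a, b):
    a.neighbours.append(b)
    b.neighbours.append(a)


# Three nodes in a triangle, so every transaction can arrive via two paths.
n1, n2, n3 = Node(1), Node(2), Node(3)
link(n1, n2)
link(n2, n3)
link(n1, n3)
n1.receive(origin=1, counter=1, payload="add message")
duplicate_accepted = n3.receive(origin=1, counter=1, payload="add message")
```

Each node applies the transaction exactly once, even though the triangle delivers it over more than one path.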
As an optimization, there should be some sort of routing done for express messages.
In order to do good routing, all nodes should keep a map of what the configuration
of the cluster looks like. If a link or node goes down while a message is on
its way, it should be re-routed automatically by the node that currently holds it.
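
One way to do such routing is a breadth-first search over the cluster map, which yields the neighbour on a shortest path. This is a sketch under my own assumptions: the map is a dictionary from node id to the set of neighbouring node ids, every link counts as one hop, and the function names next_hop and remove_link are made up.

```python
from collections import deque


def next_hop(cluster_map, source, destination):
    """Return the neighbour of `source` on a shortest path to
    `destination`, or None if the destination is unreachable."""
    if source == destination:
        return destination
    visited = {source}
    queue = deque()
    for neighbour in cluster_map.get(source, ()):
        visited.add(neighbour)
        queue.append((neighbour, neighbour))   # (node, first hop taken)
    while queue:
        node, first = queue.popleft()
        if node == destination:
            return first
        for nxt in cluster_map.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, first))
    return None


def remove_link(cluster_map, a, b):
    # Called when a node notices that the link between a and b is down.
    cluster_map[a].discard(b)
    cluster_map[b].discard(a)


# Node 1 reaches node 4 in two hops via 2, or in three hops via 3 and 5.
cluster_map = {1: {2, 3}, 2: {1, 4}, 3: {1, 5}, 4: {2, 5}, 5: {3, 4}}
before = next_hop(cluster_map, 1, 4)
remove_link(cluster_map, 2, 4)       # the short route fails
after = next_hop(cluster_map, 1, 4)  # re-routed over the longer path
```

Because every node keeps its own copy of the map, the node holding the message can compute a new next hop by itself as soon as it learns about the failed link.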
A race condition exists for new users. In this type of BBS, users choose their own unique
login name. If a new user is created on a node, and a new user is created on another node
with the exact same login name, we have a collision. We could check the time stamps and
transaction IDs, but a new user will not be happy to find that he cannot have this login
name later on. To avoid this problem, a master node (lowest node ID in the cluster)
should be consulted before giving away the login name.
If the master node goes down, the node with the next lowest ID automatically takes over.
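
The master rule and the login name check can be sketched as follows. The names Node, elect_master and request_login_name are hypothetical, and replication is reduced to a loop over the live nodes; in the real system it would be a transaction pushed over the cluster.

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.login_names = set()   # replicated set of taken login names


def elect_master(alive_nodes):
    # The master is simply the live node with the lowest node id.
    return min(alive_nodes, key=lambda node: node.node_id)


def request_login_name(alive_nodes, name):
    """Ask the master whether `name` is still free; claim it if so."""
    master = elect_master(alive_nodes)
    if name in master.login_names:
        return False
    # Stand-in for replication: record the name on every live node.
    for node in alive_nodes:
        node.login_names.add(name)
    return True


nodes = [Node(1), Node(2), Node(3)]
first_claim = request_login_name(nodes, "alice")    # granted
second_claim = request_login_name(nodes, "alice")   # collision, refused

nodes = [n for n in nodes if n.node_id != 1]        # master goes down
new_master_id = elect_master(nodes).node_id         # next lowest id takes over
third_claim = request_login_name(nodes, "alice")    # still refused
```

Since the taken names are replicated like all other data, the node that takes over as master already knows which login names have been given away.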
In terms of availability, this is a very robust system. There can be
numerous paths between servers, eliminating single points of failure. If a server goes down,
a neighbouring server can choose to just leave it, retry, or connect to another server. If a
client loses its connection due to a server breakdown, it has these options as well.
Another advantage is that this system can do a rolling upgrade; when you upgrade one node
at a time, the whole system can be upgraded without ever going offline completely.
You can take advantage of this even when your cluster consists of only two nodes.
See also: BBX