| The Gnutella Network and why it sucks |
Okay, first of all, the Gnutella works perfectly -- in a sense that we've all
been able to download illegal mp3z and tons of porn. When I first read the
Gnutella network protocol specification, I was hooked. However, I'm having
serious second thoughts on the protocol. On the technical side, the Gnutella
network kinda sucks. The problem is in the very core of the net, the protocol.
And because of that, the Gnutella net is not likely to be improved soon.
Note that in a topology unaware network, there is no need for a PING and
PONG message. I guess it has been implemented because people want to know
how many servents are out there and how many bytes of data are available.
To help out users with a slow dial-up connection, the Gnutella backbone was
invented. The backbone consists of a few hosts acting as host cache. In
practice, everyone just connects to a host cache a let him to the job. As a
result, the Gnutella network is quickly turning into a small network with
a few large servers rather than featuring a chaotic structure.
Secondly, we want to get rid of having to keep all types of history. This
begins with stating what a unique message id should be: it should be a
combination of IP number, port number, servent startup time, and a message
counter. This can be encoded in 12 bytes (so it is even smaller than the
currently used 16 byte identifier). The IP and port number define the
instance of the servent on a host, the startup time ensures we see the
difference if a servent reboots, and the message counter is just a counter
that increments with each message sent. When this counter reaches maxint,
the servent startup time should be either reset or incremented.
Message routing now works like this: the servent only remembers the last
message it has seen on a specific port. If the counter of a newly arrived
message is lower than it remembers, it should drop the message. Otherwise,
it should remember this one and forward it to all ports.
As a third improvement, we also do not want to keep a history of on which port
we saw a specific message. Instead, the routing information should be kept
in the message header itself. The message header should be a stack of message
headers, and each servent adds his id (IP and port number) to this stack.
When replying to a message, it removes his own id and sends the message to
the port which has the id that is now on top of the message header stack.
A technique just like this is actually being used in the IP protocol stack.
As a result, no servent needs to keep a history.
As a fourth recommendation, loose the PING message. It is small so it is
probably not wasting much bandwidth, but surely there are better ways to do
this. Since a servent knows if it already has seen a message or not, we can
let a message with a grand total go round and let everybody update the grand
total. This way we automatically have periodic updates on what the total size
of the network is. To keep the net from flooding like it does with PING,
this message should be initiated only if (and only if) the servent hasn't seen
a message describing a total for the past 5 minutes. When the message has
reached it's TTL or when a servent has no one else to forward the message to,
a new message of a different type should be broadcasted 'backwards' so that
everyone gets the latest updated value of the grand total. This value is not
updated anymore, after a while it dies out and when the timeout is reached
someone will send out a new 'ping' message to demand a recount of the grand
total. This works best if the servents wait to collect totals from a number of
ports, instead of immediately forwarding the maximum to everyone.
Summarizing the proposed improvements:The Gnutella Network
First a summary of the Gnutella protocol
specification.
The Gnutella network operates by forwarding messages to all connected clients,
or so to speak, servents. Each client in the topology unaware network
is connected to a few other clients, and is thus a server at the same time.
So servent A sends his message to connected servents B, C, and D, and they
will each send the message to any servent that they are connected to.
To prevent messages from going around and around all the time, a hop counter
is kept in the message and when it reaches a certain amount (actually, when
the Time To Live or TTL reaches zero) the message is dropped and taken off
the network.
This was not good enough, so we also keep a history of messages that we've
already seen. A servent may think up his own (supposedly) globally unique
identifier to identify a message. If the receiving servent has already seen
a message with this unique id, it will drop the message because there is no
use in forwarding messages you've already forwarded yourself before.
If a servent wishes to reply to a message, it creates a message with the same
id, but with another payload descriptor (which is like a message type
identifier) and routes the message back through the port where it got the
original message from. In order for routing to work, the servents along the
route must keep a history of where they got each message from.
In order to be able to work with firewalls, there is also a kludge with
pushing messages and using a servent identifier to identify the servent that
this message is meant for.
When a download has to take place, a connection is made on which the HTTP
protocol is used to retrieve a file.
Why it sucks
I have a number of problems with this protocol:
The protocol is very compact (but inconsistent here and there), everything is
encoded in a few bytes. But in the end, however, we resort using the elaborate
HTTP ascii protocol. I guess the initial developer had some HTTP download code
hanging around or something.
The network breaks if the same ids show up. Why not use IP and port number of
the servent as base for the id? Most servent implementations do, but again,
there are no set rules to generate a message id.
On the other hand, HTTP is cool because now we can connect to a servent using
a web browser. In practice, it sucks, because most servents just have all their
links point to a site where you can download their servent binary
(which is very irritating).
Also, if the servent rescans his shared directories just before someone
starts a download, he will not be able to get the file because the file
identification number has changed. This does not happen a lot, but it does
happen sometimes.
If a servent disconnects and re-connects, it gets a brand new servent id and
it has lost it's history. This has consequences for users that were going
to download our files.
Technically speaking, the PING and PONG are useless and only wasting bandwidth.
A PING message generates thousands and thousands of PONG messages, and because
a PING is routed to everyone, one PING message effectively makes everybody
ping everybody. Not to mention the fact that some servent implementations do
periodic pinging.
Proposal for making things better
First of all, the protocol should be consistent.
If a servent was rebooted, the time signature won't be the same so if the
message counter is lower, but the time stamp is greater, then the message
can also be forwarded to all ports.
The route breaks when an intermediate servent disconnects, but this is
also the case in the current version of the Gnutella network protocol.
One could argue that tampering with messages and routes is now possible, but
this is also the case in the current version of the Gnutella network protocol.
This is different from the current PING and PONG implementation, with PING you
broadcast to everyone and get a flood of routed replies back. With the
technique described here, the reply is also broadcasted so that everyone
can enjoy the updated grand total. It is true that you may receive all
kinds of different totals, you should pick the highest value, lower values
should be dropped. It still costs a lot of bandwidth, but at least you are not
routing all PONGs back to everybody (since everybody pings). Instead, you are
participating in a continuous global broadcast.
Conclusion
The Gnutella network is very nice, but things can be done better. It is hard
to change an existing protocol, especially when you already have thousands of
users. However, using the name 'Gnutella', it should be possible to urge the
world to go use the latest improved version of this network.
The gains are that servents can be very small and efficient programs and need
not consume a lot of memory.