Pablo Molinero-Fernández, Nick McKeown, Stanford University
We have
studied the routing behavior in the Internet, and its evolution through time. The
goal is to obtain some heuristics that can be used to optimize any soft state
that is maintained in the network. We have done this work without having direct
access to the routing protocol exchanges along the paths that we have observed.
We analyzed packet traces and traceroute measurements from NLANR [1][2]. This method is similar to the one used by Vern
Paxson in 1994 and 1995 [5]. By using similar measurements we can see whether the
instability of the paths in the Internet has changed in recent years, as the
Internet has changed a lot since 1995. Among other things the network has grown
in size, traffic and capacity, and short HTTP bursts have become the dominant
type of traffic.
The reliability and stability of the network are in many ways dictated by the
routing behavior. Analyzing this stability is important for locating and
understanding the causes of some of the problems in the network. It is also
useful when heuristics are needed to design a network mechanism
or protocol. For example, if one wanted to maintain in the routers some
quality-of-service state that is bound to TCP flows, the inactivity timeouts
for releasing this state are determined by the probability of seeing a route
change or a connection failure during the lifetime of a TCP connection.
One way to study routing instabilities is to observe the exchanges of routing
protocol messages between routers, as they reflect the view of the network that
routers have. Between 1997 and 1998 Labovitz, Ahuja and Jahanian [4] studied these routing protocol exchanges and the
incident reports from the network operator. They state that the median mean
time between route changes is between 16 hours and 1.8 days for inter-domain
changes (i.e. route changes spanning several domains) and between 55 and 72
minutes for intra-domain changes. If we consider TCP flows to last between 3
and 7 seconds, a flow will typically have a probability of 0.002-0.012% of
being rerouted for inter-domain routes, and 0.07-0.21% for intra-domain routes.
As for
connectivity failures, they are less common. The same authors report a median
mean time between failures between 8 and 11 days for inter-domain routes, and
between 25 and 45 days for intra-domain routes. The probability of seeing a
failure during a typical TCP flow is around (3.2-10)×10^-6 for inter-domain
routes, and (0.7-3.2)×10^-6 for intra-domain routes.
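The probabilities quoted above are consistent with a simple approximation: the chance that an event falls inside a flow is the flow duration divided by the mean time between events, which holds when flows are much shorter than that mean time. The sketch below reproduces the quoted ranges under that assumption. The function names are hypothetical, and the text does not state that the figures were derived exactly this way.

```python
# Sketch: probability that a route change or failure falls inside one TCP flow,
# assuming flow duration << mean time between events (an assumption, not a
# method stated in the text).

MINUTE = 60.0
HOUR = 60.0 * MINUTE
DAY = 24.0 * HOUR


def event_probability(flow_seconds, mean_time_between_events_seconds):
    """Approximate chance that one event occurs during a single flow."""
    return flow_seconds / mean_time_between_events_seconds


def probability_range(flow_range, gap_range):
    """Return (lowest, highest) probability for a range of flow durations
    and a range of mean times between events, all in seconds."""
    short_flow, long_flow = flow_range
    short_gap, long_gap = gap_range
    lowest = event_probability(short_flow, long_gap)
    highest = event_probability(long_flow, short_gap)
    return lowest, highest


FLOW = (3.0, 7.0)  # TCP flow duration range used in the text, in seconds

cases = {
    "inter-domain change": (16 * HOUR, 1.8 * DAY),
    "intra-domain change": (55 * MINUTE, 72 * MINUTE),
    "inter-domain failure": (8 * DAY, 11 * DAY),
    "intra-domain failure": (25 * DAY, 45 * DAY),
}

for name, gaps in cases.items():
    low, high = probability_range(FLOW, gaps)
    print(f"{name}: {low:.2e} to {high:.2e}")
```

The lowest probability pairs the shortest flow with the longest mean time between events, and the highest pairs the longest flow with the shortest mean time.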
However,
one does not always have access to the routing information that is exchanged between
the nodes in the network. In this case one has to resort to measurements on how
the packets themselves are treated by the network when there is a change in
routing. We used the program traceroute to see which routers are
visited by the packets sent between two end hosts.
Between 1994 and 1995 Vern Paxson [5] performed traceroute measurements to study the
routing behavior of the network. He reports that about 2/3 of the routes remain
unchanged for days or weeks. Paxson also found that in 1995 (1994)
0.44% (0.16%) of the measurements saw a route change, whereas 2.7% (1.2%) of
the measurements had route failures. These figures are higher than those of
Labovitz et al. because traceroute reflects route changes in every domain that it
traverses: the longer the path, the more likely it is to be rerouted. In the case
of route failures, the measurements do not distinguish between a connectivity
failure and one of the end hosts being down.
The network has changed a lot in the past five years. HTTP has become the dominant type of
traffic in the Internet. This has led to smaller flows, also known as mice,
which do not last long enough for congestion control to operate efficiently.
Since 1995, the Internet has become more pervasive; it has grown in number of
users and in backbone capacity. Finally, access networks are much faster
now than 5 years ago.
For these
reasons we have revisited the results from Vern Paxson, to see if there have
been any major changes since then. We used traceroute measurements that are
available at NLANR [2].
The first thing to compare between Paxson's study and ours is the number of hops. Paxson
used quite a few nodes outside of the US; as a result, his average hop
count is 15.9, with the 80th percentile at 20.0 hops. In our study, most
hosts were in research institutions in the US, so the average hop count is
11.9, with the 80th percentile at 15.0 hops. Another difference is that Paxson
used exponential sampling, whereas we have used periodic sampling. One can see
our method as oversampling: if we kept only a subset of our samples whose spacing is
exponentially distributed, we could discard the rest.
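The oversampling remark can be illustrated with a small thinning routine: draw exponentially distributed gaps and keep the periodic sample whose 10-minute slot contains each drawn instant. This is only a sketch of the idea. The function name, the one-hour mean gap, and the seed are assumptions chosen for illustration, not values from either study.

```python
import random


def thin_to_exponential(num_samples, period_s, mean_gap_s, seed=0):
    """Pick indices of periodic samples so that the gaps between the kept
    samples are approximately exponentially distributed.

    num_samples: number of periodic samples available
    period_s:    spacing of the periodic samples, in seconds
    mean_gap_s:  desired mean gap between kept samples (hypothetical value)
    """
    rng = random.Random(seed)
    horizon = num_samples * period_s
    kept = []
    last = -1
    t = 0.0
    while True:
        # expovariate takes the rate, i.e. 1 / mean
        t += rng.expovariate(1.0 / mean_gap_s)
        if t >= horizon:
            break
        idx = int(t // period_s)  # periodic slot that contains instant t
        if idx > last:            # ignore draws landing in an already kept slot
            kept.append(idx)
            last = idx
    return kept


# One month of 10-minute samples, thinned to a mean gap of one hour (assumed).
indices = thin_to_exponential(num_samples=31 * 24 * 6, period_s=600, mean_gap_s=3600)
print(len(indices), "of", 31 * 24 * 6, "samples kept")
```

Because kept instants are rounded to the periodic grid, the resulting gaps are only approximately exponential, and the approximation degrades as the mean gap approaches the sampling period.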
We analyzed
the measurements of the 440 paths between 22 sites during the whole month of
October 2000. We had over 4300 measurements per path. Because of the long
interval between the samples (10 mins.), we can assume that the successive
traceroutes are independent, and that any transient effects have stabilized by
the next measurement. We developed our own tools [3] to visualize and categorize the changes that happened
in the route between two measurement points, as the following figure shows.

Figure 1. Summary of all paths taken by the traceroute measurements
The numbers in black on the
top represent the last byte of the IP address of the router. A number like X14X
means that there was no answer from the router that is 14 hops away. The
numbers in blue at the bottom represent the percentage (or per mille) of routes
that traverse that particular router. The thickness of the line is proportional
to that percentage.
The changes are easiest to observe by graphing the time evolution of the
traceroute measurements, as shown in the figure below. When consecutive samples take the same path they are
merged and the line is thickened. When there is a change in the route, the new path
is plotted.
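The merging of consecutive samples that follow the same path amounts to a run-length grouping of the sample sequence. The sketch below shows one way to do it; it is not the implementation of the tools in [3], and the data layout (a list of timestamp and path pairs) and the router labels are assumptions made for illustration.

```python
from itertools import groupby


def merge_consecutive(samples):
    """Group consecutive samples that took the same path.

    samples: list of (timestamp, path) pairs in time order, where path is a
             tuple of router identifiers (hypothetical representation).
    Returns a list of (path, first_timestamp, last_timestamp, count) tuples.
    """
    segments = []
    for path, run in groupby(samples, key=lambda sample: sample[1]):
        run = list(run)
        segments.append((path, run[0][0], run[-1][0], len(run)))
    return segments


def count_route_changes(segments):
    """Each boundary between two segments is at least one route change."""
    return max(len(segments) - 1, 0)


# Hypothetical samples, 10 minutes (600 s) apart
route_a = ("r1", "r2", "r3")
route_b = ("r1", "r9", "r3")
samples = [(0, route_a), (600, route_a), (1200, route_b), (1800, route_a)]

for path, first, last, count in merge_consecutive(samples):
    print(path, first, last, count)
```

The count of each segment corresponds to the line thickness in the graph, and a boundary between segments is a lower bound on the number of changes, since sampling is not continuous.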

Figure 2. Graph showing the evolution with time of the measurements
In this example, 898 traceroute samples are considered. The first 223 measurements (between Oct 1 at
0:03 and Oct 2 at 13:03, every 10 minutes) took the same route, going from
Stanford University to University of Colorado at Boulder through CalRen2 and
Abilene. Then between 13:03 and 13:26 there was at least one route change,
after which the path goes through bbnplanet and qwest. Note that we cannot know
whether there has been more than one change, because we are not doing
continuous sampling, and for some time no packets were sent from Stanford to Colorado.
Then there is another route
change before 13:30, and after that there are 168 samples using the same path
as at the beginning. Between 17:23 and 17:53 the probe arrives at U. Colorado,
but it does not get an answer from the destination for three consecutive
measurements. At 18:03 the answer returns, meaning that the destination came back up. Note
that the destination is the only node that does not have to respond to an
expired TTL, since the probe is a packet directly addressed to it.
At 18:13 the route returns
to the original one, and it stays there for another 423 samples. Before 16:43
there is still one change inside of the Stanford domain, and then at 16:53 it
returns to the original path, where it stays until the end.
Our tool can generate these
and other graphs, and it can automate the analysis of the trace. For more
information, please visit http://klamath.stanford.edu/tools/Traceroute/index.html
| | average | median | 80th percentile | 90th percentile |
|---|---|---|---|---|
| Frequency of a route change in the 10-min period | 1.73% | 0.72% | 1.82% | 4.79% |
| Frequency of a route change DURING the traceroute | 0.50% | 0.02% | 0.21% | 1.14% |
| Frequency of a route failure (transient or longer) | 0.84% | 0.00% | 0.20% | 1.58% |
Table 1. Summary of probabilities of route changes and failures observed with traceroutes over a one-month period
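Summary statistics of this kind can be computed from one frequency value per path. The sketch below assumes that the columns of Table 1 are taken across paths, which the text suggests but does not state, and it uses the nearest-rank percentile definition as an arbitrary choice. The input values are made up for illustration and are not measurement data.

```python
import math
import statistics


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the values are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(per_path_frequencies):
    """Average, median, and upper percentiles of per-path event frequencies."""
    return {
        "average": statistics.mean(per_path_frequencies),
        "median": statistics.median(per_path_frequencies),
        "p80": nearest_rank_percentile(per_path_frequencies, 80),
        "p90": nearest_rank_percentile(per_path_frequencies, 90),
    }


# Hypothetical per-path route-change frequencies (fraction of samples)
example = [0.0, 0.0, 0.001, 0.002, 0.005, 0.007, 0.01, 0.02, 0.05, 0.2]
summary = summarize(example)
for name, value in summary.items():
    print(f"{name}: {value:.2%}")
```

An average well above the median, as in this made-up input and in every row of Table 1, indicates that a few paths contribute a large share of the events.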
We can see heavy tails in the distributions. This is because a few routes
repeatedly show a rather high number of route changes. These are usually
produced by routing flaps between two routers inside the same autonomous system
(AS), and they do not get propagated outside that AS. We presume that they are caused
by some load balancing mechanism. It is interesting to observe that this type
of change would go unnoticed most of the time if we looked for variations in
the TTL of the TCP connections.
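A plausible reason why TTL variations miss these changes is that a flap between two routers in the same AS tends to swap one hop for another without altering the hop count, so the TTL seen at the receiver stays the same. This reasoning is an assumption, not something the measurements state. The sketch below classifies a change under that assumption, using hypothetical router labels.

```python
def change_visible_via_ttl(old_path, new_path):
    """A receiver watching only the TTL can notice a change only when the
    number of hops differs between the two paths."""
    return len(old_path) != len(new_path)


def classify_change(old_path, new_path):
    """Label a pair of consecutive paths."""
    if old_path == new_path:
        return "no change"
    if change_visible_via_ttl(old_path, new_path):
        return "visible in TTL"
    return "hidden from TTL"


# Hypothetical paths: a flap between two routers of one AS keeps the hop count
original = ("gw", "core-a", "border", "dest")
flapped = ("gw", "core-b", "border", "dest")
detour = ("gw", "core-a", "transit-1", "transit-2", "dest")

print(classify_change(original, flapped))
print(classify_change(original, detour))
```

Traceroute does not have this blind spot because it records the identity of each hop rather than only the path length.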
A
traceroute measurement takes between 3 s and 12 s, which is slightly higher
than the average TCP flow duration. On the other hand very few flows are going
to be longer than 10 minutes. In other words these measurements give us an upper
bound on how likely a "common" and a "long-lived" TCP flow
(lasting for a small number of seconds and minutes respectively) are to observe
a route problem. So we can conclude that the probability of having a route
change during the lifetime of a TCP flow is on average smaller than 0.50% for
the common case, and less than 1.73% for long flows.
As we can see, there is only a small increase, from 0.44% to 0.50%, in the probability of
seeing a route change as compared to the situation in 1995. The actual increase
in the instability of the network may be somewhat larger, if we
consider that the measurements of 1995 covered longer and, therefore, more
complex routes. Nevertheless, this is a small change if we take into account the
growth rate of the network in size and traffic.
We have studied the routing behavior in the Internet using different end-to-end measurements
on packets. For this we have developed our own tools, which can be found in [3]. We have compared the results to what has been found
previously by other researchers. We have found that the instability of the Internet has
increased only slightly, which can be surprising given its increasing complexity. The
probability of seeing a problem along a route is fairly small (less than
0.5% on average), but it is not negligible; therefore all elements and
protocols in the network should be made resilient to such problems.
Traceroute
measurements are not a perfect tool, but they are the best we can do from the
end hosts. Traceroute has problems of its own: it does not perform a continuous
sampling and some routers do not provide information if the network is overloaded.
Still, it is probably the best tool to study end-to-end routing behavior,
because it keeps track of how packets are treated at each hop on the path to
the destination.
[1] NLANR network traffic packet header traces, http://moat.nlanr.net/Traces/
[2] NLANR traceroute measurements, http://amp.nlanr.net/Active/raw_data/cgi-bin/data_form.cgi
[3] Traceroute analysis tools, http://klamath.stanford.edu/tools/Traceroute/
[4] C. Labovitz, A. Ahuja, F. Jahanian. "Experimental Study of Internet Stability and Wide-Area Network Failures". In the Proceedings of FTCS99, Madison, WI, June 22, 1999.