Study Of Routing Behavior Through Traffic Analysis and Traceroute Measurements

Pablo Molinero-Fernández, Nick McKeown, Stanford University

Abstract

We have studied routing behavior in the Internet and its evolution over time. The goal is to obtain heuristics that can be used to optimize any soft state that is maintained in the network. We did this work without direct access to the routing protocol exchanges along the paths that we observed. Instead, we analyzed packet traces and traceroute measurements from NLANR [1][2]. This method is similar to the one used by Vern Paxson in 1994 and 1995 [5]. Using similar measurements lets us see whether the instability of Internet paths has changed in recent years, given how much the Internet has changed since 1995: among other things, the network has grown in size, traffic and capacity, and short HTTP bursts have become the dominant type of traffic.

Introduction

The reliability and stability of the network are in many ways dictated by the routing behavior. Analyzing this stability is important for locating and understanding the cause of some of the problems in the network. It is also useful when heuristics are needed to design a network mechanism or protocol. For example, if one wanted to maintain in the routers some quality-of-service state that is bound to TCP flows, the inactivity timeouts for releasing this state would be determined by the probability of seeing a route change or a connection failure during the lifetime of a TCP connection.

One way to study routing instabilities is to observe the exchanges of routing protocol messages between routers, as they reflect the view of the network that the routers have. Between 1997 and 1998 Labovitz, Ahuja and Jahanian [4] studied these routing protocol exchanges and the incident reports from the network operator. They state that the median mean time between route changes is between 16 hours and 1.8 days for inter-domain changes (i.e. route changes spanning several domains) and between 55 and 72 minutes for intra-domain changes. If we consider TCP flows to last between 3 and 7 seconds, a flow will typically have a probability of 0.002-0.012% of being rerouted for inter-domain routes, and 0.07-0.21% for intra-domain routes.

Connectivity failures are less common. The same authors report a median mean time between failures of between 8 and 11 days for inter-domain routes, and between 25 and 45 days for intra-domain routes. The probability of seeing a failure during a typical TCP flow will be around 3.2*10^-6 to 10*10^-6 for inter-domain routes, and around 0.7*10^-6 to 3.2*10^-6 for intra-domain routes.
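The probabilities quoted above follow from dividing the flow duration by the mean time between events, which is a reasonable approximation as long as the flow is much shorter than that interval. A minimal Python sketch of this arithmetic is given here; the function and constant names are ours and purely illustrative.

```python
# Approximate probability that an event (route change or failure) falls inside
# a flow, assuming events occur at a constant rate and the flow is much shorter
# than the mean time between events: P ~ flow_duration / mean_time_between.

MINUTE = 60.0
HOUR = 60 * MINUTE
DAY = 24 * HOUR


def event_probability(flow_seconds, mean_time_between_seconds):
    """Return the approximate probability of one event during the flow."""
    return flow_seconds / mean_time_between_seconds


def probability_range(flow_range, interval_range):
    """Lowest and highest probability over the given ranges (in seconds)."""
    shortest_flow, longest_flow = flow_range
    shortest_interval, longest_interval = interval_range
    return (
        event_probability(shortest_flow, longest_interval),
        event_probability(longest_flow, shortest_interval),
    )


FLOW = (3.0, 7.0)  # TCP flow duration in seconds, as assumed in the text

# Mean times between events as quoted from [4] in the text.
CASES = {
    "inter-domain route change": (16 * HOUR, 1.8 * DAY),
    "intra-domain route change": (55 * MINUTE, 72 * MINUTE),
    "inter-domain failure": (8 * DAY, 11 * DAY),
    "intra-domain failure": (25 * DAY, 45 * DAY),
}

if __name__ == "__main__":
    for name, intervals in CASES.items():
        low, high = probability_range(FLOW, intervals)
        print(f"{name}: {low:.2e} to {high:.2e}")
```

The shortest flow is paired with the longest interval, and the longest flow with the shortest interval, so each printed pair brackets the range of probabilities.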

However, one does not always have access to the routing information that is exchanged between the nodes in the network. In this case one has to resort to measurements on how the packets themselves are treated by the network when there is a change in routing. We used the program traceroute to see what routers are being visited by the packets sent between two end hosts.

Between 1994 and 1995 Vern Paxson [5] performed traceroute measurements to study the routing behavior of the network. He reports that about 2/3 of the routes remain unchanged for days or weeks. Paxson also found that in 1995 (1994) 0.44% (0.16%) of the measurements saw a route change, whereas 2.7% (1.2%) of the measurements had route failures. The reason for these higher figures is that traceroute also reflects route changes along all domains that it traverses. The longer the path, the more likely it is to be rerouted. In the case of route failures, the measurements do not distinguish between a connectivity failure and one of the end hosts being down.

Traceroute Measurements

The network has changed a lot in the past five years. HTTP has become the dominant type of traffic in the Internet. This has led to smaller flows, also known as mice, which do not last long enough to perform congestion control efficiently. Since 1995, the Internet has become more pervasive; it has grown in number of users and in backbone capacity. Finally, access networks are much faster now than five years ago.

For these reasons we have revisited the results from Vern Paxson, to see if there have been any major changes since then. We used traceroute measurements that are available at NLANR [2].

The first thing to compare between Paxson's study and ours is the number of hops. Paxson used quite a few nodes outside of the US; as a result, his average hop count is 15.9, with the 80th percentile at 20.0 hops. In our study, most hosts were in research institutions in the US, so the average hop count is 11.9, with the 80th percentile at 15.0 hops. Another difference is that Paxson used exponential sampling, whereas we used periodic sampling. One can see our method as oversampling: if we kept only the samples that fall at exponentially distributed intervals, we could discard the rest.
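The oversampling remark can be illustrated by thinning a periodic series: draw exponentially distributed gaps and keep, for each drawn instant, the first periodic sample at or after it. The following Python sketch shows only the idea. The function name, the mean gap and the seed are hypothetical, and this is not claimed to be the procedure used in [5].

```python
import bisect
import random


def thin_to_exponential(sample_times, mean_gap, seed=0):
    """Pick a subset of periodic sample times whose spacing approximates
    exponentially distributed gaps with the given mean (same time unit).

    For each randomly drawn instant, the first periodic sample at or after it
    is kept; a sample is never selected twice.
    """
    rng = random.Random(seed)
    chosen = []
    if not sample_times:
        return chosen
    t = sample_times[0]
    last_index = -1
    while True:
        t += rng.expovariate(1.0 / mean_gap)  # exponential gap, mean = mean_gap
        i = bisect.bisect_left(sample_times, t)  # first sample at or after t
        if i >= len(sample_times):
            break
        if i > last_index:
            chosen.append(sample_times[i])
            last_index = i
    return chosen


if __name__ == "__main__":
    # One day of periodic samples, one every 10 minutes (times in minutes).
    periodic = [10 * k for k in range(144)]
    subset = thin_to_exponential(periodic, mean_gap=60.0)
    print(len(subset), "of", len(periodic), "samples kept")
```

Because the kept instants are rounded to the 10-minute grid, the resulting gaps are only approximately exponential; the approximation improves as the mean gap grows relative to the sampling period.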

We analyzed the measurements of the 440 paths between 22 sites during the whole month of October 2000. We had over 4300 measurements per path. Because of the long interval between the samples (10 mins.), we can assume that the successive traceroutes are independent, and that any transient effects have stabilized by the next measurement. We developed our own tools [3] to visualize and categorize the changes that happened in the route between two measurement points, as the following figure shows.


Figure 1.- Summary of all paths taken by the traceroute measurements

The numbers in black on the top represent the last byte of the IP address of the router. A number like X14X means that there was no answer from the router that is 14 hops away. The numbers in blue at the bottom represent the percentage (or per mille) of routes that traverse that particular router. The thickness of the line is proportional to that percentage.
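The labelling convention described above can be written down in a few lines. This Python sketch is our own illustration, not the code of the tool in [3]; it assumes IPv4 addresses in dotted notation and uses None for a hop that did not answer.

```python
def hop_label(hop_number, address):
    """Label for one hop: the last byte of the router's IPv4 address, or
    'X<hop>X' when no router answered at that hop (address is None)."""
    if address is None:
        return f"X{hop_number}X"
    return address.rsplit(".", 1)[-1]


def path_labels(addresses):
    """Labels for a whole path; hops are numbered starting at 1."""
    return [hop_label(n, a) for n, a in enumerate(addresses, start=1)]


if __name__ == "__main__":
    # Documentation-range addresses, made up for illustration.
    print(path_labels(["192.0.2.1", None, "198.51.100.254"]))
```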

What will really allow us to observe what is happening is the graphing of the time evolution of the traceroute measurement as shown in the figure below. When consecutive samples take the same path they are merged and the line is thickened. When there is a change in the route, the new path is plotted.


Figure 2.- Graph showing the evolution with time of the measurements

In this example, there are 898 traceroute samples considered. The first 223 measurements (between Oct 1 at 0:03 and Oct 2 at 13:03, every 10 minutes) took the same route, going from Stanford University to University of Colorado at Boulder through CalRen2 and Abilene. Then between 13:03 and 13:26 there was at least one route change, after which the path goes through bbnplanet and qwest. Note that we cannot know whether there has been more than one change, because we are not sampling continuously, and for some time no probes were sent from Stanford to Colorado.

Then there is another route change before 13:30, and after that there are 168 samples using the same path as at the beginning. Between 17:23 and 17:53 the probe arrives at U. Colorado, but it does not get an answer from the destination for three consecutive measurements. At 18:03 the answer returns, meaning that the destination came back up. Note that the destination is the only node that does not have to answer to an expired TTL, since the probe is a packet directly addressed to it.

At 18:13 the route returns to the original one, and it stays there for another 423 samples. Before 16:43 there is still one change inside of the Stanford domain, and then at 16:53 it returns to the original path, where it stays until the end.
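The merging of consecutive samples that take the same path, and the resulting windows in which a change must have occurred, can be sketched in Python as follows. This is a simplified illustration under our own assumptions (a path is a tuple of router identifiers, timestamps are plain numbers); it is not the code of the tool itself.

```python
from itertools import groupby


def merge_runs(samples):
    """Collapse consecutive samples that took the same path.

    samples: list of (timestamp, path) pairs in time order, where path is a
    tuple of router identifiers. Returns a list of
    (path, first_timestamp, last_timestamp, sample_count) tuples.
    """
    runs = []
    for path, group in groupby(samples, key=lambda s: s[1]):
        group = list(group)
        runs.append((path, group[0][0], group[-1][0], len(group)))
    return runs


def change_windows(runs):
    """Time windows in which at least one route change must have happened:
    from the last sample of one run to the first sample of the next."""
    return [(a[2], b[1]) for a, b in zip(runs, runs[1:])]


if __name__ == "__main__":
    # Made-up samples, one every 10 minutes, over two alternative paths.
    samples = [
        (0, ("a", "b")),
        (10, ("a", "b")),
        (20, ("a", "c")),
        (30, ("a", "b")),
    ]
    runs = merge_runs(samples)
    for run in runs:
        print(run)
    print(change_windows(runs))
```

Each window only bounds when a change happened. As noted above, periodic sampling cannot tell whether more than one change took place inside a window.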

Our tool can generate these and other graphs, and it can automate the analysis of the trace. For more information, please visit http://klamath.stanford.edu/tools/Traceroute/index.html

 

Event                                                                                       average   median   80th percentile   90th percentile
Frequency of a route change in the 10-min period BETWEEN measurements                       1.73%     0.72%    1.82%             4.79%
Frequency of a route change DURING the traceroute measurement                               0.50%     0.02%    0.21%             1.14%
Frequency of a route failure (transient or longer term) DURING the traceroute measurement   0.84%     0.00%    0.20%             1.58%

Table 1. Summary of probabilities of route changes and failures observed with traceroutes over a one-month period

The distribution has heavy tails because a few routes repeatedly show a rather high number of route changes. These are usually produced by routing flaps between two routers inside the same autonomous system (AS), and they do not get propagated outside that AS. We presume that they are caused by some load balancing mechanism. It is interesting to observe that this type of change would go unnoticed most of the time if we looked for variations in the TTL of the TCP connections.
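To illustrate how a few flapping routes pull the average well above the median, the following Python sketch computes a per-path change frequency and then summarizes it across paths. The data are made up, and the function names and the choice of the "inclusive" quantile method are our own assumptions, not a description of how Table 1 was produced.

```python
import statistics


def change_frequency(paths):
    """Fraction of consecutive measurement pairs whose path differs."""
    pairs = list(zip(paths, paths[1:]))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a != b) / len(pairs)


def summarize(frequencies):
    """Average, median, 80th and 90th percentile across paths."""
    deciles = statistics.quantiles(frequencies, n=10, method="inclusive")
    return {
        "average": statistics.mean(frequencies),
        "median": statistics.median(frequencies),
        "p80": deciles[7],
        "p90": deciles[8],
    }


if __name__ == "__main__":
    # Hypothetical data: nine stable routes and one that flaps between two paths.
    stable = ["A"] * 100
    flapping = ["A", "B"] * 50
    freqs = [change_frequency(stable)] * 9 + [change_frequency(flapping)]
    print(summarize(freqs))
```

In this toy data set the median is zero while the average is not, which is the same qualitative pattern as in Table 1.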

A traceroute measurement takes between 3 s and 12 s, which is slightly longer than the average TCP flow duration. On the other hand, very few flows last longer than 10 minutes. In other words, these measurements give us an upper bound on how likely a "common" and a "long-lived" TCP flow (lasting a few seconds and a few minutes respectively) are to observe a route problem. We can therefore conclude that the probability of a route change during the lifetime of a TCP flow is on average smaller than 0.50% in the common case, and smaller than 1.73% for long flows.

As we can see, there is only a small increase, from 0.44% to 0.50%, in the probability of seeing a route change compared to the situation in 1995. The real increase in the instability of the network is probably larger, if we consider that the 1995 measurements covered longer and, therefore, more complex routes. Nevertheless, this is a small change if we take into account the growth rate of the network in size and traffic.
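As a rough check of this argument, one can normalize both figures by path length. The sketch below assumes, purely for illustration, that every hop changes independently and with the same probability p, so that a path of n hops changes with probability 1 - (1 - p)^n. Neither study verifies this assumption, and the function name is ours.

```python
def per_hop_probability(path_probability, hops):
    """Per-hop change probability p such that 1 - (1 - p)**hops equals the
    observed per-path probability, ASSUMING hops change independently and
    with equal probability (a simplification, not a measured property)."""
    return 1.0 - (1.0 - path_probability) ** (1.0 / hops)


if __name__ == "__main__":
    p_1995 = per_hop_probability(0.0044, 15.9)  # 1995 figures quoted from [5]
    p_2000 = per_hop_probability(0.0050, 11.9)  # figures from this study
    print(f"1995: {p_1995:.2e} per hop")
    print(f"2000: {p_2000:.2e} per hop")
    print(f"ratio: {p_2000 / p_1995:.2f}")
```

Under this simplified model the per-hop figure for 2000 comes out higher than the one for 1995 by a wider margin than the per-path figures suggest, which is consistent with the remark about longer routes in the 1995 measurements.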

Conclusion

We have studied the routing behavior in the Internet using different end-to-end measurements on packets. For this we have developed our own tools, which can be found in [3]. We have compared the results to what has been found previously by other researchers. We have found that routing instability in the Internet has increased only slightly since 1995, which can be surprising given its increasing complexity. The probability of seeing a problem along a route is fairly small (less than 0.5% on average), but it is not negligible, so all elements and protocols in the network should be made resilient to such problems.

Traceroute measurements are not a perfect tool, but they are the best we can do from the end hosts. Traceroute has problems of its own: it does not perform a continuous sampling and some routers do not provide information if the network is overloaded. Still, it is probably the best tool to study end-to-end routing behavior, because it keeps track of how packets are treated at each hop on the path to the destination.

References

[1]   NLANR network traffic packet header traces, http://moat.nlanr.net/Traces/

[2]   NLANR traceroute measurements, http://amp.nlanr.net/Active/raw_data/cgi-bin/data_form.cgi

[3]   Traceroute analysis tools, http://klamath.stanford.edu/tools/Traceroute/

[4]   C. Labovitz, A. Ahuja, F. Jahanian. "Experimental Study of Internet Stability and Wide-Area Network Failures". In the Proceedings of FTCS99, Madison, WI, June 22, 1999.

[5]   V. Paxson. "End-to-end routing behavior in the Internet". In Proceedings of ACM/SIGCOMM '96, pages 25-38, Stanford, CA, Aug 1996.