
Active Measures to Maintain a Healthy Network

PARTICIPANTS
Hans-Werner Braun
SDSC

Tony McGregor
University of Waikato, Hamilton, New Zealand

Modern computer networks--including the Internet--are designed to "work around" infrastructure defects. But this leads to a dilemma: If the system operates successfully even when there are problems, how can the people who run it determine when and where the network isn't performing as well as it should? Five years ago, the NSF established the National Laboratory for Applied Network Research (NLANR) to provide coordinated support services for advanced networks such as the vBNS and Abilene. An NLANR research group at SDSC is evolving a network analysis infrastructure for the academic high-performance computing community and is deploying a constellation of small monitoring devices throughout several networks to assess their health. These activities help with early identification of problems, and they enable network engineers and administrators to improve the stability, performance, and quality of service of existing networks and to design future networks.




Figure 1. The AMP Constellation
AMP monitors are deployed at 114 sites in the United States. Two additional sites in New Zealand and Norway are not shown.

"Our Network Analysis Infrastructure provides both engineering and research support for the academic HPC community by collecting and making available raw data, analyses, and visualizations of network measurements," said SDSC research scientist Hans-Werner Braun, who founded the NLANR program. Before coming to SDSC, Braun was a principal investigator for the NSFNET backbone, which in the late 1980s was a key factor in transforming the research-oriented, restricted-access ARPAnet into the modern Internet.

NLANR is a distributed organization, with groups at four sites. The other groups assist users of high-performance applications and provide technical support to network engineers. NLANR's Measurement and Analysis team, located at SDSC and led by Braun, develops and deploys tools for data acquisition and analysis of high-performance networks, and measures and analyzes traffic data for the vBNS, Abilene, and other high-performance networks.

ACTIVE NETWORK MONITORING

The Active Measurement Project (AMP) is an important part of NLANR's network analysis infrastructure. It began in 1998 with a suggestion by Tony McGregor, a Computer Science faculty member at the University of Waikato in New Zealand, and a single computer that pinged other universities' networks--sending short messages and "listening" for responses. McGregor spent half of 1998 on sabbatical at SDSC, where he and the NLANR team expanded the concept by adding support for multiple monitors, a separate data server, real-time data transfer, and a sophisticated data analysis package.

After his return to New Zealand, McGregor continued to work with NLANR, and he is currently the AMP project leader. The first AMP machines were installed at remote sites in December 1998. By July 2000, the project had deployed a constellation of 114 monitors across the United States, and more machines may be added. Monitors also are operating in New Zealand and Norway.

The AMP monitors--fast PCs with network interfaces--send messages through the network and observe the results. They perform site-to-site measurements of such variables as round-trip time (RTT), packet loss, path topology, and throughput across the vBNS and Abilene networks. Each AMP unit sends this information to the central AMP site at SDSC for processing and publication via the Web.
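
To make the measured quantities concrete, the following is a minimal sketch of how a batch of probe results for one pair of sites might be reduced to RTT and loss figures. It is not AMP's code: the function name `summarize_probes`, the use of `None` to mark an unanswered probe, and the choice of summary statistics are all assumptions made for illustration.

```python
from statistics import mean, median
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one batch of probes between two sites.

    Each entry is a round-trip time in milliseconds, or None when no
    reply came back (hypothetical convention for a lost probe).
    """
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    # Loss is the fraction of probes that never received a reply.
    loss = (sent - len(replies)) / sent if sent else 0.0
    return {
        "sent": sent,
        "received": len(replies),
        "loss_fraction": loss,
        "rtt_min_ms": min(replies) if replies else None,
        "rtt_median_ms": median(replies) if replies else None,
        "rtt_mean_ms": mean(replies) if replies else None,
    }
```

For example, `summarize_probes([10.0, 12.0, None, 14.0])` reports four probes sent, three received, a loss fraction of 0.25, and a median RTT of 12.0 ms. The RTT statistics are computed only over probes that were answered, so loss and delay stay separate signals.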


NETWORK HEALTH STATISTICS

The AMP research team analyzes network performance at and between these monitors and makes the "network health" statistics available to researchers and engineers in the form of numeric data and graphics. A Web page provides hyperlinks for the monitor sites; for each site, a table presents the RTT and loss from there to all of the other sites. If a secondary site from this table is selected, the page displays a year of RTT and loss data for the pair of sites. Additional hyperlinks lead to detailed displays for any day and give the RTT by time of day and as a frequency distribution.

For 116 monitors, there are 13,340 point-to-point paths for which statistics are gathered once per minute (Figure 1). With several gigabytes of data published every day, how can researchers and engineers find the "interesting" events that may indicate faults?
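
The path count follows from treating each ordered pair of monitors as its own path, since a measurement from A to B is separate from one from B to A. The short sketch below reproduces that arithmetic. The function names are illustrative, and the per-day figure simply assumes that every path is measured exactly once per minute, as stated above.

```python
def directed_paths(monitors: int) -> int:
    """Number of ordered (source, destination) pairs among the monitors."""
    return monitors * (monitors - 1)


def measurements_per_day(monitors: int, per_minute: int = 1) -> int:
    """Measurements per day if every path is probed `per_minute` times a minute."""
    minutes_per_day = 24 * 60
    return directed_paths(monitors) * per_minute * minutes_per_day


print(directed_paths(116))        # 13340
print(measurements_per_day(116))  # 19209600
```

Because the count grows roughly with the square of the number of monitors, each new site adds measurements to and from every existing site.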

Visualization tools are one solution. "We make our visualization and analysis software freely available to other researchers," Braun said. "We also invite collaborations, and we particularly encourage faculty and graduate students to consider the ample opportunities for thesis work based on our measurements."
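
One simple analysis that could be run on such measurements is to flag RTT samples that stand well above a recent baseline for the same path. The sketch below is an illustration under our own assumptions, not NLANR's software: the function name `find_rtt_spikes`, the five-sample window, and the threshold factor of 3 are all hypothetical choices.

```python
from statistics import median
from typing import List, Sequence


def find_rtt_spikes(
    rtts_ms: Sequence[float], window: int = 5, factor: float = 3.0
) -> List[int]:
    """Return indices of samples that exceed `factor` times the median
    of the `window` samples immediately before them."""
    spikes = []
    for i in range(window, len(rtts_ms)):
        # The median of the preceding samples serves as the baseline;
        # it is less affected by a single earlier outlier than the mean.
        baseline = median(rtts_ms[i - window:i])
        if rtts_ms[i] > factor * baseline:
            spikes.append(i)
    return spikes


samples = [20, 21, 19, 20, 22, 95, 21, 20]
print(find_rtt_spikes(samples))  # [5]
```

With this rule a sustained rise is flagged only until the rolling median catches up with the new level, so a check of this kind suits short spikes better than plateaus or slow rises, which would call for comparison against longer historical samples.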

"We are also developing several heuristics for automatic event detection," said McGregor at a recent conference. "Inspection of statistics and trends in the RTT --noise, jitter, spikes, slow rises, and plateaus--and comparisons with historical samples can help diagnose problems. Correlations between RTT events for various point-to-point paths on the network topology can help locate a fault. Our long-term objective is for the system to automatically send a notice to a network administrator that 'your link is flaky, and we recommend the following corrective actions...'" --MG *