NPAD Diagnostics Servers

Automatic diagnostic server for troubleshooting end-systems and last-mile network problems

The NPAD diagnostic server, Pathdiag, is designed to easily and accurately diagnose the problems in the last-mile network and end-systems that are the most common causes of severe performance degradation over long end-to-end paths. Our goal is to make the test procedure easy enough, and the report it generates clear enough, to be suitable for end-users who are not networking experts. In most situations a single test run, launched from a web page, will generate a report that enumerates all problems affecting downloading (fetching) of data from a remote site. Although the report contains extensive explanations of the results, we do not assume that end-users will be able to correct network problems themselves. The report includes guidance to help end-users properly engage a system or network administrator, along with the information the administrator needs to locate the problem.


Support Status

This is still an experimental service. The procedures and reports are not quite as clear and easy to use as we would like. There is still room for improvement. Individual servers may be down for extended intervals and we reserve the right to make changes in the future. You can help us improve this service by using it and providing feedback. We are particularly interested in cases where the documentation or results are inaccurate, incomplete, or misleading.

BTW: If you are hot on the trail of a network performance problem and pathdiag is not helpful, please get us involved before you fix your network problem, so we can debug pathdiag on live measurement results. If you manage to find a situation that confuses pathdiag you have an opportunity to get some free network consulting while we figure out why it missed the mark!

Please send questions, comments and suggestions to nettune@psc.edu

Introduction

The NPAD project addresses the set of problems associated with end-hosts and their connections (the "last-mile") to a high speed backbone network.

Universities and research institutions are typically connected to a high speed backbone through a GigaPoP or other network providing regional traffic aggregation. Since backbones such as Internet2 and ESnet are generally well provisioned and monitored, a performance problem is usually within the edge network, somewhere along its connection to the GigaPoP, or in the end-system itself. But as described in the next section, TCP's robustness in the presence of flaws often makes it difficult for local tests to detect and troubleshoot these problems in the last mile.

The diagnostic servers made available through this project are intended to help in troubleshooting these performance problems. There are two ways for end-users to access these diagnostic servers:

In addition, expert users will (in the future) be able to run the pathdiag tool in standalone mode without the web-server framework. This will permit networking experts to use local techniques for diagnosing flaws in the interior sections of their network.

Theory and Method

Network performance debugging, often called "TCP tuning", is an extremely difficult task because nearly all flaws have identical symptoms: reduced performance (data throughput). For example, if the network card is dropping packets because of a bad cable, the lost packets are silently retransmitted by the TCP retransmission algorithm. The user would never observe missing data or data corruption. The only symptom is that the connection took a little longer than it should have, while the missing data was retransmitted.

Nearly all network flaws, including improper configurations, bad interface cards, bad cables, etc., have the same symptom: reduced performance.

The consequences of this "single symptom" property are compounded by another effect: TCP's ability to compensate for flaws is inversely proportional to the round trip time (RTT) of the path being tested. For example, a flaw that causes an application to take an extra second on a 1 millisecond path will generally cause the same application to take an extra 10 seconds on a 10 millisecond path. This "symptom scaling" effect arises because TCP's ability to compensate for flaws is metered in "round trips" or RTTs: if a given flaw is compensated in 50 round trips (typical for losses on a medium speed link), then a single loss affects a 1 ms path for only 50 ms, whereas a 10 ms path will be affected for 500 ms.
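The arithmetic behind symptom scaling can be sketched in a few lines. This is only an illustration of the example above: the figure of 50 round trips is the one quoted in the text, and the function name is made up for this sketch.

```python
def recovery_time_ms(rtt_ms: float, round_trips: int = 50) -> float:
    """Time a connection spends compensating for one flaw, in milliseconds.

    round_trips=50 is the illustrative figure from the text, not a constant
    of TCP; the real number depends on the link speed and the flaw.
    """
    return rtt_ms * round_trips


# The same flaw costs more wall-clock time as the path gets longer.
for rtt_ms in (1, 10, 100):
    print(f"RTT {rtt_ms:>3} ms -> about {recovery_time_ms(rtt_ms):,.0f} ms affected per loss")
```

The cost per flaw grows linearly with the RTT, which is why a flaw that is barely measurable on a local path can dominate a long one.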

Symptom scaling makes diagnosis particularly difficult, because flaws that are show-stoppers on long paths may be undetectable on short paths.

Anybody who has been involved much in network diagnosis is likely to have run into the following situation:

             client                 Server
               |                      |
               +-+----------------+---+
               A B                C   D

Say you are trying to debug an application on a long path from (A) to (D) that passes through (B) and (C). You can easily test (A) to (B) and (C) to (D), both of which pass your tests, so you think you can inductively "prove" that the flaw is between (B) and (C). But the truth may be that the real flaw is between (A) and (B), which has a very short RTT, so the flaw is effectively masked by TCP. The flaw is only detectable with long RTT connections that include not only the section from (A) to (B) but also a high delay section such as the one from (B) to (C).
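To see how a short section can hide a flaw, the following sketch applies the throughput model from [MSMO97] (see the Glossary), rate ≈ (MSS / RTT) × C / sqrt(p), to the same loss rate at two different RTTs. All the numbers are assumptions chosen for illustration: a 1460 byte segment size, C = sqrt(3/2), a loss rate of 1 in 10,000 on the section from (A) to (B), a 1 ms RTT for the local test, and a 70 ms RTT for the full path.

```python
import math

MSS_BYTES = 1460       # assumed segment size (typical for a 1500 byte MTU)
C = math.sqrt(3 / 2)   # assumed model constant; it varies with ACK strategy and loss pattern


def model_rate_bps(rtt_s: float, loss_rate: float) -> float:
    """Approximate TCP throughput limit from the [MSMO97] model, in bits per second."""
    return (MSS_BYTES * 8 / rtt_s) * (C / math.sqrt(loss_rate))


LOSS_RATE = 1e-4       # hypothetical flaw between (A) and (B)

local = model_rate_bps(0.001, LOSS_RATE)      # local test from (A) to (B)
end_to_end = model_rate_bps(0.070, LOSS_RATE) # full path from (A) to (D)

print(f"(A)-(B), 1 ms RTT:  model limit {local / 1e6:8.1f} Mb/s")
print(f"(A)-(D), 70 ms RTT: model limit {end_to_end / 1e6:8.1f} Mb/s")
```

Under these assumptions the model limit on the local test is above a gigabit per second, so the link itself is the bottleneck and the test passes, while the same loss rate holds the long path to roughly 20 Mb/s.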

In a nutshell, this is why the "end-to-end" problem is so persistently difficult: there is only one symptom, reduced performance, and that one symptom is proportional to RTT, so the vast majority of local flaws are undetectable with local tests.

The pathdiag tool accounts for RTT scaling effects by taking advantage of the instruments available in a Web100 instrumented kernel. In order to do this, pathdiag needs to know some key parameters of the TCP connection over the long path: the target data rate for the application, the round trip time of the entire path, and (in the future) any MTU limit imposed elsewhere in the path. Continuing our example above, by knowing the RTT between (A) and (D), the target data rate for the application, and by measuring the effect of any flaws in the path from (A) to (B) it can estimate the impact of these flaws on the application running over the entire path from (A) to (D).
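One way to picture this extrapolation is to turn the target data rate and target RTT into a loss budget and compare the loss measured on the short section against it. The sketch below inverts the [MSMO97] model (see the Glossary) to do that. It is an illustration under assumed values (1460 byte segments, C = sqrt(3/2), a 100 Mb/s target), not a description of pathdiag's internal calculations, which this page does not spell out.

```python
import math

MSS_BYTES = 1460       # assumed segment size
C = math.sqrt(3 / 2)   # assumed model constant


def max_loss_rate(target_bps: float, rtt_s: float) -> float:
    """Largest loss probability at which the model still reaches target_bps."""
    return (C * MSS_BYTES * 8 / (rtt_s * target_bps)) ** 2


TARGET_BPS = 100e6     # hypothetical application target rate

for rtt_ms in (1, 10, 70):
    p = max_loss_rate(TARGET_BPS, rtt_ms / 1000)
    print(f"RTT {rtt_ms:>2} ms: loss budget {p:.2e} (about 1 loss per {1 / p:,.0f} packets)")
```

The budget shrinks with the square of the RTT, so a loss rate that is harmless on the section from (A) to (B) can be far over budget for the path from (A) to (D). Measuring the loss on the short section and judging it against the long-path budget is what makes the flaw visible.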

Unlike other testing methodologies, pathdiag gets more sensitive as you shorten the path section from (A) to (B), for example by picking a new (B) closer to (A). If the RTT is small enough, flaws that are show-stoppers for the entire path do not interfere with other diagnostic tests, permitting a single pathdiag run to detect multiple flaws. Typically, when debugging a long end-to-end path with conventional techniques, each flaw has to be diagnosed and corrected before you can even detect the next flaw: debugging on a long path is highly serial. With pathdiag, a single run is likely to fully diagnose multiple flaws.

Although pathdiag can be deployed in a number of ways, the approach of embedding the tool in diagnostic servers at a number of GigaPoPs makes it easy to diagnose network flaws at the edges of the network.

The server itself is located at (B), typically in a GigaPoP or near the edge of a high speed backbone. The diagnostic client that runs at (A) is either a lightweight Java applet that can run in any standard web browser or a simple C program that can be compiled on any Unix-like system.

Note that the data has to flow from the diagnostic server at (B) towards the client at (A). This is because pathdiag relies on the Web100 instrumentation in the TCP sender to measure critical TCP parameters. For most applications, where a user at (A) is retrieving data from (D) this is the correct direction for the test. If the primary flow is in the opposite direction, pathdiag may not be able to detect some flaws. However, since most flaws affect data flowing in both directions, most would still be diagnosed.

Procedure for Using the NPAD Diagnostic Servers

To test your network connection with pathdiag, you need to do the following things:

Current NPAD Diagnostic Servers

Select the NPAD diagnostic server that is the closest to you in terms of network round trip time. This will generally be the geographically closest server connected to the same national backbone as you are. The servers below are organized by connected backbone and sorted east to west.
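If you are unsure which server is closest, comparing round trip times is straightforward. The sketch below approximates the RTT by timing one TCP handshake and then picks the smallest value. The host names and the measurements in the usage section are placeholders invented for this example, not real NPAD servers or results.

```python
import socket
import time


def connect_rtt_ms(host: str, port: int = 80, timeout: float = 2.0) -> float:
    """Approximate the RTT as the time to open one TCP connection.

    The figure includes name resolution, so treat it as a rough estimate.
    Returns infinity if the host cannot be reached.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return float("inf")


def closest(rtts_ms: dict[str, float]) -> str:
    """Return the host with the smallest measured RTT."""
    return min(rtts_ms, key=rtts_ms.get)


if __name__ == "__main__":
    # Placeholder measurements; in practice build this dictionary with
    # connect_rtt_ms(host) for each server on your backbone's list.
    rtts = {"npad-east.example.net": 12.4, "npad-west.example.net": 68.0}
    print("Closest server:", closest(rtts))
```

A handshake timing is used here instead of ping because some networks filter ICMP; either measurement is adequate for choosing among a handful of servers.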

Interpreting the Results

When you go to the nearest NPAD server and run a diagnostic test as suggested above, pathdiag returns a web page that reports all of the test results. The messages indicate which tests passed or failed, and appropriate actions for further debugging. Consider bookmarking each report so you can refer back to an earlier test, or forward it on to an expert for further analysis.

Briefly, the results page shows the following:

Outcomes

The NPAD diagnostic server can detect nearly all flaws in the last mile and end-system under test. But it cannot repair the flaws, nor can it detect flaws elsewhere, so once you have test results in hand you have to use them to get the right people to take corrective action and/or perform additional tests.

For this reason it is especially important that you keep good notes of your experiments and record the results (add the reports to your bookmarks or favorites). When you report a problem to somebody else, expect to be asked for the test results. We suspect that most people would rather that you paste the report URL into email than send the entire report as an attachment.

The test outcomes fall into several broad categories:

End system (target or client) flaws

These are flaws in the computer system that is acting as the test target (the web client) at one end of the path under test. They are best corrected by having a system administrator refer to the detailed tuning directions at PSC's TCP tuning page or the similar pages at LBL. Note that some operating systems may be missing required TCP features. Such systems cannot be expected to perform well and should be upgraded or replaced.

In most organizations, the networking group is only responsible for the network as far as the connector on the wall. Generally they cannot (or will not) make changes to computer systems which are not theirs. Only the owner or a properly authorized system administrator should make changes to the end system.

Path flaws

To further localize the flaws, test shorter subsections of this path or partial alternate paths by using additional testers and targets. Since there can be hidden switches and other invisible infrastructure, it is rarely effective to debug a network path without participation by the responsible network engineer. Unless you have access to the physical network and software configurations, you should not try to debug the path, except for a couple of specific checks:

Do NOT attempt to do detailed path debugging unless you have access to both the physical network (e.g. keys to the closets) and the configurations of the switches and routers (e.g. passwords), as well as the details of the network design. Modern networking gear can have a complex logical (virtual) topology that is entirely different from the physical topology. Unless you know exactly how the data flows through the hardware, you cannot locate flaws using intuitive debugging techniques.

If you have access to the physical network and configurations, the easiest way to debug the path using an NPAD diagnostic server is to connect a portable diagnostic client to various places in the network, either by physically carrying a laptop to various wiring hubs or by connecting it logically by reconfiguring VLANs.

In the future, we plan to support a standalone version of pathdiag that does not use the web-based client-server framework described in this document. This "expert mode" will permit much greater flexibility in placing testers and targets at arbitrary locations in the network, at the expense of requiring significantly more expertise to configure and deploy.

Tester flaws

Tester flaws are often not persistent and will not recur on later runs of the same tests. If they do recur, flaws that seem to be related to this particular server (e.g. server bottlenecks) should be reported to the site contact for the server. Flaws that may indicate oversights or bugs in the tester itself (e.g. messages about unexpected events) should be reported to nettune@psc.edu. In any case we periodically retrieve results from public NPAD diagnostic servers and inspect the reports for accuracy. We pay particular attention to all reports indicating tester problems.

No path or target flaws

If the target and path both pass all tests, you should be done, and if you are lucky, your application will work. If not, you need to test the path with a traditional end-to-end diagnostic tool (e.g. iperf or ttcp). If the traditional diagnostic test fails:

If the traditional end-to-end diagnostic test passes:

Glossary

End-System

The computer system at one end of a network connection or path. While this term can encompass any device that can be connected to the network, in this document it most frequently refers to a PC or computer system used by end-users.

End-User

A network or application user who is an expert in something other than networking, computer systems or network applications - a typical user.

End-to-end path (or test)

The path all the way from one end-system to another.

Flaw

An imperfection, often concealed, that impairs soundness (www.dictionary.com).

Any defect in hardware / software / configuration with respect to the network connection of a host or a network component such as a switch or a router.

Last-mile

The part of the network that goes from a host to a high speed backbone such as Internet2 or ESnet.

[MSMO97]

"The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm", a paper by Matthew Mathis, Jeffrey Semke, Jamshid Mahdavi, and Teunis Ott, Computer Communication Review, volume 27, number 3, July 1997, that introduced the "one over the square root of the loss rate" model for TCP performance.

Pathdiag server (or service)

A web wrapper that makes pathdiag easy to run, with no requirement to install software on the local machine.

Pathdiag client

A small program that a user uses to invoke a test from a pathdiag server back to the user's machine. If you are using the directions on this page, the client is generally your web browser, which is also the test target.

Pathdiag (stand alone)

The pathdiag tool that can be run without the server. This will require a test host with a web100 kernel and other supporting software, and will be covered under a future document on advanced pathdiag techniques.

(Path) section

A part of an end-to-end path. The first step to debugging a long network path is often determining which section has a flaw.

Performance Measurement Point

A "landmark" along a long network path, used to determine which section of the long path has a flaw, by providing a stable, well known end-system for testing.

Single Symptom

Situation in which many different varieties of flaws at various locations all have the same symptom, reduced performance.

Symptom

A characteristic sign or indication of the existence of something not being right.

Something a user may observe in the presence of a flaw that indicates only that something is wrong, without identifying the location or the nature of the problem.

Symptom Scaling

Situation in which a symptom caused by a flaw which is clearly observable on a long path is almost undetectable when tested on a short path. The observable symptom scales with the Round Trip Time (RTT) of the path.

This can be a serious impediment to diagnosing the problem, because testing on a short path is easy and can be done in a controlled environment, whereas testing on a long path introduces many unknown variables beyond the control of an organization.

Receiver Window

The portion of the TCP protocol that implements flow control. When the receiving application slows down, it signals the sending application by closing the receiver window. Note that the receiver window is actually the amount of free space in the TCP receiver's buffers and is therefore constrained to be no larger than the receiver's TCP buffer size.

Target

Pathdiag tests the network using a TCP connection between the tester and target. If you are using the directions on this page, the target is always the same as the pathdiag client.

Target data rate

The user specified data rate, which is the goal for the application over the entire end-to-end path.

Target round trip time (RTT)

The user specified round trip time of the entire end-to-end path.

TCP (socket) buffer size

The amount of buffer space that TCP is permitted to use to store unacknowledged data (on the sending side) or undelivered data (on the receiving side).
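The buffer size matters because a sender can have at most one window of data in flight per round trip, so the buffer must hold at least the target data rate multiplied by the target RTT. The sketch below works through that calculation for assumed values (a 100 Mb/s target and a 70 ms RTT), and also shows the rate ceiling imposed by a 65,535 byte window, the largest that TCP can advertise without the window scale option.

```python
def required_buffer_bytes(target_bps: float, rtt_s: float) -> float:
    """Minimum buffer (bandwidth-delay product) needed to sustain target_bps."""
    return target_bps * rtt_s / 8


def window_limited_rate_bps(window_bytes: float, rtt_s: float) -> float:
    """Highest rate a connection can reach with a given window and RTT."""
    return window_bytes * 8 / rtt_s


TARGET_BPS = 100e6   # hypothetical target data rate
RTT_S = 0.070        # hypothetical target round trip time

print(f"Buffer needed: {required_buffer_bytes(TARGET_BPS, RTT_S):,.0f} bytes")
print(f"Ceiling with a 65,535 byte window: "
      f"{window_limited_rate_bps(65535, RTT_S) / 1e6:.1f} Mb/s")
```

With these numbers the unscaled window caps the connection at well under a tenth of the target rate, which is one reason an end-system can fail the test even when the path itself is clean.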

Tester

Pathdiag runs in the tester to test the path between the tester and target. If you are using the directions on this page, the tester is always the same as the pathdiag server.


About NPAD

Network Path and Application Diagnosis is a joint project of the PSC and NCAR, funded under NSF grant ANI-0334061. This project is focused on using Web100 and other methods to extend fairly standard diagnostic techniques to compensate for the "symptom scaling" that leads to false positive diagnostic results on short paths.

Matt Mathis, John Heffner, and Raghu Reddy
Please send comments and suggestions to nettune@psc.edu