Enabling High Performance Data Transfers

System Specific Notes for System Administrators (and Privileged Users)


These notes are intended to help users and system administrators maximize TCP/IP performance on their computer systems. They summarize all of the end-system (computer system) network tuning issues including a tutorial on TCP tuning and easy configuration checks for non-experts.

 

Introduction

Today, the majority of university users have physical network connections that are at least 100 megabits per second all the way through the Internet to every important data center in the world (as well as to every other university user). For many users, that connection might be 1 gigabit per second or faster. In some countries (e.g. Korea and Japan) the same statement applies to every home connection as well: 100 Mb/s from home to all important web servers, data centers and to each other.

To put these data rates into perspective, consider this: 100 Mb/s is more than 10 megabytes in one second, or 600 megabytes (an entire CD-R image) in one minute. Clearly very few people see these data rates. However, some experts can get very high data rates. Why? The biggest strength of the Internet is the way in which the TCP/IP “hourglass” hides the details of the network from the application and vice versa. An unfortunate but direct consequence of the hourglass is that it also hides all flaws everywhere. Network performance debugging (often euphemistically called “TCP tuning”) is extremely difficult because nearly all flaws have exactly the same symptom: reduced performance. For example insufficient TCP buffer space is indistinguishable from excess packet loss (silently repaired by TCP retransmissions) because both flaws just slow the application, without any specific identifying symptoms.

Flaws fall into three broad areas: the applications themselves, the computer system (including the operating system and TCP tuning) and the network path. Each of these areas requires a very different approach to performance debugging. This page is focused on helping users and system administrators optimize the TCP/IP on their computer systems.

  • Applications sometimes perform poorly on long paths (even when the network is perfect) because they are not designed to fully overlap the speed of light delay to deliver the data with the processing at the end systems. It is quite difficult to write complicated applications that do this overlap properly, but it must be done for an application to perform well on a long network path. We have developed some tools and documentation  to help users and application developers to test and debug applications under these conditions. For example secure shell and secure copy (ssh and scp) implement internal flow control using an application level mechanism that severely limits the amount of data in the network, greatly reducing the performance all but the shortest paths. PSC is now supporting a patch to ssh and scp that updates the application flow control window from the kernel buffer size. With this patch, the TCP tuning directions on this page can alleviate the dominant bottlenecks in scp. In most environments scp will run at full link rate or the CPU limit for the chosen encryption.
  • Network paths can be very hard to debug because TCP’s ability to compensate for flaws is inversely proportional to the round trip time (RTT). So for example a flaw that will cause an application to take an extra second on a 1 millisecond path will generally cause the same application to take an extra 10 seconds on a 10 millisecond path. This “symptom scaling” effect arises because TCP’s ability to compensate for flaws is metered in round trips: if a given flaw is compensated in 50 round trips (typical for losses on a medium speed link), then a single loss affects a 1 ms path for only 50 ms, and a 10 ms path for 500 ms. Symptom scaling makes diagnosis particularly difficult, because flaws that are complete show stoppers on long paths are often undetectable on short paths. 

The objectives of this page are to summarize all of the end system network tuning issues, provide easy configuration checks for non-experts, and maintain a repository of operating system specific advice and information about getting the best possible network performance on these platforms.

In the Tutorial we will briefly explain the issues and define some terms. Under High Performance Networking Options we describe each of the optional TCP features that may have to be configured and provide a link to resources with specific host tuning recommendations.

Note that today most TCP implementations are pretty good. The primary flaws are default configurations that are ideal for Local Area Networks (LANs) and Internet back roads: many millions of relatively low speed home users.

Tutorial

The dominant protocol used on the Internet today is TCP, a “reliable” “window-based” protocol. The best possible network performance is achieved when the network pipe between the sender and the receiver is kept full of data.

Bandwidth*Delay Products (BDP)

The amount of data that can be in transit in the network, termed “Bandwidth-Delay-Product,” or BDP for short, is simply the product of the bottleneck link bandwidth and the Round Trip Time (RTT). BDP is a simple but important concept in a window based protocol such as TCP. Some of the issues discussed below arise because of the fact that the BDP of today’s networks has increased way beyond what it was when the TCP/IP protocols were initially designed. In order to accommodate the large increases in BDP, some high performance extensions have been proposed and implemented in the TCP protocol. But these high performance options are sometimes not enabled by default and will have to be explicitly turned on by the system administrators.

Buffers

In a “reliable” protocol such as TCP, the importance of BDP described above is that this is the amount of buffering required in the end hosts (sender and receiver).  If the BDP is small either because the link is slow or because the RTT is small (in a LAN, for example), the default configuration is usually adequate. But for paths that have a large BDP, and hence require large buffers, it is necessary to have the high performance options discussed in the next section be enabled.

Computing the BDP

To compute the BDP, we need to know the speed of the slowest link in the path and the Round Trip Time (RTT).

The peak bandwidth of a link is typically expressed in Mbit/s (or more recently in Gbit/s). The round-trip delay (RTT) for wide area links is typically between 1 msec and 100 msec, which can be measured with ping or traceroute

As an example, for two hosts with GigE cards, communicating across a coast-to-coast link over Internet2, the bottleneck link will be the GigE card itself. The actual round trip time (RTT) can be measured using ping, but we will use 70 msec in this example.

Knowing the bottleneck link speed and the RTT, the BDP can be calculated as follows:

1,000,000,000 bits 


1 second

*

1 Byte 


8 bits

*0.07 seconds = 8,750,000 bytes = 8.75 Mbytes

Based on these calculations, it is easy to see why the typical default buffer size of 4 MBytes would be inadequate for this connection. With 4 MBytes you would get only half of the available bandwidth.

The next section presents a brief overview of the high performance options. Specific details on how to enable these options in various operating systems is provided in a later section.

High Performance Networking Options

The options below are presented in the order that they should be checked and adjusted.

  1. Maximum TCP Buffer (Memory) space: All operating systems have some global mechanism to limit the amount of system memory that can be used by any one TCP connection.On some systems, each connection is subject to a memory limit that is applied to the total memory used for input data, output data and control structures. On other systems, there are separate limits for input and output buffer space for each connection.Today almost all systems are shipped with Maximum Buffer Space limits that are too small for today’s Internet. Furthermore the procedures for adjusting the memory limits are different on every operating system.
  2. Socket Buffer Sizes: Most operating systems also support separate per connection send and receive buffer limits that can be adjusted by the user, application or other mechanism as long as they stay within the maximum memory limits above. These buffer sizes correspond to the SO_SNDBUF and SO_RCVBUF options of the BSD setsockopt() call.The socket buffers must be large enough to hold a full BDP of TCP data plus some operating system specific overhead. They also determine the Receiver Window (rwnd), used to implement flow control between the two ends of the TCP connection. There are several methods that can be used to adjust socket buffer sizes:
    • TCP Autotuning automatically adjusts socket buffer sizes as needed to optimally balance TCP performance and memory usage. Autotuning is based on an experimental implementation for NetBSD by Jeff Semke, and further developed by Wu Feng’s DRS and the Web100 Project.
    • The default socket buffer sizes can generally be set with global controls. These default sizes are used for all socket buffer sizes that are not set in some other way. For single user systems, manually adjusting the default buffer sizes is the easiest way to tune arbitrary applications. Again, there is no standard method to do this, you must refer to operating system-specific procedures.
    • Since over buffering can cause some applications to behave poorly (typically causing sluggish interactive response) and risk running the system out of memory, large default socket buffers have to be considered carefully on multi-user systems. We generally recommend default socket buffer sizes that are slightly larger than 64 kBytes.
    • For custom applications, the programmer can choose the socket buffer sizes using a setsockopt() system call.
    • Some common applications include built in switches or commands to permit the user to manually set socket buffer sizes. The most common examples include iperf (a network diagnostic), many ftp variants (including gridftp) and other bulk data copy tools. Check the documentation on your system to see what is available.
    • This approach forces the user to manually compute the BDP for the path and supply the proper command or option to the application.
    • There has been some work on autotuning within the applications themselves. This approach is easier to deploy than kernel modifications and frees the user from having to compute the BDP, but the application is hampered by having limited access to the kernel resources it needs to monitor and tune.
  3. TCP Large Window Extensions (RFC1323): These enable optional TCP protocol features (window scale and time stamps) which are required to support large BDP paths.
    • The window scale option (WSCALE) is the most important RFC1323 feature, and can be quite tricky to get correct. Window scale provides a scale factor which is required for TCP to support window sizes that are larger than 64 KBytes. Most systems automatically request WSCALE under some conditions, such as when the receive socket buffer is larger than 64 KBytes or when the other end of the TCP connection requests it first. WSCALE can only be negotiated at the very start of a connection. If either end fails to request WSCALE or requests an insufficient value, it cannot be renegotiated later during the same connection. Although different systems use different algorithms to select WSCALE they are all generally functions of the maximum permitted buffer size, the current receiver buffer size for this connection, or in some cases a global system setting.
      Note that under these constraints (which are common to many platforms), a client application wishing to send data at high rates may need to set its own receive buffer to something larger than 64 KBytes before it opens the connection to ensure that the server properly negotiates WSCALE.
    • A few systems require a system administrator to explicitly enable RFC1323 extensions. If system cannot (or does not) negotiate WSCALE, it cannot support TCP window sizes (BDP) larger than 64k Bytes.
    • Another RFC1323 feature is the TCP Timestamp option which provides better measurement of the Round Trip Time and protects TCP from data corruption that might occur if packets are delivered so late that the sequence numbers wrap before they are delivered. Wrapped sequence numbers do not pose a serious risk below 100 Mb/s, but the risk becomes progressively larger as the data rates get higher.
      Due to the improved RTT estimation, many systems use timestamps even a low rates.
  4. TCP Selective Acknowledgments Option (SACK, RFC2018) allow a TCP receiver inform the sender exactly which data is missing and needs to be retransmitted.Without SACK TCP has to estimate which data is missing, which works just fine if all losses are isolated (only one loss in any given round trip). Without SACK, TCP often takes a very long time to recover following a cluster of losses, which is the normal case for a large BDP path with even minor congestion. SACK is now supported by most operating systems, but it may have to be explicitly turned on by the system administrator.If you have a system that does not support SACK you can often raise TCP performance by slightly starving it for socket buffer space, The buffer starvation prevents TCP from being able to drive the path into congestion, and minimize the chances of causing clustered losses.
  5. Path MTU The host system must use the largest possible MTU for the path. This may require enabling Path MTU Discovery (RFC1191, RFC1981, RFC4821).Since RFC1191 is flawed it is sometimes not enabled by default and may need to be explicitly enabled by the system administrator. RFC4821 describes a new, more robust algorithm for MTU discovery and ICMP black hole recovery.

Note that both ends of a TCP connection must be properly tuned independently, before it will support high speed transfers.

For further information about end  host tuning and instructions for specific operating systems, see the Energy Sciences Network fasterdata web pages.

 

Acknowledgments

Jamshid Mahdavi maintained this page for many years, both at PSC and later, remotely from Novell. We are greatly indebted to his vision and persistence in establishing this resource.

Thanks Jamshid!

Many, many people have helped us compile this information. We want to thank everyone who sent us updates, additions and corrections.

This material has been maintained as a sideline of many different projects, nearly all of which have been funded by the National Science Foundation. It was started under NSF-9415552, but also supported under Web100 (NSF-0083285) and the NPAD project (ANI-0334061).

Matt Mathis and Raghu Reddy