High Performance Enabled SSH/SCP
Abstract: SCP and the underlying SSH protocol are limited in network performance by statically defined internal flow control buffers. These buffers often end up acting as a brake on the network throughput of SCP, especially on long and wide paths. Modifying the ssh code to allow the flow control buffers to be defined at run time eliminates this bottleneck.
Problem
High bandwidth, high latency links are becoming more prevalent in corporate and academic institutions. Applications that use windowing thus need to ensure that the window size is at least equal to the Bandwidth Delay Product, or BDP, in order to obtain maximum utilization of the link. The BDP is the product of the bandwidth of the narrowest portion of the network path and the round trip delay time, and represents the total data carrying capacity of the path. For TCP it is already possible to tune the TCP window size manually or to use an autotuning mechanism, such as the Web100 Linux kernel patch, to ensure maximum throughput. However, when applications above the TCP layer implement their own windowing, the limit on throughput becomes the lesser of the TCP window and the application window. In OpenSSH the limitation appears in the static window sizes that are defined as constants in channels.h.
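As a rough illustration of this relationship, the following Python sketch computes a BDP and the throughput ceiling that a window imposes. The function names, the 1 Gbps / 0.04 second path, and the 64 KiB application window are hypothetical values chosen for illustration; they are not taken from the OpenSSH source.

```python
def bdp_bytes(bottleneck_bps: float, rtt_seconds: float) -> float:
    """Bandwidth delay product of a path, in bytes."""
    return bottleneck_bps * rtt_seconds / 8


def throughput_ceiling(tcp_window: float, app_window: float,
                       rtt_seconds: float) -> float:
    """Upper bound on throughput, in bytes per second.

    At most one window of data can be in flight per round trip, and the
    smaller of the two windows is the one that binds.
    """
    return min(tcp_window, app_window) / rtt_seconds


if __name__ == "__main__":
    rtt = 0.04                       # round trip time in seconds (example value)
    path_bdp = bdp_bytes(1e9, rtt)   # 1 Gbps bottleneck (example value)
    print(f"BDP: {path_bdp:,.0f} bytes")

    # Hypothetical 64 KiB application window above a well-tuned TCP window.
    ceiling = throughput_ceiling(10_000_000, 64 * 1024, rtt)
    print(f"Ceiling with a 64 KiB application window: {ceiling / 1e6:.2f} MB/s")
```

However large the TCP window is made, the smaller application window still sets the ceiling, which is the situation the rest of this page addresses.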
Solution
Simply changing the static size to a larger value would only waste space whenever it is larger than the underlying protocol’s window size. Asking the user to specify the size would require users to be knowledgeable in network performance tuning. The ideal solution is to make the window large enough that it is no longer the limitation on throughput, but not much larger than it needs to be to obtain the desired performance.
There were only two changes needed to adjust the SSH window based on the TCP window. One was to enable the buffer code to allocate larger sizes. This was done using a variable that replaced the constant that was the maximum size allowed by the buffer code, and a function to modify the variable’s default value to something larger. The second change was to get the TCP window size from getsockopt and adjust the SSH window size to match, but only if the new size was larger than the old one. The value returned from getsockopt is also doubled, because OpenSSH only sends a WINDOW_ADJUST message when the window is half full. This reduces the number of WINDOW_ADJUST messages sent, at the cost of doubling the buffer size.
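The second change can be sketched as follows. This is a minimal Python illustration, not the patch's C code: the function name `grow_ssh_window` and the 128 KiB starting window are hypothetical, and reading the `SO_RCVBUF` option is an assumption, since the text only says that the TCP window size is obtained from getsockopt.

```python
import socket


def grow_ssh_window(sock: socket.socket, current_window: int) -> int:
    """Return the SSH channel window size to use for this connection."""
    # Assumption: the socket receive buffer size stands in for the TCP window.
    tcp_window = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    # Doubled because WINDOW_ADJUST is only sent once half the window is used.
    candidate = 2 * tcp_window
    # The window is only ever grown, never shrunk.
    return max(current_window, candidate)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Hypothetical starting window of 128 KiB.
        print(grow_ssh_window(s, 128 * 1024))
```

The result depends entirely on how the host's socket buffers are configured, so the printed value will differ from system to system.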
Tests
The following hosts were used in the performance tests. kirana was running a 2.6 Linux kernel with the Web100 patch. tg-login was running a 2.6 kernel without autotuning, but with a TCP window size of 10,000,000 bytes. The BDP of a 1 Gbps link with a 0.04 second round trip delay is 40,000,000 bits, or 5,000,000 bytes. A 300MB file was copied from /dev/shm on one machine to /dev/null on the other.
Hosts:
- kirana.psc.edu
- Dual PIII 1.0Ghz (Coppermine)
- 1Gig RAM
- GigaBit Ethernet
- tg-login.ncsa.teragrid.org
- Quad Itanium2 1.3Ghz
- 8Gig Ram
- GigaBit Ethernet
Traceroute log:
| 1 | bar-kirana-ge-0-2-0-0.psc.net | (192.88.115.169) | 0.292 ms | 9.452 ms | 0.204 ms |
| 2 | beast-bar-g4-0-1.psc.net | (192.88.115.18) | 0.129 ms | 0.099 ms | 0.094 ms |
| 3 | abilene-psc.abilene.ucaid.edu | (192.88.115.124) | 9.801 ms | 9.792 ms | 9.805 ms |
| 4 | nycmng-washng.abilene.ucaid.edu | (198.32.8.84) | 14.042 ms | 14.036 ms | 14.138 ms |
| 5 | chinng-nycmng.abilene.ucaid.edu | (198.32.8.82) | 34.341 ms | 41.711 ms | 34.326 ms |
| 6 | mren-chin-ge.abilene.ucaid.edu | (198.32.11.98) | 34.421 ms | 34.466 ms | 34.417 ms |
| 7 | sbr0-lsd6509.gw.ncsa.edu | (198.17.196.1) | 36.957 ms | 36.949 ms | 36.920 ms |
| 8 | acb-2-vlan101.gw.ncsa.edu | (141.142.0.6) | 37.010 ms | 36.957 ms | 36.943 ms |
| 9 | core-10-acb-2.gw.ncsa.edu | (141.142.0.133) | 37.091 ms | 36.965 ms | 36.958 ms |
| 10 | hg-core-core-10.gw.ncsa.edu | (141.142.0.138) | 38.300 ms | 38.866 ms | 38.312 ms |
| 11 | hg-1-hg-core.ncsa.teragrid.org | (141.142.47.34) | 38.739 ms | 39.187 ms | 38.340 ms |
| 12 | tg-login1.ncsa.teragrid.org | (141.142.48.5) | 36.996 ms | 36.959 ms | 36.950 ms |
Unmodified SCP Performance
| 3des-cbc | 1.3MB/s |
| arcfour | 1.9MB/s |
| aes192-cbc | 1.8MB/s |
| aes256-cbc | 1.8MB/s |
| aes128-ctr | 1.9MB/s |
| aes192-ctr | 1.8MB/s |
| aes256-ctr | 1.8MB/s |
| blowfish-cbc | 1.9MB/s |
| cast128-cbc | 1.7MB/s |
| rijndael-cbc@lysator.liu.se | 1.8MB/s |
Modified SCP Performance
| 3des-cbc | 2.8MB/s |
| arcfour | 24.4MB/s |
| aes192-cbc | 13.3MB/s |
| aes256-cbc | 11.7MB/s |
| aes128-ctr | 12.7MB/s |
| aes192-ctr | 11.7MB/s |
| aes256-ctr | 11.3MB/s |
| blowfish-cbc | 16.3MB/s |
| cast128-cbc | 7.9MB/s |
| rijndael-cbc@lysator.liu.se | 12.2MB/s |
Analysis
The tests showed that throughput increased dramatically, and that the limitation was no longer the TCP or SSH window size, but the ability of the host to encrypt at a rate fast enough to fill the Gigabit Ethernet link. This is clearly demonstrated by the vast performance difference between 3des-cbc, the slowest cipher, and arcfour, the fastest cipher.
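To make the comparison concrete, the short Python sketch below computes the per-cipher speedup from the two tables above and converts the modified rates to Mbit/s. The variable and function names are illustrative, and the conversion assumes that MB/s means 10^6 bytes per second, which the tables do not state.

```python
# Throughput in MB/s, copied from the two tables above.
UNMODIFIED = {
    "3des-cbc": 1.3, "arcfour": 1.9, "aes192-cbc": 1.8, "aes256-cbc": 1.8,
    "aes128-ctr": 1.9, "aes192-ctr": 1.8, "aes256-ctr": 1.8,
    "blowfish-cbc": 1.9, "cast128-cbc": 1.7,
    "rijndael-cbc@lysator.liu.se": 1.8,
}
MODIFIED = {
    "3des-cbc": 2.8, "arcfour": 24.4, "aes192-cbc": 13.3, "aes256-cbc": 11.7,
    "aes128-ctr": 12.7, "aes192-ctr": 11.7, "aes256-ctr": 11.3,
    "blowfish-cbc": 16.3, "cast128-cbc": 7.9,
    "rijndael-cbc@lysator.liu.se": 12.2,
}


def speedups(before: dict, after: dict) -> dict:
    """Ratio of modified to unmodified throughput for each cipher."""
    return {cipher: after[cipher] / before[cipher] for cipher in before}


if __name__ == "__main__":
    ranked = sorted(speedups(UNMODIFIED, MODIFIED).items(),
                    key=lambda item: item[1], reverse=True)
    for cipher, factor in ranked:
        mbit = MODIFIED[cipher] * 8   # assumes 1 MB = 10^6 bytes
        print(f"{cipher:30s} {factor:5.1f}x  {mbit:6.1f} Mbit/s")
```

Under that unit assumption even the fastest cipher stays well below the 1 Gbps line rate, while the unmodified results cluster within a narrow band regardless of cipher cost. Both observations are consistent with the bottleneck having moved from the window to encryption.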
Security implications
There are no security implications that we know of, with the following caveat: the use of the none cipher in the hpn+none patch is experimental and you must use it at your own risk. Its use via the -z switch in scp will transfer your bulk data in the clear even though your authentication is encrypted. This should, naturally, be seen as riskier than transferring data via an encryption cipher. Also, while we did our best to make sure that you can only use the none cipher to transfer bulk data via scp, it may be possible to run an interactive session with the none cipher (see the note of 15 January 2005). We're investigating this but we think this situation is unlikely. If you have issues with this, use the approved non-experimental hpn patch.
Version History
- March 28, 2005: Fixed Typo in HPN Patch for 3.9
- I had a comma in the wrong place. Sorry if you had problems with the original version. I decided not to update the version number for this very minor fix.
- March 25, 2005: Updated HPN Patch for OpenSSH version 3.9p1
- This patch removes some unnecessary changes in the buffer.c code, specifically the buffer->unlimited member of the buffer struct. It also makes some minor changes to the buffer sizes used on the scp <-> ssh pipe. The net result is a 20 to 60% decrease in the number of read/write syscalls in a typical transfer. This version of the 3.9p1 patch is also aware of OpenSSH version 4.0(p1).
- January 15, 2005: Security Fix for HPN/None Cipher Experimental Patch
- This is a security fix to the combined hpn/none cipher patch referenced in the experimental patch release of November 12, 2004. This patch explicitly checks to see if the tty_flag is set prior to switching to the none cipher. It also *disables* the none cipher switch if the -T (no_tty_flag) switch is used. Lastly, it also sends a warning to stderr when the switch to the none cipher takes place. If you are not using the combined hpn/none patch then you do not need to apply this patch.
- November 12, 2004: Combined HPN and None Cipher Experimental Patch
- This is a combined patch for high performance networking through the use of dynamically resized buffers and the none cipher. This will only work with non-interactive sessions created by scp. With this patch the authentication IS encrypted and only after the authentication key exchange is complete will the cipher switch to none for the bulk data transfer. USAGE: scp -z file user@destination:path
- September 15, 2004: Patch for OpenSSH 3.9p1
- This patch will work cleanly with the OpenSSH 3.9p1 source code. Otherwise no changes to functionality were made.
- July 20, 2004: Compatibility Testing fix
- This patch includes a test to check the version of the server. If the server does not incorporate the enhanced windowing routine it will be disabled. This will prevent the sshd buffer bug from being triggered in older versions of sshd.
- July 13, 2004: sshd input buffer fix
- Our code uncovered a bug in the manner in which the input buffer in sshd grew. It was possible to make the input buffer grow larger than a set maximum bound leading to a fatal exception for that sshd process. We’ve addressed this by explicitly checking the size of the buffer before allowing it to grow.
- July 12, 2004: Window Size fix
- Based on conversations on the openssh developers mailing list we’ve imposed a maximum size on the network buffer of 2^30-1 bytes. Because of how the ssh code uses buffers this is an effective limit of 2^29-1 bytes or 512 Megabytes. This should be sufficient for all but the longest and fattest network paths. It might be possible to increase this by 1 bit in the future but this should be sufficient for now.
- The CVS and Portable patches have been rolled into one patch. It might throw a warning against the Portable version but it shouldn’t cause any problems.
- July 7, 2004: Initial release
- First release of openssh-3.8.1p1-dynwindow patch v. 0.1
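To put the buffer limit from the July 12, 2004 entry in perspective, the Python sketch below estimates the longest round trip time at which the effective window still covers the BDP. The function name and the link speeds are hypothetical examples; only the two size limits come from the entry above.

```python
MAX_BUFFER = 2**30 - 1         # maximum network buffer size, in bytes
EFFECTIVE_WINDOW = 2**29 - 1   # effective limit stated in the entry, in bytes


def max_rtt_covered(bandwidth_bps: float,
                    window_bytes: int = EFFECTIVE_WINDOW) -> float:
    """Longest round trip time (seconds) at which the window covers the BDP."""
    return window_bytes * 8 / bandwidth_bps


if __name__ == "__main__":
    for gbps in (1, 10):       # example link speeds only
        rtt = max_rtt_covered(gbps * 1e9)
        print(f"{gbps:2d} Gbps: window covers round trip times up to {rtt:.2f} s")
```

At the link speeds used as examples here, the limit only becomes a constraint at round trip times far beyond the roughly 0.04 seconds of the test path.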
Patches
OpenSSH-3.9p1 HPN patch v. 0.1
OpenSSH-3.9p1 HPN with None Cipher patch v. 0.2
Experimental. Use at own risk!
See news of 15 January 2005 for more information.
OpenSSH-3.9p1 HPN with None Cipher patch v. 0.1
Experimental. Use at own risk!
See news of 12 November 2004 for more information.
Link removed as of 15 January 2005
OpenSSH 3.8.1
OpenSSH-3.8.1p1 DynamicWindow patch v. 0.6
OpenSSH-3.8.1p1 DynamicWindow patch v. 0.5
OpenSSH-3.8.1p1 DynamicWindow patch v. 0.4
OpenSSH-3.8.1p1 DynamicWindow patch v. 0.3
OpenSSH-3.8.1p1 DynamicWindow patch v. 0.2 (not available)
OpenSSH-3.8.1p1 DynamicWindow patch v. 0.1
News/Updates
Update 1.15.2005
We discovered a minor problem in that the experimental none cipher switching patch did not perform as expected. A failsafe which I thought was in place to prevent switching to the none cipher during interactive sessions wasn't performing as expected. The result being that it was possible to send data in the clear. However to do this the user would have to explicitly pass the -z switch to ssh. So if you didn't do that you did not send data in plaintext. However, I urge all users of the experimental patch to upgrade to the latest version as soon as possible. In the new patch I explicitly test for the tty_flag before the none cipher switch takes place. If the tty_flag is true then the cipher switch fails silently and the session continues with the original encryption cipher (this silent failure mode may change in future releases of this patch). Additionally, we now also check to make sure the -T (no_tty_flag) switch is not set before enabling the none cipher.
Update 1.12.2005
We've started putting hpn-ssh on more production systems so we've been able to get a better idea of how it might perform in the wild. Initial results look good but it did illustrate the need to have your network buffers tuned properly. hpn-ssh will not make ssh run faster if your system is mistuned. Essentially it can only work with what it has. Also, never forget that disk I/O operations will be a likely bottleneck in some systems. An interesting test is to transfer a large file to disk and then transfer the same file to /dev/null. If there is a notable difference in throughput you are disk bound. Also, we're about to start up more active development again. We hope to tighten up some code and rethink some assumptions. This will likely just be incremental improvements. However, we also have some other ssh projects we'll be pursuing. More on that in a couple of months.
Update 11.23.2004
Quick note here. The previously discovered problem with corrupted MACs on input actually stems from a hardware and/or driver bug with the intel e1000 cards. We think this fully justifies our continued use of HMAC even without data encryption - we'd not have found this bug without it. We have seen some problem with the syskonnect interface we replaced it with though. It turns out that under high loads the vm.min_free_kbytes is too low and crashes the interface. We set it to 12MB and it seems to work fine now.
Update 11.19.2004
Last night we conducted an endurance test using the cipher switching version of hpn-ssh. Using pipes we pushed data from /dev/zero on a host at NCSA to /dev/null on a host at PSC. We were able to move 4.343 terabytes of data in 17 hours 38 minutes at an average rate of 71MBps. Instantaneous transfer rates over 100MBps were seen with some regularity. Please note that this test did not make use of the disk subsystem and therefore the results should be viewed as having occurred under laboratory conditions. In more practical situations users will see their throughput limited by the speed of their disk I/O.
Update 11.12.2004
We're back from the Supercomputing Conference and I want to thank everyone who saw the poster and talked to me during the course of the past week. There has been a lot of interest and, hopefully, a lot of support will be forthcoming. In the meantime I've decided to release the combined hpn and none cipher patch. This is the first iteration of the patch and the none cipher support might be changing in the near future, so long term compatibility is not assured. However, what this does is allow for midstream cipher switching. In particular the undocumented '-z' switch to scp will provide for encrypted authentication and unencrypted data transfer. We've seen speeds of over 692Mbps using this patch. This will *not* work with interactive (TTY) sessions - which you don't really want anyway. If the server doesn't support the none switch then the connection will fail silently. Regardless, use this patch at your own risk and view this as an experimental patch. Lastly, the use of the none cipher is *within* spec for the SSH v2.0 protocol. OpenSSH simply decided to not implement it for obvious reasons.
Update 10.28.2004
We'll be presenting this work at the SuperComputing 2004 Conference in Pittsburgh, Pennsylvania, November 9-12. The poster presentation is officially from 5-7pm on the 9th so if you want to stop by and say hello we'll be there. Also, you can see some of the throughput rates we've been getting by going to the Internet2 Weekly Top Flow Reports. We start showing up during the week of September 20, 2004. Look in the Non-Measurement flow section for flows between PSC-NCNE and NCSA using port 52222. We've recently hit 50MB/s and we were strictly processor limited.
Update 9.27.2004
We have uncovered some problems with using 9000 byte packets on the Linux system. When we pull data from the Linux system we consistently get MAC errors (ssh checksum indicating data corruption). We were able to recreate this with an unpatched SSH server so it doesn't seem to be a result of our patch. Also, the previously mentioned user report of asymmetrical transfer rates was determined to be an issue with the system buffers and not the SSH code. If anyone else is seeing this behaviour please let us know.
Update 9.24.2004
We resolved the problem on the linux 2.4 autotuning kernel by upgrading to 2.6. There also seems to be a minor problem with the way linux does memory accounting for the windows which might cause a problem with the rcv_ssthresh. However, users with non-autotuning kernels should not see a problem. As of this evening we were able to sustain 35MBps (280Mbps) in both directions. The OS X issue is still being explored.
Update 09/24/2004
We have recently heard of and experienced some problems with asymmetrical performance. The first is on a Linux 2.4 autotuning kernel. In this case it seems that a TCP bug *might* be preventing the window from updating properly. On Mac OS X (10.3.4 1.33Ghz CPU) we were able to get the throughput asymmetry down to a 2:1 ratio (8.0 MBps sending v. 3.5 MBps receiving) but there is indication of some CPU bounding issues. We have another report of a user also seeing highly asymmetrical throughput but we've yet to be able to determine the cause. Reports from other users would be *very* useful at this stage.
The National Science Foundation and
Cisco Systems, Inc.