Background
Network testing on Probe was motivated by the observation that wide-area network transfers were erratic in speed and considerably slower than expected. ORNL in particular was observing this behavior and so became the working partner for our investigations. Observed speeds were typically less than one megabyte per second (MB/s), although short bursts of up to 9 MB/s were seen. We believe that the data path from NERSC to ORNL is limited by a 100 megabit/sec (Mb/s) security monitoring link at the NERSC end and by a 155 Mb/s link to ESnet at the ORNL end. The slower of the two, at 100 Mb/s, should allow up to 12.5 MB/s.
The test suite
To facilitate the testing, the "testrig" network test suite was identified and installed. See http://www.ncne.nlanr.net/TCP/testrig/ for details. The suite contains code to analyze tcpdump captures and to create graphical displays of the network activity. A graphics package provides the display and allows zooming in to see fine detail. The package makes it easy to observe dropped packets, retransmissions, timeouts, buffer sizes and other details of the transfers. A variety of quantitative measurements can be read off the display with the cursor.
Tests and results
A number of transfers were measured between local hosts and between NERSC and ORNL hosts. These used primarily ttcp, a tool for generating network test packets, with parameters to set network driver buffer sizes. A few tests were done with ftp and with hsi (hierarchical storage interface). All the effects observed with ftp and hsi were also observed with ttcp.
Tests between local hosts
All of these tests were done without using jumbo packets. The best performance was obtained sending from swift-g0 to quail-g0.nersc.gov. A transfer using pftp got 23 MB/sec. A transfer using hsi (a new version under test) got 32 MB/sec. There were no timeouts and no retransmissions. It appears that the data rate was strictly host limited.
Detailed examination shows a pattern of 32 KB writes (we noted that hpss_netopt.conf has a writesize of 32 KB), with 0.4 ms pauses between the 32 KB groups. This was true for both hsi and pftp. Within the 32 KB groups, the data rate was about 55 MB/sec. We speculate that without jumbo packets, 55 MB/sec is probably about the best that these hosts can do. Adding in the pauses between 32 KB groups, the speed drops to about 32 to 33 MB/sec. (32 KB per group at 55.5 MB/sec takes ~0.6 ms (32 K/55.5 M); adding the pause of 0.4 ms, we get 32 KB per 1.0 ms, or roughly 32 MB/sec.)
At a larger scale, we observe either 256 KB sequences (pftp) or 1 MB sequences (hsi). There is a 2.3 ms pause between these sequences, which lowers the data rates to approximately those given above for each type of client. We speculate, but have not confirmed, that the 256 KB and 1 MB effects are due to disk I/O buffer sizes.
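The arithmetic above can be checked with a small model in which each chunk is sent at a burst rate and then followed by a fixed pause. This is only a sketch of the reasoning, not the analysis code used for the tests. The function name and the unit conventions (KB as 1024 bytes, MB/sec as 10^6 bytes per second) are assumptions introduced here.

```python
# Sketch only: a two-level "burst then pause" model of the local transfers.
KB = 1024         # assumption: "KB" in the text means 1024 bytes
MB = 1_000_000    # assumption: "MB/sec" means 10**6 bytes per second


def effective_rate(chunk_bytes, burst_rate, pause_s):
    """Average rate in bytes/s when a chunk is sent at burst_rate, then a pause follows."""
    return chunk_bytes / (chunk_bytes / burst_rate + pause_s)


# Level 1: 32 KB writes at 55.5 MB/sec, with a 0.4 ms pause after each write.
group_rate = effective_rate(32 * KB, 55.5 * MB, 0.4e-3)

# Level 2: 256 KB (pftp) or 1 MB (hsi) sequences, with a 2.3 ms pause after each.
pftp_rate = effective_rate(256 * KB, group_rate, 2.3e-3)
hsi_rate = effective_rate(1024 * KB, group_rate, 2.3e-3)

for name, rate in [("32 KB groups", group_rate), ("pftp", pftp_rate), ("hsi", hsi_rate)]:
    print(f"{name}: {rate / MB:.1f} MB/sec")
```

Under these assumptions the model gives roughly 33 MB/sec for the 32 KB groups, about 26 MB/sec for pftp and about 31 MB/sec for hsi. This is in the neighborhood of the measured 23 and 32 MB/sec, though not an exact match.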
Tests between wide area hosts
Tests between NERSC and ORNL show a consistent pattern of timeouts, which trigger the slow start algorithm in the TCP stack and lead to reduced performance. At the beginning of the transfer, we see a rapid startup of the TCP transfer. This startup behavior attempts to reach the capacity of the network quickly. It operates by sending two packets for every acknowledgement received from the receiving host, which causes the number of packets in flight to double every round trip time (RTT). (The RTT from NERSC to ORNL is about 60 ms.) As the data rate goes up, it reaches a point, a little over a second into the transfer, where a number of packets get dropped and the TCP stack times out and goes into slow start mode. This causes a delay of an additional 1 to 2 seconds, and even worse, the slow start algorithm takes up to 10 seconds to ramp the data rate back up to a reasonable value. Eventually the data rate gets high enough that the connection times out again, and the cycle repeats.
The data rate at which the timeouts occur varies widely. There is some suggestion that the amount of external traffic, which competes with the test traffic on ESnet, fluctuates enough to account for these variations. However, we have also identified another potential problem, related to buffer sizes in the switches and routers in the path. We believe it is likely that switches and routers suffer buffer overrun when data flows in over a Gigabit connection and flows out over a link of much lower bandwidth. This would cause them to drop packets at just the time we see the initial TCP timeouts.
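To make the doubling behavior concrete, the sketch below counts how many round trips an idealized sender needs before the data in flight covers the capacity of the path (bottleneck rate times RTT). The segment size, the initial window of one segment and the function name are assumptions for illustration. The model ignores delayed acknowledgements and queueing in the path, so its result should be read as a lower bound rather than as a prediction of the observed timing.

```python
# Sketch only: idealized window doubling during TCP startup.
RTT_S = 0.060              # from the text: NERSC to ORNL round trip time
BOTTLENECK_BPS = 12.5e6    # bytes/s, i.e. the 100 Mb/s link from the text
SEGMENT_BYTES = 1460       # assumption: typical Ethernet segment size, no jumbo packets
INITIAL_SEGMENTS = 1       # assumption: the window starts at one segment


def round_trips_to_fill(pipe_bytes, segment_bytes, initial_segments):
    """Count window doublings until the window covers pipe_bytes."""
    window = initial_segments
    rounds = 0
    while window * segment_bytes < pipe_bytes:
        window *= 2        # two packets sent per acknowledgement received
        rounds += 1
    return rounds


pipe_bytes = BOTTLENECK_BPS * RTT_S   # data needed in flight to fill the path
rounds = round_trips_to_fill(pipe_bytes, SEGMENT_BYTES, INITIAL_SEGMENTS)
print(f"pipe: {pipe_bytes / 1e6:.2f} MB, filled after {rounds} round trips "
      f"(~{rounds * RTT_S:.2f} s)")
```

With these numbers the path holds about 0.75 MB, and the idealized window covers it after 10 round trips, or about 0.6 seconds.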
Discussion
As speeds over wide area networks have increased, it has become necessary to keep more and more data "in flight" to fill the network "pipe" between origin and destination. As a result, more data is launched after a problem occurs but before the sending host is notified of it, and this has caused poor performance in the simpler flow control algorithms. There is considerable development in the network community to deal with this issue, but the bottom line is that it is very difficult to get consistently high performance in the face of network errors. The difficulty is compounded because flow control information is derived from lost or erroneous packets, which may indicate congestion but may also be due to actual network errors.
Hosts have increased their buffers to the MB range, but switches and routers have not always done so. In steady state, these buffers are sufficient, but during "surges" they are not. We have investigated the use of "pause frames" between hosts and the first switch or router encountered. These provide flow control at the media access (link) level, but they are not part of IP, so they do not pass through routers and are not effective against bottlenecks that are not immediately adjacent to the sending host.
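The surge problem can be sketched with a simple queue model: while a burst arrives at the input rate and drains at the slower output rate, the queue in the switch or router grows by the difference. The function names and the 128 KB device buffer used in the example are hypothetical values chosen for illustration. The actual buffer sizes of the devices in the path are not given here.

```python
# Sketch only: queue growth when a fast link feeds a slower one.
GIGABIT_BPS = 125e6    # bytes/s, a 1 Gb/s input link
OUTPUT_BPS = 12.5e6    # bytes/s, the 100 Mb/s link from the text


def peak_queue_bytes(burst_bytes, in_rate, out_rate):
    """Queue depth at the end of a burst arriving at in_rate and draining at out_rate."""
    if in_rate <= out_rate:
        return 0.0
    return burst_bytes * (1.0 - out_rate / in_rate)


def largest_safe_burst(buffer_bytes, in_rate, out_rate):
    """Largest burst that a buffer of buffer_bytes can absorb without dropping."""
    if in_rate <= out_rate:
        return float("inf")
    return buffer_bytes / (1.0 - out_rate / in_rate)


# A 1 MB burst from a host, as might be sent with MB-range host buffers.
print(f"queue after 1 MB burst: {peak_queue_bytes(1e6, GIGABIT_BPS, OUTPUT_BPS) / 1e6:.2f} MB")

# Hypothetical 128 KB device buffer (an assumption, not a measured value).
safe = largest_safe_burst(128 * 1024, GIGABIT_BPS, OUTPUT_BPS)
print(f"largest burst a 128 KB buffer absorbs: {safe / 1024:.0f} KB")
```

In this model, a device must hold 90 percent of any burst that crosses from Gigabit to 100 Mb/s, so a buffer much smaller than the host's send buffer will overflow during a surge.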
Recommendations
Given our observations of flow control behavior in our AIX TCP stacks, we recommend that the host buffers be set to 0.5 MB. This gives up some theoretical peak bandwidth over the network, but it tends to limit surges and reduce the number of timeouts that lead to "slow starts." In the behavior we actually observed, avoiding "slow starts" seems to be worth more than the higher possible peak bandwidth. Of course, this recommendation needs to be revisited whenever bandwidth restrictions are removed.
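The trade-off in this recommendation can be estimated from the bandwidth-delay product: a sender can have at most one buffer's worth of data in flight per round trip. The sketch below uses the 60 ms RTT and the 12.5 MB/s path limit quoted earlier. The function names and the reading of "MB" as 10^6 bytes are assumptions made here.

```python
# Sketch only: how a 0.5 MB host buffer caps throughput on a 60 ms path.
MB = 1_000_000         # assumption: "MB" means 10**6 bytes
RTT_S = 0.060          # NERSC to ORNL round trip time, from the text
LINK_BPS = 12.5 * MB   # path limit implied by the 100 Mb/s link


def bandwidth_delay_product(rate_bps, rtt_s):
    """Bytes that must be in flight to keep the path full."""
    return rate_bps * rtt_s


def window_limited_rate(window_bytes, rtt_s, link_bps):
    """Best-case rate with at most window_bytes in flight per round trip."""
    return min(window_bytes / rtt_s, link_bps)


bdp = bandwidth_delay_product(LINK_BPS, RTT_S)
rate = window_limited_rate(0.5 * MB, RTT_S, LINK_BPS)
print(f"buffer needed to fill the path: {bdp / MB:.2f} MB")
print(f"best case with a 0.5 MB buffer: {rate / MB:.1f} MB/s")
```

Under these assumptions, filling the path would take about 0.75 MB of buffer, and a 0.5 MB buffer limits the best case to about 8.3 MB/s, roughly two thirds of the 12.5 MB/s path limit.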
Future work
We need to finish work on "pause frames" to see if we can avoid overflowing the switches and routers that are adjacent to our hosts. We were able to get tcpdumps at the security monitoring port on the network, and these suggest that packets are being lost by the mechanism we describe. If this is confirmed, then "pause frames" should be able to resolve this problem. To date, however, we have not seen this effect when we tried turning pause frames on.