In this paper, the terms allocate and reserve are used to describe the conceptual assigning of memory to connections by the fair share algorithms. In the actual TCP Autotuning implementation, the fair share algorithms set upper limits of memory usage only. Actual memory allocation is independent of the algorithm and is performed on an as-needed basis.
In the first implementation of TCP Autotuning [SMM98], each connection determined its desired socket buffer size by multiplying its congestion window size appropriately. A pool of memory was made available for use in socket buffers for connections. Connections that desired less buffer space than the fair share reserved their desired amount of space. All the remaining memory in the pool was divided equally among the connections that desired more than the fair share. This equal share was defined as the new fair share, for use in the next periodic allocation of buffer space.
Each periodic allocation round involved three steps. The first step was to count the number of small connections, the number of large connections, and the sum of the buffer space desired by the small connections. The second step was to calculate the new fair share.
total pool - sum of buffer space of small conns
new fair share = -----------------------------------------------------
max(number of large conns, 1)
The third step was to allocate the min(new fair share, desired amount) of buffer space to each of the connections. The first step was implemented as one loop through all of the connections, while the third step was implemented as another (non-overlapping) loop.
The fair share would converge as long as each connection used an insignificant fraction of the memory resources. However, an oscillation could result when there are only a few connections, as described in the following example.
The problem of buffer space thrashing back and forth between a connection's reserved buffer space and the shared pool was the result of an implementation detail.Let C be a particular small connection. As C's desires grow to exceed the fair share, it becomes a large connection, donating the buffer space it had reserved into the shared pool. Now the larger pool is divided up among the large connections (of which there is one more now), and the amount each large connection receives is the new fair share.
Because the pool is larger than it was before C donated its buffer space into the pool, the new fair share may be larger too. If the new fair share is larger than C's desires, then C drops back to being a small connection in the next periodic allocation round, taking its desired buffer space out of the shared pool and keeping it for its own use. The smaller pool results in a smaller fair share, and an oscillation may begin.
Every connection gets up to 1/Nth of the unallocated pool, taking no more than it needs. If a connection needs less than it's 1/Nth portion, the remainder of the portion is left in the pool. After giving each connection a "slice", the remaining unallocated pool is divided up among the unsatisfied connections, continuing in this manner until the either the pool is empty or all connections are satisfied. More precisely:
For efficiency of processing, the threshold in the final step can be set to an arbitrary value to save processing time by sacrificing memory efficiency. (For instance, don't bother trying to allocate "slices" smaller than 512 bytes, since the additional buffer space won't make much difference, but allocating it will cost more iterations.)
- Let P = the total size of the pool, allocated and unallocated parts.
Let D(i) = the amount of buffer space desired by connection i.
Let B(i) = the amount of buffer space allotted to connection i so far (zero initially).- Let U be the set of all connections, i, for which B(i) < D(i).
Let N = the number of connections in U.
Let the current share = (P - sum(B(i)) for all i) / N- For all i in U,
- if D(i) < B(i) + current share then let B(i) = D(i)
- otherwise, increase B(i) by the current share
- Repeat from step 2 until U is empty, or the current share falls to a certain threshold (which may be zero).
This corrected implementation matches the algorithm described by Ma, et al. [MSZ96] more closely than the problematic implementation.
The three-phase allocation process used by the problematic implementation can result in the values measured in the first phase (first loop) being invalidated when the third phase is reached. This implementation uses the outdated fair share in the first phase when it counts the number of small and large connections and sums the desired buffer space of the small connections. These values are used to calculate the new fair share in the second phase. When allocations are made in the third phase (second loop), the number of small connections measured in the first phase may be wrong if the fair share has changed. The fair share then oscillates between two values.
The corrected implementation allocates slices throughout the process, rather than having explicit phases. All connections are treated the same, without basing calculations on how many connections have been classified as "large" or "small". A single loop handles the entire allocation round.
Following publication of the paper, the kernel monitoring tool was enhanced [SEM00], enabling the discovery of the oscillation of the first implementation. The author believes at this time that the lower performance of the few "auto" connections is likely to result from the oscillating fair share of the first implementation. Further testing is required to validate this belief, and is expected to be conducted as a by-product of upcoming projects.
[MSZ96] Qingming Ma, Peter Steenkiste, and Hui Zhang. "Routing
high-bandwidth traffic in max-min fair share networks."
ACM Sigcomm '96/Computer Communications Review, Volume 26,
August 1996.
[PSC98] "Automatic TCP Buffer Tuning Research" web page, Pittsburgh
Supercomputing Center, 1998. Available at:
http://www.psc.edu/networking/auto.html
[SEM00] J. Semke. "PSC TCP Kernel Monitor," Technical Report #CMU-PSC-TR-2000-0001,
Pittsburgh Supercomputing Center, Carnegie Mellon University, May 2000.
Available at: http://www.psc.edu/publications/tech_reports/pscmonitor/pscmonitor.html
[SMM98] J. Semke, J. Mahdavi, and M. Mathis. "Automatic TCP Buffer Tuning,"
ACM Sigcomm '98/Computer Communication Review, Volume 28,
Number 4, October 1998.