Last updated: June 2005
13-jun-2005 Eagle/UK14, scratcha1, scratcha2, 16 servers, 2-way stripe
08-jun-2005 Eagle/UK1, scratcha1, scratcha2, 16 servers, 2-way stripe
23-oct-2004 Eagle, local, single node, 3-way stripe
11-aug-2004 Eagle, local, single process, 3-way stripe
24-apr-2004 Eagle, scratcha2, 16 servers, 1-way stripe
06-apr-2004 Eagle, scratch1, scratchb1, scratcha2 (2-way stripe), and local
19-mar-2004 Eagle, scratcha2, 16 servers, 16-way stripe
16-mar-2004 Eagle, scratchb1 and scratchb2, kernel patches, new cables, reboots
03-mar-2004 Eagle, new $SCRATCH configurations, 2-way stripe all scratch fs
19-jan-2004 Kite/UK1, all scratch fs including final scratchB1 and scratchB2 configurations
14-jan-2004 Kite/UK1, all scratch fs including original scratchB configurations
Introduction
IOTest is a flexible I/O benchmarking tool used to test the performance of parallel filesystems.
It can be configured to write, read, or write and read an arbitrary number of 64-bit values with single or multiple write/read
operations.
Open, write/read, and close operations are timed. Minimums and maximums are reported. All times are in units of seconds.
Bandwidth is estimated by dividing the total bytes transferred by total time. Minimum and maximum rates are reported. All
rates are in millions of bytes per second. Note that the total times used to compute bandwidth are NOT the sum of minimum or
maximum open + write/read + close operations, but rather the minimum or maximum total times (i.e. a separate timer is used.)
An iteration count is supported, allowing the same test to be performed multiple times. Iterations are carried out internally
by the program, and a single set of timer outputs for the accumulated time and data size is generated.
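The timing scheme described above might be sketched for a single process as follows. This is not the IOTest implementation: the function name, the use of Python's array module for 64-bit values, and the choice to rewrite the same file on each iteration are all assumptions made for illustration.

```python
import os
import tempfile
import time
from array import array


def timed_write(path, nr, nc, niter):
    """Write nr 64-bit values nc times, repeated niter times.

    Returns (timers, nbytes), where timers holds accumulated seconds for
    open, write, close, and a separately measured total.
    """
    timers = {"open": 0.0, "write": 0.0, "close": 0.0, "total": 0.0}
    data = array("d", [0.0]) * nr  # nr values of 8 bytes each
    nbytes = 0
    for _iteration in range(niter):
        t_start = time.perf_counter()
        f = open(path, "wb")  # truncates the file on each iteration
        t_open = time.perf_counter()
        for _write in range(nc):
            f.write(data)
            nbytes += len(data) * data.itemsize
        t_write = time.perf_counter()
        f.close()  # flushes buffered data
        t_close = time.perf_counter()
        timers["open"] += t_open - t_start
        timers["write"] += t_write - t_open
        timers["close"] += t_close - t_write
        timers["total"] += t_close - t_start  # separate total timer
    return timers, nbytes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        target = os.path.join(tmpdir, "iotest.dat")
        timers, nbytes = timed_write(target, nr=1000, nc=4, niter=3)
        print(nbytes, timers)
```

Only one set of timers is returned regardless of the iteration count, mirroring the accumulated output described above.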
An option to run sweeps is also provided. Testing always begins with the node and process counts specified for the job. If the
sweep option is enabled, the original processes-per-node (ppn) count is saved, the current node count is halved (integer
division), the new process count is set to the product of the new node count and ppn, and the next test is initiated. This
pattern is repeated until the node count reaches 0.
Sweeps are carried out externally by the job script. Therefore, each step of a sweep generates its own set of timer outputs.
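The sweep pattern can be sketched as below. The actual sweeps are driven by the job script, which is not shown in this document; the Python function here is a hypothetical stand-in that only lists the (node count, process count) pairs a sweep would visit, assuming the process count is a multiple of the node count.

```python
def sweep_configs(nodes, procs):
    """List the (nodes, processes) pairs visited by a sweep."""
    ppn = procs // nodes          # save the original processes-per-node count
    configs = [(nodes, procs)]    # testing begins with the job's own counts
    nodes //= 2                   # halve the node count (integer division)
    while nodes > 0:              # stop once the node count reaches 0
        configs.append((nodes, nodes * ppn))
        nodes //= 2
    return configs


# Hypothetical job: 16 nodes with 2 processes per node.
print(sweep_configs(16, 32))
# [(16, 32), (8, 16), (4, 8), (2, 4), (1, 2)]
```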
Arguments to the IOTest program are as follows:
NR number of 64-bit values per write
NC number of writes
NITER number of iterations
IREAD 0 for write, 1 for read, 2 for write then read
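To make the relationship between these arguments and the data volume concrete, the sketch below computes the bytes moved by one process in one direction. The function name and the sample NR value are hypothetical; the sample was chosen only because it yields 200 MB, and the actual argument values used in the tests are not stated here.

```python
BYTES_PER_VALUE = 8  # each value is 64 bits

# Meaning of the IREAD argument, as listed above.
IREAD_MODES = {0: "write", 1: "read", 2: "write then read"}


def bytes_per_pass(nr, nc, niter=1):
    """Bytes moved by one process in one direction (write or read)."""
    return nr * nc * niter * BYTES_PER_VALUE


# Hypothetical arguments: 25 million values in a single write.
print(bytes_per_pass(25_000_000, 1))  # 200000000
```

With IREAD set to 2, the same volume would presumably be transferred twice, once in each direction.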
A job submission script is also provided. It supports the sweeps feature. Job configurations are noted for each test, e.g.
queue, filesystem, walltime, sweeps, node count, and process count.
Scratch Details
The initial configuration of the scratchB filesystems was tested between 12-jan-2004 and 14-jan-2004. The scratchB1 and
scratchB3 filesystems each used 8 servers of cluster B; scratchB2 was based on 16 servers. The servers underlying
these scratch spaces were overlapped, i.e. the eight servers associated with scratchB1 and the eight servers of scratchB3
comprised the 16 servers of scratchB2. The block size for scratchB1 was 1 MB, while scratchB2 and scratchB3 were
configured with a block size of 2 MB.
A second configuration was tested 19-jan-2004. The scratchB1 and scratchB3 filesystems had been merged. The new scratchB1
and scratchB2 filesystems were now both based on 16 servers (all of B cluster) with no overlap. The block size of both
systems was reduced to 256 KB (previously 1 MB and 2 MB respectively.)
For scratchB2, the smaller block size yielded marginal improvements in every category, suggesting that the change was
beneficial for this test case. Improvements for scratchB1 were more pronounced, as expected given that its server count had
doubled: write performance increased by about 50% and read performance approximately doubled.
With the Eagle upgrade, the server count for both $SCRATCH file systems (scratch1 and scratch2) was increased from 4 to 8.
As a result, performance improved across the board by roughly a factor of two. All of the scratch file systems
(scratch1, scratch2, scratchb1, and scratchb2) were configured to use 2-way striping (previously, stripe dimensions had been
equal to the server count.)
All of the scratch file systems were recabled between March 10 and 17. A couple of kernel patches were also applied. No
other configuration changes were made. Therefore, the performance problems observed for scratchb1 and scratchb2 in early
March were solved by some combination of the patches, reboots, and recabling work.
Comments
Since many programs wait until all processes have completed an I/O operation, data corresponding to the maximum times
(minimum rates) might be regarded as more significant.
Buffering effects are apparent in many cases. Variation between minimum and maximum times can be quite large.
The time required to open and close files at scale is surprisingly large. For the scratchB filesystems, worst case open+close
times for 2048 files approached 25% of the total test time in one case.
Cleanups of scratch space are also expensive at scale. Time required to remove 2048 files from the scratchB2 and scratchB3
filesystems exceeded the time remaining after the tests had completed (about 13 minutes.)
Informal observations by Systems Support staff for a case involving millions of files and a set of 4 "rm" processes operating
in parallel indicated a sustainable rate of about 3 deletions per second.
Files remaining upon expiration of the scratchB jobs suggest that around 1750 files were removed in about 13 minutes, which
translates to about 2.25 deletions per second. At this rate, the aforementioned cleanups would require about 15 minutes to
completely remove all 2048 test files. Needless to say, the cleanup step will be eliminated from future tests at scale.
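The deletion-rate arithmetic above can be checked with a short calculation. The function names are illustrative; the inputs (1750 files, 13 minutes, 2048 files) are the figures quoted in the text.

```python
def deletion_rate(files_removed, seconds):
    """Average deletions per second."""
    return files_removed / seconds


def cleanup_minutes(total_files, rate_per_second):
    """Minutes needed to remove total_files at the given rate."""
    return total_files / rate_per_second / 60.0


rate = deletion_rate(1750, 13 * 60)
print(f"{rate:.2f} deletions per second")              # 2.24 deletions per second
print(f"{cleanup_minutes(2048, rate):.1f} minutes")    # 15.2 minutes
```

Both results agree with the approximate figures quoted above (about 2.25 deletions per second and about 15 minutes).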
For the noted file size (200 MB per process), the best performance achieved by a scratch filesystem was about 1.5 GB/s for
reads (scratch7) and writes (scratchB2.) The decline in recent scratch7 measurements was likely because scratch7 was
operating near capacity at the time these tests were performed.