Nathan T.B. Stone, Doug Balog, Bryon Gill, John Kochmar, Rob Light, Paul Nowoczynski, J. Ray Scott, Jason Sommerfield, Chad Vizino
{stone, balog, bgill, kochmar, light, pauln, scott, jasons, vizino}@psc.edu
Pittsburgh Supercomputing Center 4400 Fifth Avenue Pittsburgh, PA 15213
"DMOVER" is a combination of three highly portable scripts and a network optimization specific to the Pittsburgh Supercomputing Center (PSC). DMOVER provides users with a queue-oriented means for initiating large, parallel, inter-site data transfers that is expected to be familiar to all mainstream high-performance computing users. Its flexibility enables users to effectively manage a parallel, load-balanced transfer of hundreds of files between dozens of file servers. It provided these highly desirable features at a time when they were not present in other standardized grid services and which remain elusive on some platforms. Performance measurements consistently indicate that file transfers remain disk-rate-limited and that the DMOVER strategy of maintaining parallel streams is a highly efficient means for aggregating available resources.
The Extensible Teragrid Facility 1 (ETF) is a massive collection of high-performance computing, networking and storage resources woven together by a software infrastructure designed to extend its functionality and ease of use to the current generation of scientific researchers. This substantial task has turned out an interconnected web of authentication & authorization, data transfer, job management and accounting software, which can still confuse even informed users. Many developers within the ETF community have rededicated themselves to eliminating this confusion, redesigning some interfaces or providing additional tools to ease the learning curve for users in the hopes that this will deliver the promises of grid computing to mainstream users. "DMOVER"2 is one such effort at the Pittsburgh Supercomputing Center (PSC) directed at facilitating large, inter-site, parallel data transfers.
Data migration is a recurring challenge for most ETF users. Despite the presence of a high-performance network and file transfer middleware, moving thousands or even hundreds of files across the ETF network can still be a slow or tedious process for mainstream users — those with a lower tolerance for instability or infrastructural complexity. Such users are perceived to be dependent upon resources and utilities delivering exceptional performance (in this case, aggregate bandwidth) exactly as advertised every time. This is an obvious challenge to the resource administrators and middleware and system software developers.
Furthermore, a recent survey of Hierarchical Storage Management (HSM) utilization at PSC revealed that over 93% of data stored in our HSM are transferred in "sessions" involving 10 or more files. In such sessions the average number of files was 378, with an average file size of 93 MB. So if we are to adequately address the problem of transferring real datasets it must be in a context of facilitating transfers with large file count and moderately large files.
Figure 1: Representative PSC host & network layout.
DMOVER2 is fundamentally an integration effort unifying three readily available resources on LeMieux.psc.edu3, PSC's 3000+ processor ETF computing resource. First, DMOVER uses the batch scheduling system to allocate both dedicated file servers and Application GateWay (AGW) nodes (see Section 3.4). Second, it uses the standardized grid file transfer tools (e.g. globus-url-copy) to transfer files resident within LeMieux's global parallel scratch file systems. And third, it uses a transparent communication library called "Qsockets" to redirect TCP-based traffic from the default Ethernet network to the Quadrics "QSNet"-based computational interconnect and to the ETF network via the AGW nodes (see Figure 1, above). DMOVER executes these transfers in parallel, managing a user-defined number of transfer streams until all designated files are transferred.
Our intent was also to create a "drop-in" replacement for alternative transfer methods — one that could perform as desired but require no modifications at other sites. DMOVER achieves this by layering portable scripts over low-level GridFTP tools. Using the batch scheduling system to initiate transfers also provides a means for scheduling network utilization.
At PSC, we were motivated by the expressed concerns of a particular user who had a clearly specified need: to transfer terabytes of data consisting of tens of thousands of moderately-sized files from PSC to the San Diego Supercomputing Center in a timely manner. (Transfer rates at the time, he complained, were typically of order 1-3 MB/s — more characteristic of a slow scp client than of any high-performance tool.) If we failed to establish an effective channel then this user would simply not compute at our site. So we were motivated to write our own solution both to raise the performance and to "lower the bar," making it likely that this user would use our solution in this case and perhaps others.
We first surveyed the grid middleware tools at our disposal, as follows:
If users are forced to write their own "helper scripts" or other tools in order to use our middleware they will simply not adopt it.
We resolved to build a lightweight parallel aggregation layer over globus-url-copy, to utilize its authentication scheme and tunable options. Our aggregator, now known as DMOVER, incorporated the following components at the PSC site:
Our solution requires a PSC user to do three things:
While some users might object to the scheduled nature of transfer initiation, preferring instead an interactive model, batch scheduling has long been recognized as a necessity for limited, shared resources. In the past the absence of network saturation may leave researchers with the impression that there is always enough bandwidth to spare on the ETF network. But with the advent of global WAN file systems and increasing demands of more sophisticated users, the network itself is not exempt from the competing demands of hundreds or thousands of distributed users — each with equally valid claims on the available bandwidth. We can exploit the scheduled nature of DMOVER to arbitrate these requests. Network engineers at PSC have confirmed the ability to manage ETF network bandwidth at the switch level with input from the scheduling system or other external agent, which could be an area of future research.
For each invocation, the scheduling system allocates the requisite number of AGW nodes for parallel, multi-stream file transfers. The DMOVER suite expands the source specification into a list of target files, builds the command lines, and launches multiple transfer streams in parallel, keeping a fixed number of streams running through the AGWs at all times until all transfers are completed. Below are details describing each of the components.
Figure 2: DMOVER architectural design
DMOVER consists of a trio of scripts: a batch script, a process manager and a transfer agent (command line wrapper), as shown in Figure 2, above. This solution required no modifications or additions to any other existing elements at any other site.
The first script in the trio is the DMOVER batch script (e.g., dmover.pbs). This is executed by submission to the local site's scheduler (e.g. OpenPBS), at which point the scheduler initiates the transfer. The advantage of this approach is that batch scripts are highly familiar to all HPC users. This eliminates the usage barrier of interface style. It provides the added advantage of scheduling file transfers, and thereby scheduling network utilization — a long-time goal of some grid infrastructures. Figure 3 below shows a sample job script that a user could queue to initiate DMOVER transfers from Pittsburgh to San Diego via four parallel streams.
Note first that the user specifies the number of streams for the transfer by setting both the number of nodes to use from the dedicated pool of file servers (rmsnodes=4:4) and the number of AGW nodes (agw_nodes=4). These directives are for the scheduling system, thus are prefixed with the "#PBS -l" resource specification. Per-stream performance is optimal when using one process per node (e.g., "rmsnodes=4:4") as compared to sending multiple streams from the same file server (e.g., "rmsnodes=1:4", representing a request for 1 node on which 4 processes will run). Furthermore, performance is best when the number of AGW nodes reserved matches the number of transfer streams. (See Section 5 below for a more thorough discussion of number of streams and performance.)
#PBS -l rmsnodes=4:4 #PBS -l agw_nodes=4 # root of the file(s)/directory(s) to transfer export SrcDirRoot=$SCRATCH/mydata/ # path to the target sources, relative to SrcDirRoot export SrcRelPath="*.dat" # destination host name export DestHost=tg-c001.sdsc.teragrid.org, tg-c002.sdsc.teragrid.org,tg-c003.sdsc.teragrid.org, tg-c004.sdsc.teragrid.org # root of the file(s)/directory(s) at the other side export DestDirRoot=/gpfs/ux123456/mydata/ # run the process manager /scratcha1/dmover/dmover_process_manager.pl "$RMS_NODES"
Figure 3: Sample DMOVER job script (dmover.pbs).
The "SrcDirRoot" specifier identifies the absolute root path to the files to be transferred. This is a convenience that may or may not help the user in specifying "SrcRelPath". The "SrcRelPath" variable specifies the files and/or directories to be transferred relative to "SrcDirRoot" by name or by wildcard. In the case where there is a single directory tree to be transferred this is a straightforward process. But the value of SrcRelPath is expanded in the DMOVER process manager script by UNIX glob, which may in general match a large number of files and/or directory trees. The net result is that the process manager converts this information into a list of file targets to be transferred.
The destination is specified as a hostname string. In general that string can be a comma-separated list of valid hostnames. This was preferable over using some DNS-based host aliases because earlier versions of globus-url-copy would abort transfers in the case where the gethostbyname() invocation returned a server that was down as the first entry in the IP list. By allowing users to specify only servers that are up and available, we enabled them to avoid known potential problems if they so choose.
This understated feature is of extreme relevance to bridging the gap between "early-adopters" and "mainstream users". There is often a disconnect between the tools used by these respective groups — one arcane version for early adopters and one "simple" version for mainstream users. The latter is often "simpler" because it is completely different, and many of the control interfaces are removed or obscured.
In this case, we provide one single field for specifying a destination host. It can either be something "simple", like a single DNS-based host alias ( e.g., tg-gridftp.psc.teragrid.org), or it could be a list of selected server names ( e.g., tg-c001.sdsc.teragrid.org, tg-c002.sdsc.teragrid.org,...…). But in either case the vehicle is the same, and the user can specify as little or as much detail as desired. This strategy effectively bridges the complexity gap by allowing either case, growing with the users as they grow their experience.
The next script level, the DMOVER process manager, will use a round-robin scheme to utilize all of the available hosts in turn for each successive transfer, thus achieving a basic level of load-balanced parallelism.
As noted above, the process manager (dmover_process_manager.pl) parses and expands the values of the source and destination variables the user specifies in the batch script, building a list of specific transfers from these values. It then spawns an independent transfer for each case, launching a fixed number of transfers in parallel. As earlier transfers finish the process manager selects another transfer from the list of remaining targets and starts it. Although the process manager itself runs on one of the OpenPBS "mom" nodes (currently configured as the LeMieux login nodes) the transfers are initiated from the dedicated file servers allocated to the DMOVER queue.
This mode of launching many parallel transfers simultaneously is precisely what many users have been missing. This is contrasted with striped transfer, which is typically only advantageous for small numbers of very large files. If all other transfer characteristics are equal, parallel streams will generally out-perform striped streams for cases where the number of files is much larger than the number of servers. And, as noted in the introduction, the average number of files transferred in a typical "session" is an order of magnitude greater than the number of file servers available at any site.
Each of the parallel transfer streams is a single execution of the DMOVER transfer agent (dmover_transfer.sh). The transfer agent executes the grid transfer client of choice (currently globus-url-copy) on one of the LeMieux file servers. It uses the source and destination arguments supplied by the process manager and supplies these and other proper defaults to the file transfer client. But before it executes the Globus transfer it modifies the LD preload path (_RLD_LIST) to allow the otherwise "normal" TCP socket operations of the transfer client to be redirected over Qsockets, described in detail below. Using this technique we have benchmarked single-stream, memory-only transfers across the ETF using globus-url-copy at roughly 1 Gb/s, or the theoretical maximum per-stream bandwidth for the hosts available.
The encapsulation of the final execution command line in the transfer agent script also provides a place for advanced users to optimize their transfer. For example users could select an alternative grid file transfer client (e.g., gsincftp) or customize their command line parameters to suit their needs by either modifying the provided script or substituting their own in the process manager.
"LeMieux" is the largest compute resource on the ETF today. Its architecture leverages the high speed QSNet interconnect between nodes, but does not inherently provide for large bandwidths to remote hosts. Instead of connecting each compute node to the ETF network directly, our design team introduced the concept of Application GateWay (AGW) nodes.
AGW nodes have network interfaces both on the ETF network (2 x 1 GigE NICs per node) and on the PSC QSNet network (1 QSNet NIC per node). The two-to-one matching between GigE and QSNet NICs led to a natural division of each AGW node into two "virtual" nodes, one serving each GigE interface. Each virtual node is equipped with a Qsockets "Qserver" that routes traffic from the PSC QSNet network (e.g., from LeMieux) to the ETF in a seamless manner. This configuration is so successful that an AGW node has no difficulty filling each of its GigE pipes from the QSNet, as described below. On the LeMieux side, Qsockets provides a client library that intercepts TCP-related system functions. In the case of setup/tear-down or ioctl functions, Qsockets merely passes these to the AGWs as requests over a QSNet-based RPC protocol. In the case of send operations the data is transferred over the QSNet to the AGW and from there routed to the ETF (and the converse for receive operations). In this manner processes running on LeMieux compute nodes can behave programmatically just as though they were directly connected to the ETF. This is all achieved without modifying any source or compiled code in the target application.
The efficiency and effectiveness of this strategy were proven during a live demonstration at the SC'04 Conference14 in November of 2004. In preparation for SC'04, PSC established a team whose purpose was to design an end-to-end science demonstration suitable for the SC'04 "Bandwidth Challenge" event. The strategy of this team was to use a real scientific application that was instrumented to write its checkpoint data to a remote file server over a TCP connection. That much of the application had been created years prior as part of a simple checkpoint-recovery demonstration. For the Bandwidth Challenge, this application was redirected to transmit data not merely over Ethernet but over a combination of the QSNet and a custom 40 Gb/s link to the SC'04 show floor without changing any networking code in the application, but merely by setting the LD preload path as described above. Using this configuration that team sustained a write bandwidth of over 31.1 Gb/s end-to-end: from 32 LeMieux compute nodes, over 16 QSNet / dual GigE-connected AGW nodes, to 7 10GigE-connected file servers on the SC'04 show floor. This clearly demonstrates the high performance (high aggregate bandwidth) and low overhead (31.1 Gb/s over 32 GigE links on only 16 nodes) of the Qserver routing protocol on the AGW nodes.
The queue and process elements described above are completely portable. Every super-computing site has a scheduling system that can facilitate remote execution of multiple transmission streams on specialized nodes. Furthermore, the DMOVER scripts are written in Perl and Bourne shell. The only non-portable aspect of this system is the AGW/Qsockets layer, which is a custom interconnect feature. Other sites have addressed their networking needs differently, for example by connecting compute nodes directly to the WAN. So the AGW/Qsockets layer does not require portability. If the DMOVER scripts were run at other sites mainstream users could simply comment out the AGW/Qsockets-related lines and run like at PSC.
Whether users would want to import DMOVER to a non-PSC site remains a valid question. Without DMOVER, the only parallelism available to mainstream users is the number of GridFTP servers that have been configured by the site administrators, who certainly want to configure access to their file systems in an optimal manner. With DMOVER, users have only to queue a DMOVER transfer job to a valid execution queue, and they can utilize as many parallel streams as are available on that compute resource. This constitutes a perfectly reasonable utilization of HPC resources, and a mainstream user's compute allocation. >
Below (Figures 4a and 4b) are performance measurements recorded for DMOVER transfers involving 32 files of size 2 GB (blue diamonds), 100 files of size 200 MB (red squares) and 300 files of size 100 MB (green triangles) from Pittsburgh to San Diego, using their respective parallel file systems, /scratchb2 and /gpfs. These represent two reasonable test cases and a typical user storage scenario (as noted in Section 1.1), respectively. The transfers were performed varying the number of streams, and the bandwidths are reported as a function of the number of streams. Figure 4a is a plot of the bandwidth aggregated over all parallel streams and Figure 4b is a plot of the average per-stream bandwidth. At the time of testing the compute resource, LeMieux was fully loaded. Thus, the values represent a typical case for mainstream user activity. Indeed, these results are consistent with lower values observed in other more optimal circumstances.
First, we observe the trends in the data. In all cases, we observe that the per-stream bandwidth for larger files is greater than that for smaller files. Figure 4a shows that our 2GB test case achieves roughly double the bandwidth of the 100MB "average user" case. This exposes the extent to which the protocol overhead requires longer transfers to amortize the per-stream setup cost. We further observe in Figure 4b that the per-stream bandwidth decreases with increasing number of streams. Thus, although increasing the number of parallel streams does reduce aggregate transfer times, this is at odds with the downward pressure on the scalability of the parallel transfers.
The aggregate bandwidth measurements are encouraging. Even for the highest number of streams shown, we do not appear to have reached a plateau or a point of diminishing returns for increasing stream count, despite the loss in efficiency. So we clearly demonstrate that users can easily achieve several hundred MB/sec aggregate with DMOVER by selecting the appropriate number of streams for their file size.
We continue by analyzing the host and network configuration underlying these results. At the sending side, the number of hosts and associated GigE interfaces was matched to the number of streams, up to a maximum of 16. At the receiving side, the maximum was 12. Since the highest per-stream bandwidths observed up to 12 streams represent only 20-30% of the available network bandwidth to each host, the limitation is clearly not in the network. Furthermore, since (at least for cases with 12 or fewer streams) the transfer clients at each end are running on separate hosts, we know that the processes cannot be adversely impacting one another. And yet, the observed per-stream bandwidth clearly decreases with increasing stream count, even below 12. We conclude that the predominant bottleneck is in the parallel file systems at one or each end. To isolate this further (e.g., identify the limitations at each end) we would have to use the same transfer tool (globus-url-copy) to run device-only copies (e.g., from /dev/zero to /dev/null ). This feature is not yet supported by globus-url-copy for file:// type transfers. Device-only transfers between servers, however, have been benchmarked in excess of 110 MB/sec — wire speed for GigE-connected hosts.
Various groups have created ongoing ETF network and GridFTP diagnostics that continually monitor the performance of the network. One such effort is the PSC "SpeedPage"15. This tool has measured point-to-point single-stream bandwidths in excess of 90 MB/sec for some sites and file systems. This further supports the conclusion that investment in high-performance parallel file systems will make the greatest difference in inter-site GridFTP transfers.
While the SC'04 Bandwidth Challenge well established the efficiency and effectiveness of Qsockets for routing high throughput traffic from legacy applications, this is not sufficient to demonstrate its usefulness in the context of grid computing. The ultimate measure of the success of grid computing is the number of users and applications that use it to succeed in tasks that they could not have otherwise accomplished. The original incentive to create the DMOVER framework was the expressed need of an experienced user who wanted to migrate large amounts of data (of order Terabytes) in large numbers of files (of order of tens of thousands) from PSC to San Diego. Furthermore, he wanted to do this with regularity — not merely once or as a proof of concept. This motivated us to create an infrastructure to achieve substantial throughput from end-to-end between file servers at opposite ends of the country using Globus tools.
After using DMOVER, our target user observed that this tool got him "past the Globus roadblock". He transferred Terabytes of data from PSC to San Diego at roughly 200 MB/s aggregate. The parallelization model of DMOVER was precisely what enabled this user to achieve this level of aggregate performance. Since deployment our records indicate that over a half-dozen other users, excluding developers, have utilized this service.
DMOVER is a tool for initiating load-balanced parallel file transfers between sites on the ETF network. It achieves scalable aggregation of parallel streams thereby achieving a high effective bandwidth for large file-count transfers. Furthermore, this file-wise mode of parallelism is expected to be most applicable to typical user scenarios within the ETF community. DMOVER is an invaluable bridge between the raw grid services and the needs of mainstream users. Although the Globus ToolKit (GTK) incorporates similar functionality in RFT9, even that utility is limited by administrative configurations (having sufficient GridFTP servers running wherever your data is staged) and code portability issues (GTK is not available on all platforms). If there are few or no servers running in such a location that they can effectively serve the file system of interest, mainstream users would be forced to resort to some other file transport system like DMOVER. Fortunately, with the portability and simplicity of DMOVER, this should remain a viable option for mainstream users for the foreseeable future.
Exploration of the scheduled nature of file transfers, and thereby of the ETF network, is another area of potential work. Every expanding project and new idea in the minds of computational scientists brings us closer to the limits of our finite resources. Projects like DMOVER may also provide the means for scheduling data transport and network utilization for mainstream users.