The Data Supercell User Guide
- System Overview
- Applying For A Storage Allocation and Other Policies
- Transferring Files
- Improving Your File Transfer Performance
- Sharing Files
System Overview
To support the file repository needs of its users, PSC has deployed a file archival system named the Data Supercell. The Data Supercell is a complex disk-based storage system with an initial capacity of 4 Petabytes, but its architecture will enable it to scale well beyond this initial deployment. The Data Supercell storage system is managed by the SLASH2 file system software, developed at PSC. This document explains how to store and access your data on the Data Supercell.
Because the Data Supercell is entirely disk-based, it can support data applications well beyond those of traditional tape-based storage resources, which were designed for data retrieval in support of batch processing. Its disk-based design offers very high transfer rates for your data and makes the system very flexible: it can accommodate datasets of different sizes, including extremely large datasets, and its interface and access methods can be customized to meet your needs. The Data Supercell has also been constructed to be highly reliable and highly secure.
The Data Supercell is well suited to data storage needs well beyond those met by traditional archivers. For more information about the advantages of the Data Supercell over conventional archival systems, please see the Data Supercell features page. If you want to discuss whether the Data Supercell could meet your data storage needs, or what special arrangements we could make for your data storage application, send email to email@example.com.
The Data Supercell is strictly a file storage system. It is not used for computing, nor will you directly open files that reside on it from a running program. Moreover, you will not log in to the Data Supercell to perform file transfers. You will access your archived files indirectly using remote file transfer software and services on other systems.
When necessary, downtime for system maintenance will be taken Wednesdays between 8:30 am and 5:00 pm Eastern time.
Users will be notified the day before a scheduled outage. If you find that a scheduled downtime will be disruptive to your work, please contact firstname.lastname@example.org with as much lead time as possible and we will try to accommodate your needs.
Applying For A Storage Allocation and Other Policies
If you are an XSEDE user, the storage allocation policies described immediately below apply to you. If you are a non-XSEDE, non-commercial user, similar policies apply to you. For precise information on the policies that apply to you, send email to email@example.com. If you are a commercial user, send email to firstname.lastname@example.org for information on the storage allocation policies that apply to you.
If you are an XSEDE user, when you apply for an XSEDE computing allocation, you must also apply for a Data Supercell allocation. You cannot get a Data Supercell allocation without also applying for a computing allocation on one of PSC's production resources. Your storage allocation is closely tied to your associated grant. Each file you create while using your grant is assigned to that grant. Every one of your files must be assigned to one of your active grants. Three months after your grant expires, all files on the Data Supercell that have been assigned to that grant will be deleted. This deletion process will occur even if you have other active grants. Thus, when you are creating files you must ensure that they have been assigned to the proper grant, so that files you want to retain are not deleted during this three-month purging process. To apply for an allocation of space on the Data Supercell you will use the XSEDE User Portal, which is the same mechanism you use to apply for computing allocations.
When you apply for a storage allocation, as part of your XSEDE grant application process you request an amount of Data Supercell space. This amount is the quota for the files you can store under this grant on the Data Supercell. You cannot exceed this quota. However, there is no maximum file size and no maximum number of files you can store. The maximum amount of space you can request is 40 Tbytes. If you need more than 40 Tbytes, you must make special arrangements by contacting email@example.com.
Transferring Files
A variety of transfer methods are available to copy files to and from the Data Supercell. Our recommended method of file transfer is Globus Online, which is discussed below. If you are unable to use Globus Online or any of the other Globus file transfer methods, we recommend that you use sftp or scp.
PSC maintains a Web page that lists average data transfer rates between all XSEDE resources, including the Data Supercell. For example, you can find the average data transfer rate between blacklight and the Data Supercell in this table. If your transfer rates are significantly below these average rates or you believe that your file transfer performance is subpar, send email to firstname.lastname@example.org and we will examine methods of improving your file transfer performance.
Your home directory on the Data Supercell is /arc/users/joeuser, where joeuser is your PSC userid. You will usually need to know this path when transferring files to and from the Data Supercell.
If you are going to store a file that is 2 Tbytes or larger or you intend to store more than 500 Gbytes in a single day, send email to remarks so that special arrangements can be made to handle your large file transfers.
Globus Online is our recommended method of transferring data to and from the Data Supercell. To use Globus Online you must first create a Globus Online userid and password at the Globus Online Web site. Once you have logged in to Globus Online you can initiate your file transfer. For each transfer you must select two Globus Online endpoints, to which you must then authenticate. The endpoint to use for the Data Supercell is xsede#pscdata. If you are an XSEDE user, you can use your XSEDE User Portal userid and password to authenticate to this endpoint. When connecting to the xsede#pscdata endpoint on Globus Online you may be redirected to the XSEDE OAuth page to enter your XSEDE User Portal username and password for authentication. After authentication, you will automatically be returned to the Globus Online site to initiate your transfers.
If you are affiliated with an InCommon organization you can use your userid and password for that organization to authenticate to endpoint psc#dsc-cilogon if you have previously registered with PSC as an InCommon user. For information on how to register with PSC as an InCommon user or for further information about InCommon access send email to email@example.com.
If you are unable to use either the XSEDE or InCommon methods of authentication to Globus Online send email to firstname.lastname@example.org to see if you can use other methods of authentication to Globus Online.
If you do not enter a path for the endpoints xsede#pscdata or psc#dsc-cilogon your destination will be your Data Supercell home directory. You can enter a path if you want a different destination on the Data Supercell.
You can also use Globus Online to transfer files between your blacklight brashear space and the Data Supercell. You would use xsede#pscdata for both endpoints, since xsede#pscdata can also be used to point to blacklight. To make your second endpoint point to your brashear files you must enter as your Globus Online path for this endpoint the complete path to your brashear directory. Otherwise, Globus Online will use your Data Supercell home directory as your path in the transfer. For example, suppose you want to transfer brashear file largematrix.dat to your Data Supercell home directory. You would enter xsede#pscdata for both endpoints. For the path for the endpoint you are using to point to blacklight you must enter your path as /brashear/joeuser/largematrix.dat. For the other path, the path pointing to the Data Supercell, you can enter just largematrix.dat, since you are storing the file in your home directory on the Data Supercell. Then you can initiate the transfer by clicking the appropriate arrow button.
Sftp and scp
You can transfer files between your local systems and the Data Supercell using the SSH file transfer clients sftp and scp. When using sftp and scp to transfer files to and from the Data Supercell you do not connect directly to the Data Supercell. Instead, you transfer your files through a PSC high-speed data conduit named data.psc.xsede.org. If you are not connecting to the data conduit from an XSEDE host you must use the name data.psc.edu for the data conduit. If you have a graphical sftp or scp client application on your local system, you can use it to connect and authenticate to data.psc.xsede.org and transfer files accordingly. Use your PSC userid and password for authentication.
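The choice between the two conduit names can be captured in a small shell helper. The following is only a sketch: the function names psc_data_host and print_scp_upload are hypothetical (they are not PSC tools), and the script prints the scp command for review instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: pick the data conduit name described above.
# Pass "yes" when you are on an XSEDE host, anything else otherwise.
psc_data_host() {
    if [ "$1" = "yes" ]; then
        echo "data.psc.xsede.org"
    else
        echo "data.psc.edu"
    fi
}

# Hypothetical helper: print (do not run) an scp upload command.
# Arguments: on-XSEDE flag, PSC userid, local file, remote path.
print_scp_upload() {
    on_xsede=$1; userid=$2; local_file=$3; remote_path=$4
    echo "scp ${local_file} ${userid}@$(psc_data_host "$on_xsede"):${remote_path}"
}

# Example: joeuser uploading from a non-XSEDE workstation.
print_scp_upload no joeuser file1.dat newdata/file1.dat
```

Once the printed command looks right, it can be pasted into your shell; you will then be prompted for your PSC password.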
If you need to (re)set your PSC password, you can do so via the kpasswd command on any PSC production system, or using the http://apr.psc.edu/ Web form.
You can use the command-line sftp client to transfer files to and from the Data Supercell interactively. When using sftp from the command line, you first connect and authenticate to data.psc.xsede.org, and then issue commands at the sftp> prompt to transfer and manage files:
$ sftp email@example.com
Here, joeuser is your PSC userid. The first time you connect to data.psc.xsede.org using sftp or scp, you may be prompted to accept the server's host key. Enter yes to accept the host key:
The authenticity of host 'data.psc.xsede.org (22.214.171.124)' can't be established.
RSA key fingerprint is d5:77:f2:d9:07:f6:32:b6:c3:eb:0d:d1:29:ed:9b:80.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'data.psc.xsede.org' (RSA) to the list of known hosts.
You will then be prompted to enter your PSC password:
firstname.lastname@example.org's password:
Connected to data.psc.xsede.org.
sftp>
At the sftp> prompt, you can then enter sftp commands (e.g., get) to manage and transfer your files to/from the Data Supercell. Enter a question mark for a list of available commands. In the examples that follow, joeuser is the user's PSC userid, and entered commands appear after the sftp> prompt:
- Where am I?
sftp> pwd
Remote working directory: /arc/users/joeuser
- Where am I on my local system?
sftp> lpwd
Local working directory: /Users/JoeUser/Documents
- Change directories on my local system to /usr/local/projects:
sftp> lcd /usr/local/projects
- Make a new directory called "newdata" under my current directory on the Data Supercell:
sftp> mkdir newdata
- Copy a file (file1.dat) from my current local directory to my newdata subdirectory on the Data Supercell:
sftp> put file1.dat newdata/file1.dat
Uploading file1.dat to /arc/users/joeuser/newdata/file1.dat
file1.dat 100% 1016KB 1.0MB/s 00:00
- Copy a file on the Data Supercell (/arc/users/joeuser/file1) to /usr/local/projects/newfile1 on my local system:
sftp> get /arc/users/joeuser/file1 /usr/local/projects/newfile1
Fetching /arc/users/joeuser/file1 to /usr/local/projects/newfile1
/arc/users/joeuser/file1 100% 31 0.0KB/s 00:00
- Exit from this sftp session:
sftp> bye
For scripted transfers, or transfers that you want to execute directly from your command-line shell, you can use the SSH scp client. In the examples that follow, joeuser is the user's PSC userid, entered commands appear after the $ prompt, and the user enters their PSC password when prompted:
- Copy my local file (/usr/local/projects/file1.dat) to my home directory on the Data Supercell:
$ scp /usr/local/projects/file1.dat email@example.com:.
firstname.lastname@example.org's password:
file1.dat 100% 1016KB 1.0MB/s 00:00
- Copy the contents of my newdata directory on the Data Supercell to /tmp on my local system (creating /tmp/newdata and copying all the files from newdata into it):
$ scp -r email@example.com:newdata /tmp
firstname.lastname@example.org's password:
file2.dat 100% 1016KB 1.0MB/s 00:00
file3.dat 100% 1016KB 1.0MB/s 00:01
file1.dat 100% 1016KB 1.0MB/s 00:00
You can also use the rsync command to keep the contents of a local directory synchronized with a directory on the Data Supercell. A sample rsync command is
rsync -rltpDvP -e 'ssh -l joeuser' source_directory data.psc.xsede.org:target_directory
where 'joeuser' is your PSC userid and 'source_directory' is the name of your local directory and 'target_directory' is the name of the directory on the Data Supercell. If you are not issuing the command on an XSEDE host you must use data.psc.edu as the address in your rsync command. We recommend the rsync command options -rltpDvP. See the rsync man page for information on these options and on other rsync options you might want to use.
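As a quick reference, the sketch below spells out what each letter in -rltpDvP means and shows one way to preview a synchronization before running it. The function name build_rsync_cmd is hypothetical, and the script only prints the rsync command; the added -n flag is rsync's standard dry-run option.

```shell
#!/bin/sh
# Meaning of the recommended options -rltpDvP (per the rsync man page):
#   -r  recurse into directories
#   -l  copy symlinks as symlinks
#   -t  preserve modification times
#   -p  preserve permissions
#   -D  preserve device and special files
#   -v  verbose output
#   -P  same as --partial --progress

# Hypothetical helper: print (do not run) an rsync command.
# Arguments: PSC userid, conduit host, source dir, target dir, optional "dry-run".
build_rsync_cmd() {
    userid=$1; host=$2; src=$3; dst=$4; mode=$5
    opts="-rltpDvP"
    # -n (--dry-run) lists what would be transferred without copying anything.
    if [ "$mode" = "dry-run" ]; then
        opts="${opts}n"
    fi
    echo "rsync ${opts} -e 'ssh -l ${userid}' ${src} ${host}:${dst}"
}

# Example: preview a synchronization from a non-XSEDE host.
build_rsync_cmd joeuser data.psc.edu source_directory target_directory dry-run
```

Running the printed command with -n first, and then again without it, is a low-risk way to confirm that the source and target directories are the ones you intended.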
We have several other recommendations for the use of rsync. First, we recommend that you install the HPN-SSH patches to improve the performance of rsync. These patches are available online.
If you install the HPN-SSH patches you can use the ssh options -oNoneSwitch=yes -oNoneEnabled=yes in the -e option of your rsync command for faster data transfers. With these options your authentication is encrypted but your data transfer is not. If you want encrypted data transfers you should not use these options.
Finally, whether or not you install the HPN-SSH patches, we recommend the -oMACS option that appears in the example below. If you use this option your transfer will use a faster data validation algorithm.
A convenient way to use these options is to define a variable whose value is the options you want to use. An example command to do this is
setenv SSH_OPTS '-oMACSemail@example.com -oNoneSwitch=yes -oNoneEnabled=yes'
You can then issue your rsync command as
rsync -rltpDvP -e "ssh -l joeuser $SSH_OPTS" source_directory data.psc.xsede.org:target_directory
The above recommendations, excluding the recommendation for the rsync options, are also appropriate for sftp and scp. You should apply the HPN-SSH patches and use the above ssh options if suitable. If you define an SSH_OPTS variable which has the value of your ssh options you can issue your sftp command as
sftp $SSH_OPTS firstname.lastname@example.org
and your scp command as
scp $SSH_OPTS local_file email@example.com:remote_file
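The setenv command shown earlier is csh syntax. For users of Bourne-style shells (sh, bash), a minimal sketch of the equivalent follows. It includes only the two -oNone options, which assume an HPN-patched ssh, and it uses the placeholder userid joeuser. The variable name RSYNC_RSH_CMD is hypothetical and is used here only to show how quoting affects expansion.

```shell
#!/bin/sh
# Bourne-shell counterpart of the csh "setenv SSH_OPTS ..." line.
# Assumption: only the two HPN-SSH options are listed; add others as needed.
SSH_OPTS='-oNoneSwitch=yes -oNoneEnabled=yes'
export SSH_OPTS

# Double quotes let the shell expand $SSH_OPTS inside a single argument.
# Single quotes would pass the literal text $SSH_OPTS along instead.
RSYNC_RSH_CMD="ssh -l joeuser $SSH_OPTS"

# Show the remote-shell string that would be handed to rsync's -e option.
echo "$RSYNC_RSH_CMD"
```

The echoed string is what you would place after -e in an rsync command; for sftp and scp the variable can be used unquoted so that each option becomes a separate argument.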
XSEDE users may use GridFTP clients to transfer files to and from the Data Supercell.
To use the command-line client globus-url-copy on an XSEDE-system login host (e.g. blacklight), first ensure that you have a current user proxy certificate for authentication with enough time on it to complete your transfer, e.g.:
joeuser@tg-login1:~> grid-proxy-info
subject : /C=US/O=National Center for Supercomputing Applications/CN=Joe User
issuer : /C=US/O=National Center for Supercomputing Applications/OU=Certificate Authorities/CN=MyProxy
identity : /C=US/O=National Center for Supercomputing Applications/CN=Joe User
type : end entity credential
strength : 2048 bits
path : /tmp/x509up_u99999
timeleft : 11:58:33
If the timeleft is not sufficient, or you get an "ERROR: Couldn't find a valid proxy" message, then use myproxy-logon (or, if you have your own long-term user certificate, grid-proxy-init) to obtain a new user proxy certificate, e.g.:
joeuser@tg-login1:~> myproxy-logon -l joexsedeuser -t 24
Enter MyProxy pass phrase:
A credential has been received for user joexsedeuser in /tmp/x509up_u99999.
Here, joexsedeuser is your XSEDE User Portal login name, -t 24 requests a 24-hour certificate, and the MyProxy pass phrase entered is your XSEDE User Portal password.
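Deciding whether the remaining proxy lifetime is "sufficient" comes down to simple arithmetic on the timeleft value. The sketch below is illustrative only: the function names are hypothetical, and it assumes timeleft is in the HH:MM:SS form shown in the grid-proxy-info output above.

```shell
#!/bin/sh
# Hypothetical helper: convert an HH:MM:SS timeleft value to seconds.
# awk is used so that fields such as "08" are read as decimal numbers.
timeleft_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: succeed if the proxy has at least the given
# number of hours left. Arguments: timeleft (HH:MM:SS), hours needed.
proxy_is_sufficient() {
    left=$(timeleft_to_seconds "$1")
    needed=$(( $2 * 3600 ))
    [ "$left" -ge "$needed" ]
}

# Example: a transfer expected to take up to 12 hours, with 11:58:33 left.
if proxy_is_sufficient "11:58:33" 12; then
    echo "proxy ok"
else
    echo "renew with myproxy-logon"
fi
```

With the sample value above, the script prints "renew with myproxy-logon", since 11:58:33 is just short of 12 hours.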
You can then use globus-url-copy to transfer files to/from the Data Supercell using the GridFTP server address gsiftp://gridftp.psc.xsede.org. This transfer will go through the PSC high-speed data conduit data.psc.xsede.org.
gsiftp:// URLs are absolute paths to files. This means that when referring to a file or directory in your Data Supercell home directory, you must either use the full absolute path to it (which includes joeuser, your PSC userid) or use gsiftp://gridftp.psc.xsede.org/~/ to refer to your home directory.
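The two ways of forming a gsiftp:// URL can be sketched as a pair of small shell functions. The function names are hypothetical, and the absolute-path example assumes the /arc/users/joeuser home directory that appears in the sftp sessions in this guide.

```shell
#!/bin/sh
# GridFTP server address given in this guide.
GRIDFTP_SERVER="gsiftp://gridftp.psc.xsede.org"

# Hypothetical helper: URL for a path relative to your home directory.
# The ~ sits inside double quotes, so the local shell does not expand it.
gsiftp_home_url() {
    echo "${GRIDFTP_SERVER}/~/$1"
}

# Hypothetical helper: URL for an absolute path (must start with /).
gsiftp_abs_url() {
    echo "${GRIDFTP_SERVER}$1"
}

gsiftp_home_url newdata/testfile
gsiftp_abs_url /arc/users/joeuser/newdata/testfile
```

Both printed URLs refer to the same file when /arc/users/joeuser is your home directory; the /~/ form has the advantage of not hard-coding your userid.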
- List the files in my home directory on the Data Supercell:
joeuser@tg-login1:~> globus-url-copy -list gsiftp://gridftp.psc.xsede.org/~/
gsiftp://gridftp.psc.xsede.org/~/
file1.dat
file2.dat
file3.dat
newdata/
olddata/
- Transfer a file (testfile) from my scratch space on TACC Lonestar to my newdata directory on the Data Supercell:
joeuser@tg-login1:~> globus-url-copy -stripe -tcp-bs 32M \
Here, -stripe and -tcp-bs 32M are used to improve transfer performance, and /scratch/99999/tg987654 is the scratch directory on Lonestar at TACC.
Gsisftp and gsiscp
If you have a current user proxy certificate you can also use gsisftp or gsiscp to transfer files to and from the Data Supercell. The method of obtaining a valid user proxy certificate is described above in the discussion of the globus-url-copy command. The default directory for both gsisftp and gsiscp is your Data Supercell home directory.
A sample gsisftp transfer session is
gsisftp data.psc.xsede.org
sftp> pwd
Remote working directory: /arc/users/joeuser
sftp> get file1.dat localfile1.dat
Fetching /arc/users/joeuser/file1.dat to /usr/local/projects/localfile1.dat
/arc/users/joeuser/file1.dat 100% 31 0.0KB/s 00:00
sftp> bye
A sample gsiscp command is
gsiscp data.psc.xsede.org:file1.dat localfile1.dat
file1.dat 100% 1016KB 101.0 MB/s 00:00
To transfer files between the Data Supercell and a PSC production system, such as blacklight, you can also use PSC's far program. Far is available on all PSC production platforms. In addition to file transfers, the far program can be used for file and directory management, such as getting a list of your files on the Data Supercell. See the far documentation for more information.
We recommend that you execute far commands outside of your batch compute job scripts so that your jobs do not tie up compute processors and you do not expend your computing allocation transferring files.
Improving Your File Transfer Performance
File transfer performance between your local systems and data.psc.edu can be significantly improved by ensuring that your local systems' networking parameters are optimized. Guidance is available at PSC's Enabling High Performance Data Transfers webpage.
For improved performance when using SSH clients (sftp, scp), we recommend using an SSH package that includes PSC's High Performance Networking (HPN) patches, e.g., GSI-OpenSSH. For instructions to build OpenSSH with PSC's HPN patches, consult the PSC High Performance SSH/SCP - HPN-SSH webpage.
Sharing Files
The Data Supercell is a Unix file system. Thus, to share files on the Data Supercell you must use Unix file protections. If you and the users you want to share with are in the same Unix group on the Data Supercell, you can use Unix group protections to share files. Otherwise, you must use Unix world protections, which give access to your files to any user who can access the Data Supercell. To set your Data Supercell file protections you can use the far program or sftp. Some file transfer clients, such as WinSCP, will also allow you to set file protections on remote systems.
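To make the difference between group and world protections concrete, the sketch below applies both to a scratch file on your local system and prints the resulting permission strings. It is only an illustration of Unix modes: it does not touch the Data Supercell, and the variable names are hypothetical. Within an sftp session, the analogous step is the sftp chmod command (for example, chmod 640 file1.dat at the sftp> prompt).

```shell
#!/bin/sh
# Create a scratch file to experiment on (local system only).
demo=$(mktemp)

# Group sharing: owner read/write, group read, no access for others.
chmod 640 "$demo"
mode_group=$(ls -l "$demo" | cut -c1-10)

# World sharing: owner read/write, everyone else read.
chmod 644 "$demo"
mode_world=$(ls -l "$demo" | cut -c1-10)

echo "group-readable: $mode_group"
echo "world-readable: $mode_world"

# Clean up the scratch file.
rm -f "$demo"
```

Mode 640 is the narrower choice when everyone who needs the file is in your Unix group; 644 exposes the file to every user who can reach the system.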
PSC requests that a copy of any publication (preprint or reprint) resulting from research done on blacklight be sent to the PSC Allocations Coordinator. In addition, if your research was funded by the NSF you should log your publications at the XSEDE Portal. We also request that you include an acknowledgement of PSC in your publication.
To get assistance on using the Data Supercell or to report a problem using the Data Supercell you have three options.
- You can send email to firstname.lastname@example.org, mentioning PSC in the subject line.
- You can send email to email@example.com.
- You can contact the PSC User Services Hotline at 412-268-6350 from 9:00 a.m. until 5:00 p.m., Eastern time, Monday through Friday.