*Patent pending

The Data Supercell

PSC's archival system

System Overview

To support the file repository needs of its users, PSC has deployed a file archival system named the Data Supercell*. The Data Supercell is a complex disk-based storage system with an initial capacity of 4 Petabytes, but its architecture will enable it to scale well beyond this initial deployment. This document explains how to store and access your data on the Data Supercell.

Because the Data Supercell is a completely disk-based system, it can support data resource applications well beyond those of traditional tape-based storage resources, which were designed primarily for data retrieval in support of batch processing. Its disk-based architecture offers very high transfer rates for your data and considerable flexibility: it can accommodate datasets of widely different sizes, including extremely large datasets, and its interface and access methods can be customized to meet your needs. The Data Supercell has also been constructed to be highly reliable and highly secure.

The Data Supercell is well-suited for data storage needs well beyond those met by traditional archivers. For more information about its advantages over conventional archival systems, please see the Data Supercell page. If you want to discuss whether the Data Supercell could meet your data storage needs, or what special arrangements we could make for your data storage application, send email to remarks@psc.edu.

The Data Supercell is strictly a file storage system. It is not used for computing, nor will a running program directly open files that reside on it. Moreover, you will not log in to the Data Supercell to perform file transfers. Instead, you will access your archived files indirectly, using file transfer software and services on other PSC systems.

Transferring Files

Your home directory on the Data Supercell is /arc/users/joeuser, where joeuser is your PSC userid. A variety of file transfer methods are supported to copy files to and from the Data Supercell.

Large Transfers Note: If you are going to store a file that is 2 Terabytes or larger, or if you intend to store more than 500 Gigabytes in a single day, please send email in advance to remarks@psc.edu so that special arrangements can be made to handle your large file transfers.

Far

To transfer files between the Data Supercell and a PSC production system (e.g., Blacklight), you can use PSC's File ARchiver command-line client program, far. far is available on all PSC production platforms. In addition to file transfers, the far program can also be used for file and directory management, such as getting a list of your files on the Data Supercell. See the far documentation for more details.
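
As an illustration only, a far session on a PSC production system might look like the following; the subcommand names and arguments shown here are assumed, so consult the far documentation for the exact syntax before use:

far ls
far store results.tar newdata/results.tar
far get newdata/results.tar results.tar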

Note: We recommend that you execute far commands outside of your batch compute job scripts so that your jobs do not tie up compute processors and expend your computing allocation while your files are being transferred to and from the Data Supercell.

Sftp and scp

You can transfer files between your local systems and the Data Supercell using the SSH file transfer clients sftp and scp. You do not connect directly to the Data Supercell; instead, you transfer your files through a PSC high-speed data conduit named data.psc.xsede.org. If you are not connecting to the data conduit from an XSEDE host, you must use the name data.psc.edu. If you have a graphical sftp or scp client application on your local system, you can use it to connect and authenticate to data.psc.xsede.org and transfer files accordingly. Use your PSC userid and password for authentication.

If you need to (re)set your PSC password, you can do so via the kpasswd command on any PSC production system, or by using the Web form at http://apr.psc.edu/.
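
For example, to change your password from a login session on a PSC production system (the Kerberos realm shown in the prompt is illustrative, and the exact prompts may vary slightly):

$ kpasswd
Password for joeuser@PSC.EDU:
Enter new password:
Enter it again:
Password changed.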

You can use the command-line sftp client to transfer files to and from the Data Supercell interactively. When using sftp from the command line, you first connect and authenticate to data.psc.xsede.org, and then issue commands at the sftp> prompt to transfer and manage files:

$ sftp joeuser@data.psc.xsede.org

where joeuser is your PSC userid. The first time you connect to data.psc.xsede.org using sftp or scp, you may be prompted to accept the server's host key. Enter yes to accept the host key:

The authenticity of host 'data.psc.xsede.org (128.182.70.103)' can't be established.
RSA key fingerprint is d5:77:f2:d9:07:f6:32:b6:c3:eb:0d:d1:29:ed:9b:80.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'data.psc.xsede.org' (RSA) to the list of known hosts.

You will then be prompted to enter your PSC password:

joeuser@data.psc.xsede.org's password:
Connected to data.psc.xsede.org.
sftp>

At the sftp> prompt, you can then enter sftp commands (pwd, ls, put, get, etc.) to manage and transfer your files to/from the Data Supercell. Enter a question mark for a list of available sftp commands.

Examples (where joeuser is the user's PSC userid, and entered commands appear in bold):

  • Where am I?
    sftp> pwd
    Remote working directory: /arc/users/joeuser
  • Where am I on my local system?
    sftp> lpwd
    Local working directory: /Users/JoeUser/Documents
  • Change directories on my local system to /usr/local/projects:
    sftp> lcd /usr/local/projects
  • Make a new directory called "newdata" under my current directory on the Data Supercell:
    sftp> mkdir newdata
  • Copy a file (file1.dat) from my current local directory to my newdata subdirectory on the Data Supercell:
    sftp> put file1.dat newdata/file1.dat
    Uploading file1.dat to /arc/users/joeuser/newdata/file1.dat
    file1.dat          100% 1016KB   1.0MB/s   00:00
  • Copy a file on the Data Supercell (/arc/users/joeuser/file1) to /usr/local/projects/newfile1 on my local system:
    sftp> get /arc/users/joeuser/file1 /usr/local/projects/newfile1
    Fetching /arc/users/joeuser/file1 to /usr/local/projects/newfile1
    /arc/users/joeuser/file1          100%   31     0.0KB/s   00:00
  • Exit from this sftp session:
    sftp> exit

    or

    sftp> bye
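
For unattended transfers, you can also run sftp in batch mode with the -b option, reading its commands from a file instead of typing them at the sftp> prompt. Note that batch mode expects non-interactive authentication (for example, an SSH key), so it may not suit you if you authenticate with a password. A minimal sketch, where the batch file name and its contents are illustrative:

$ cat upload.batch
lcd /usr/local/projects
put file1.dat newdata/file1.dat
quit
$ sftp -b upload.batch joeuser@data.psc.xsede.org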

For scripted transfers, or transfers that you want to execute directly from your command-line shell, you can use the SSH scp client:

Examples (where joeuser is the user's PSC userid, entered commands appear in bold, and the user enters their PSC password when prompted):

  • Copy my local file (/usr/local/projects/file1.dat) to my home directory on the Data Supercell:
    $ scp /usr/local/projects/file1.dat joeuser@data.psc.xsede.org:.
    joeuser@data.psc.xsede.org's password: 
    file1.dat          100% 1016KB   1.0MB/s   00:00  
    
  • Copy the contents of my newdata directory on the Data Supercell to /tmp on my local system (creating /tmp/newdata and copying all the files from newdata):
    $ scp -r joeuser@data.psc.xsede.org:newdata /tmp
    joeuser@data.psc.xsede.org's password: 
    file2.dat          100% 1016KB   1.0MB/s   00:00
    file3.dat          100% 1016KB   1.0MB/s   00:01
    file1.dat          100% 1016KB   1.0MB/s   00:00
    

Rsync

You can also use the rsync command to keep the contents of a local directory synchronized with a directory on the Data Supercell. A sample rsync command is

rsync -rltpDvP -e 'ssh -l joeuser' source_directory data.psc.xsede.org:target_directory

where 'joeuser' is your PSC userid, 'source_directory' is the name of your local directory, and 'target_directory' is the name of the directory on the Data Supercell. If you are not issuing the command on an XSEDE host, you must use data.psc.edu as the address in your rsync command. We recommend the rsync command options -rltpDvP. See the rsync man page for information on these options and on other rsync options you might want to use.
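
For example, issued from a local workstation that is not an XSEDE host, a command of this form synchronizes a local directory named results into a directory of the same name under your Data Supercell home directory (the directory names are illustrative):

rsync -rltpDvP -e 'ssh -l joeuser' results data.psc.edu:results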

We have several other recommendations for the use of rsync. First, we recommend that you install the HPN-SSH patches to improve the performance of rsync. These patches are available online.

If you install the HPN-SSH patches, you can use the ssh options

-oNoneSwitch=yes
-oNoneEnabled=yes

in the -e option of your rsync command for faster data transfers. With these options your authentication is encrypted but your data transfer is not. If you want encrypted data transfers, you should not use them.

Finally, whether or not you install the HPN-SSH patches, we recommend the option

-oMACS=umac-64@openssh.com

If you use this option your transfer will use a faster data validation algorithm.

A convenient way to use these options is to define a shell variable whose value is the options you want to use. An example command (for csh/tcsh) to do this is

setenv SSH_OPTS '-oMACS=umac-64@openssh.com -oNoneSwitch=yes -oNoneEnabled=yes'
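
If your login shell is bash or another Bourne-style shell rather than csh/tcsh, the equivalent command is

export SSH_OPTS='-oMACS=umac-64@openssh.com -oNoneSwitch=yes -oNoneEnabled=yes'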

You can then issue your rsync command as (note the double quotes, which allow the shell to expand $SSH_OPTS)

rsync -rltpDvP -e "ssh -l joeuser $SSH_OPTS" source_directory data.psc.xsede.org:target_directory

The above recommendations, excluding the recommendation for the rsync options, are also appropriate for sftp and scp. You should apply the HPN-SSH patches and use the above ssh options if suitable. If you define an SSH_OPTS variable whose value is your ssh options, you can issue your sftp command as

sftp $SSH_OPTS joeuser@data.psc.xsede.org

and your scp command as

scp $SSH_OPTS local_file joeuser@data.psc.xsede.org:remote_file

Globus-url-copy

XSEDE users may use GridFTP clients to transfer files to and from the Data Supercell.

To use the command-line client globus-url-copy on an XSEDE-system login host (e.g. Blacklight), first ensure that you have a current user proxy certificate for authentication with enough time on it to complete your transfer, e.g.:

joeuser@tg-login1:~> grid-proxy-info
subject  : /C=US/O=National Center for Supercomputing Applications/CN=Joe User
issuer   : /C=US/O=National Center for Supercomputing Applications/OU=Certificate Authorities/CN=MyProxy
identity : /C=US/O=National Center for Supercomputing Applications/CN=Joe User
type     : end entity credential
strength : 2048 bits
path     : /tmp/x509up_u99999
timeleft : 11:58:33

If the timeleft is not sufficient, or you get an "ERROR: Couldn't find a valid proxy" message, then use myproxy-logon (or if you have your own long term user certificate, grid-proxy-init) to obtain a new user proxy certificate, e.g.:

joeuser@tg-login1:~> myproxy-logon -l joexsedeuser -t 24
Enter MyProxy pass phrase:
A credential has been received for user joexsedeuser in /tmp/x509up_u99999.

where joexsedeuser is your XSEDE User Portal login name, -t 24 requests a 24-hour certificate, and the MyProxy pass phrase entered is your XSEDE User Portal password.

You can then use globus-url-copy to transfer files to/from the Data Supercell using the GridFTP server address gsiftp://gridftp.psc.xsede.org. This transfer will go through the PSC high-speed data conduit data.psc.xsede.org.

Note: The gsiftp:// URLs are absolute paths to files. This means that when referring to a file or directory in your Data Supercell home directory, you must use either gsiftp://gridftp.psc.xsede.org/arc/users/joeuser/ (where joeuser is your PSC userid) or gsiftp://gridftp.psc.xsede.org/~/ to refer to your home directory.

Examples:

  • List the files in my home directory on the Data Supercell:
    joeuser@tg-login1:~> globus-url-copy -list gsiftp://gridftp.psc.xsede.org/~/
    gsiftp://gridftp.psc.xsede.org/~/
    file1.dat
    file2.dat
    file3.dat
    newdata/
    olddata/
  • Transfer a file (testfile) from my scratch space on TACC Lonestar to my newdata directory on the Data Supercell:
    joeuser@tg-login1:~> globus-url-copy -stripe -tcp-bs 32M \ 
    gsiftp://gridftp.lonestar.tacc.xsede.org/scratch/99999/tg987654/testfile gsiftp://gridftp.psc.xsede.org/~/newdata/

    where -stripe and -tcp-bs 32M are used to improve transfer performance, and /scratch/99999/tg987654 is the scratch directory on Lonestar at TACC.
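
  • Copy a file from the local filesystem of the login host to my newdata directory on the Data Supercell (the local path shown is illustrative; a file:// URL refers to a file on the host where globus-url-copy is run):
    joeuser@tg-login1:~> globus-url-copy file:///home/joeuser/file1.dat \
    gsiftp://gridftp.psc.xsede.org/~/newdata/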

Gsiftp and gsiscp

If you have a current user proxy certificate you can also use gsiftp or gsiscp to transfer files to and from the Data Supercell. The method of obtaining a valid user proxy certificate is described above in the discussion of the globus-url-copy command. The default directory for both gsiftp and gsiscp is your Data Supercell home directory.

A sample gsiftp transfer session is

gsiftp data.psc.xsede.org
sftp> pwd
Remote working directory: /arc/users/joeuser
sftp> get file1.dat localfile1.dat
Fetching /arc/users/joeuser/file1.dat to /usr/local/projects/localfile1.dat
/arc/users/joeuser/file1.dat   100%   31    0.0KB/s    00:00
sftp> bye

A sample gsiscp command is

gsiscp data.psc.xsede.org:file1.dat localfile1.dat
file1.dat    100%   1016KB    101.0 MB/s      00:00

Globus Online

Globus Online users can access the Data Supercell at the endpoint xsede#pscdata. When you connect to this endpoint on Globus Online, you may be redirected to the XSEDE OAuth page, where you authenticate with your XSEDE User Portal username and password. After authentication, you will automatically be returned to the Globus Online site to initiate your transfers.

If you do not enter a path for the xsede#pscdata endpoint your destination will be your Data Supercell home directory. You can enter a path if you want a different destination on the Data Supercell.

You can also use Globus Online to transfer your Blacklight brashear files to and from the Data Supercell. To do so, select xsede#pscdata as both endpoints and enter the path to your brashear directory for the endpoint that should point to your brashear files. Otherwise, Globus Online will use your Data Supercell home directory for both sides of the transfer.

Improving Your File Transfer Performance

File transfer performance between your local systems and data.psc.edu can be significantly improved by ensuring that your local systems' networking parameters are optimized. Guidance is available at PSC's Enabling High Performance Data Transfers webpage.

For improved performance when using SSH (sftp or scp), we recommend using an SSH package that includes PSC's High Performance Networking (HPN) patches, e.g., GSI-OpenSSH. For instructions to build OpenSSH with PSC's HPN patches, consult the PSC High Performance SSH/SCP - HPN-SSH webpage.

Last Updated on Wednesday, 24 October 2012 15:22  
