PROFILE-SS User manual
Optimal Profile-Sequence Alignment Program
Authors: Alexander J. Ropelewski, Dr. Hugh B. Nicholas and Dr. Michael Gribskov.
Copyright (C) 1992-1998 Pittsburgh Supercomputing Center. Portions of this code are Copyright (C) 1987 by Dr Michael Gribskov (used with permission).
The PROFILE-SS program was developed under the National Institutes of Health NCRR grant 1 P41 RR06009 and enhanced under NIH NCRR grant 2 P41 RR06009
This document refers to PSC-PROFILE-SS V1.8.
- Introduction to PROFILE-SS
- Implementation on Architectures
- General program information
- Program commands
- References
- Sample program input
- Sample Cray NQS job
- Sample profile
- Nucleotides
- Amino acids
Introduction
PROFILE-SS is a program that combines the optimal local sequence alignment algorithm developed by Michael Watterman and Mark Eggert (For detailed information on the algorithm and its extensions refer to The Journal of Molecular Biology (1987) 197, PP 723-728) and the Profile analysis methods of Michael Gribskov (For more information, refer to Methods in Enzymology (1990) 183, PP 146-159 ) into one program. The resultant program will find the N-best alignments between a database sequence and a profile.What is Profile Analysis?
Profile analysis is the process of using information obtained from a family of related sequences to locate other related sequences that are members of this family. Using profiles allows the user to perform a more sensitive database search than would be performed with a single sequence vs the database.
The first step in the profile analysis process is to locate a number of related sequences. This can be done by performing database searching with the MAXSEGS, FASTA, BLAST, SP or other programs using a single query sequence.
Once a few related sequences are located, they are then used to construct a multiple sequence alignment. This muliple sequence alignment can be created by programs such as the MSA program, the GCG PILEUP program, The Clustal program or by virtually any other program or method that is capable of producing multiple sequence alignments. For classical profiles, this multiple sequence alignment should then be placed in the GCG MSF file format and then run through the PROFILEMAKE program. The PROFILEMAKE program will create a profile that can be used in conjunction with the PROFILE-SS program to probe the databases for all related sequences. Profile-ss can also utilize profiles created by other methods, but the profiles may have to be reformatted into the GCG profile format.
Implementation on Architectures
There are two different implementations of the code availiable to users. The vector implementation is designed and optimized to run on vector supercomputers. The standard implementation is the vector code with Cray specific extensions removed from the code. The standard implementation compiles and runs correctly on most ASCII machines with ANSI FORTRAN-90 compilers, (including machines running UNIX and VMS. (Originally the code written in FORTRAN-77, but now it makes uses of FORTRAN-90 extensions for dynamic arrays. Compiling under FORTRAN-77 is possible, but static array sizes must be used.) A C compiler is also required. Please note that the standard version is not optimized for any scalar machine. The current code can be obtained at no cost for academic use. For more information on aquiring source code for your home site, contact the Pittsburgh Supercomputing Center's National Resource for Biomedical Supercomputing at biomed@psc.edu or http://www.nrbsc.org.
General program information
The program is keyword driven rather than menu-driven. The main reason for this is that the program was designed to be used in batch modes on various computers. It is much easier to read and write an input file containing descriptive keywords than to make certain that the right input is on the correct line.
Program input consists of a keyword followed by zero or more keyword parameters. No individual line of input can exceed 132 characters in width. Input lines take the form:
<COMMAND><SEPARATOR><PARAM_1><SEPARATOR> ... <SEPARATOR><PARAM_N>
Where <COMMAND> is the descriptive keyword
<PARAM_N> is a command parameter or sub-parameter
<SEPARATOR> is either the <SPACE>, <EQUALS> or <COMMA>
characters (All separators are treated the same;
it does not matter which separator you use when
one is required.)
In this manual, the space separator is used between the command and its parameters, the equals separator is used to indicate that the next parameter is a subparameter of the previous parameter, and the comma separator is used between independant parameters.
The programs signal that they are waiting for an input line by displaying a the program name as the prompt.
Refer to the on-line manual pages or help file for the command needed to run any of the program installed at your site.
Program commands
The following are brief descriptions of the commands available in the programs:
Program Commands
The following are brief descriptions of the commands available in the program:
- Alphabet allows the
user to select one of several pre-defined alphabets or to define a
custom alphabet. An alphabet must be defined for the program to work
correctly.
- Echo toggles
command line echoing (useful in batch mode). Echoing is initially
turned off.
- Help displays helpful
information for the someone who entered the program accidentally. This
keyword does not access an internal help facility.
- Limit allows the user
to limit the amount of output produced. (It is used to define how
many alignments are to be presented for inspection.)
- Match executes the
alignment procedure once all required parameters are set.
- Quit "END-OF-INPUT"
keyword used to exit from the program, and to terminate the custom
alphabet defining mode.
- Score defines the
scoring method used to obtain the alignment(s).
- Sequence defines the
input files where sequences are to be read and aligned. Also used to
define an output file containing the alignments.
- Title defines a
descriptive label written on the program's output.
- Translate Allows a
user to specify DNA to protein translation.
- Width allows the user to
change the programs alignment width (useful if you are using a 132
column printer or terminal.)
- Zz No operation. Allows the user to comment the input file.
Generally, the order that the commands are entered into the program are unimportant. However there are a few exceptions. The "ALPHABET" command must preceed the "SCORE" command. Both "ALPHABET" and "SCORE" commands must preceed the "SHOW" and "MATCH" commands.
The HELP command
The HELP command is intended to provide someone who has entered the program accidentally with enough information to allow them to exit gracefully. The help command DOES NOT access a built-in help facility.
Usage: "HELP"
The SCORE command: How to select a GAP and NEWGAP penalty
The SCORE command is used to select THE GAP (gap extension) and NEWGAP (Gap open) coefficients. These values are maximum values. They are multiplied by the percentages columns found in the profile.
To select the GAP and NEWGAP coeficients two parameters are used:
- GAP=X
- assign this score (multiplied by the profiles gap extension percentage) when a gap is extended in length.
- NEWGAP=X
- assign this score (multiplied by the profiles gap open percentage) in addition to the gap score (multiplied by the profiles gap extension percentage) when a gap is extended in length.
Note: The values associated with these parameters should be entered as REAL (floating point) numbers.
Usage: "SCORE GAP=<REAL>,NEWGAP=<REAL>
The ECHO command
The ECHO command togels command line echoing. This command is most useful when the program is running in batch mode.
Usage: "ECHO"
The TRANSLATE Command: Perform DNA to Protein Translation
The translate command can be used to convert a DNA library into a protein library in any of the six possible reading frames. Reading frames are specified by a six digit integer number. The first three digits specify the forward frames, while the last three specify the reverse, complemented frames:
XXXXXX
||||||__Third reverse complemented frame
|||||
|||||___Second reverse complemented frame
||||
||||____First reverse complemented frame
|||
|||_____Third forward frame
||
||______Second forward frame
|
|_______First forward frame
The number 1 is used to indicate that the frame should be translated, while a 0 is used to indicate that a frame should not be translated. For example:
111000 - translate the forward frames one, two, and three.
100000 - translate only forward frame one.
000100 - translate only reverse, complemented frame one.
111111 - translate all frames.
Usage: "TRANSLATE 111000"
The TITLE command: Labeling your output
The TITLE command allows the user to write a title to the output stream. This command is used primarily for the users identification purposes. To label your output simply enter the TITLE command followed by the label. The label should be less than 125 characters. Spaces are kept, and the title does not have to be in quotes.
Usage: "TITLE <descriptive_title>"
The ALPHABET command: How to select the proper alphabet
The ALPHABET command is used to set the sequence alphabet to either one of several default alphabets or a user defined "set" alphabet. You must choose an alphabet before you issue the "MATCH" command. Default alphabets that are available are:
- PROTEIN
- Alphabet suitable for protein sequences (23 characters)
- NUCLEIC
- Alphabet suitable for nucleic acid sequences (5 characters)
- AMBIGUOUS
- Alphabet suitable for nucleic acid sequences (15 characters)
Default alphabets
The protein alphabet is defined as: (where the notation "{ }" is used as set notation)
{A,a}, {B,b}, {C,c}, {D,d}, {E,e}, {F,f}, {G,g},
{H,h}, {I,i}, {L,l}, {M,m}, {N,n}, {P,p}, {Q,q},
{R,r}, {S,s}, {T,t}, {V,v}, {W,w}, {X,x}, {Y,y},
{Z,z}
This notation means that every letter between the "{}" has the same meaning. For example, {A,a} means that the lowercase "a" and the uppercase "A" are treated as equivalent letters.)
The nucleic alphabet is defined as:
{A,a}, {C,c}, {G,g}, {N,X,n,x}, {T,U,t,u}
The ambiguous alphabet is defined as:
{A,a}, {B,b}, {C,c}, {D,d}, {G,g}, {H,h}, {K,k},
{M,m}, {N,X,n,x}, {R,r}, {S,s}, {T,U,t,u}, {V,v},
{W,w}, {Y,y}
User defined alphabets
In addition to being able to select among several default alphabets, the user can also define a custom alphabet. The user can define several letters which appear in the sequence data as equivalent and represented in the output as a single letter. If the user enters the ALPHABET command without a parameter, he or she must specify the alphabet followed by the keyword QUIT. For example, if one would want to define a purine/pyrimidine alphabet R,Y over the letters A,G,C,T,U one would enter the following:
PROFILE-SS> ALPHABET
Please enter the alphabet (form A=B,C,D).
Use only one "=" per line, enter QUIT as the only
word on the line to end the alphabet-entering mode.
R=A,G
Y=C,T,U
QUIT
PROFILE-SS>
The letters A, and G are now all equivalent to "R". T, U, and C are now equivalent to "Y".
The following is an example of how to define a neutral-polar (P), neutral-nonpolar (N), acidic (A), and basic (B) alphabet.
PROFILE-SS> ALPHABET
Please enter the alphabet (form A=B,C,D).
Use only one "=" per line, enter QUIT as the only
word on the line to end the alphabet-entering mode.
P=S,T,Y,W,N,Q,C
N=G,A,V,I,L,F,P,M
A=D,E
B=K,R,H
QUIT
PROFILE-SS>
USAGE: "ALPHABET <pre-defined-name>
"ALPHABET"
The WIDTH command
The WIDTH command is used to change the output width. By default, the sequence alignments output width is set to 80 characters. The WIDTH command can be used to set the output width between 40 and 132 characters. To change the output width, enter the WIDTH command followed by the output width.
USAGE:"WIDTH=<INTEGER>"
The LIMIT command: Limiting the amount of program output
The LIMIT command can be used to limit the number of sequence alignments displayed. It can also be used to set limits for normalization. The LIMIT command takes up to seven parameters:
- CUTOFF=X
- Causes the program to display only alignments that have a similarity score of "X" or greater (RESULT file).
- ZCUTOFF=X
- Causes the program to display only sequence descriptions and scores that have an adjusted Zscore of "X" or greater (LIST and SCORE file).
- NUMBER=X
- Causes the program not to display any more than "Y" best alignments for any pair of sequences, regardless of how high the additional best sub-alignments scored. This parameter does not limit the total number of alignments in the output to "Y" alignments.
- MINSEQ=X
- Causes the program to search through only library sequences that have a length of "X" or greater. (Default=50)
- MINNORM=X
- Causes the program to use only library sequences that have a length of "X" or greater when computing normalization. (Default=50)
- PTCUT=X
- Scores more than "X" standard deviations away from their length pool are rejected during the calculations of the averages. (Default 5.0)
- POOLCUT=X
- Pools that are more than "X" standard deviations away from the expected score are rejected from the normalizations (Default 5.0)
Note: If you use any limit parameter, alignments will be printed until any of the above limits is reached.
Usage: "LIMIT CUTOFF=<REAL>,NUMBER=<INTEGER>,ZCUTOFF=<REAL>,
MINSEQ=<INTEGER>,MINNORM=<INTEGER>,PTCUT=<REAL>,
POOLCUT=<REAL>"
The MATCH command: Finding the optimal alignments
The MATCH command is used only after an alphabet and a scoring routine have been defined. The MATCH command causes the program to compute the similarity scores for the given sequences and to display the alignments (up to the limits imposed by the LIMIT command) The MATCH command has three distinctive modes: One allowing the user to select individual library sequences to be compared with the profile sequence (PICK mode), one allowing all the sequences to be compared with the profile sequence, but not allowing their alignments to be produced (SEARCHonly mode), and one allowing all the sequences to be compared with the profile sequence, and permitting their alignments to be produced (ALL mode).
The following parameters control the selection of the match mode:
- ALL
- Parameter used if the user wishes to compare the query profile with all of the library sequences and produce their alignments. This is the default mode.
- SEARCH
- Parameter used to indicate that the user wants to compare the query profile with all the library sequences but the resulting alignments are not desired. (ie SEARCHONLY)
- LIBRARY=<INDEX>
- Select the sequence in the library file refered to by <INDEX>. The index is either the Locus name (for a GenBank sequence), Sequence identifier (for an EMBL or SWISS-PROT sequence) or the sequence name (for NBRF sequences). This parameter is required to use the pick mode.
The following parameters are meaningful for all modes:
- AVERAGE
- Indicates that scores should be adjusted for sequence composition (default)
- NOAVERAGE
- Indicates that scores should be NOT be adjusted for sequence composition
- LOCAL
- Align sequences with the profile using the local alignment algorithm of Waterman and Eggert1987, J. Mol. Biol. 197: 723-728.
- GLOBAL
- Align sequences with the profile using a quasi-global sequence alignment algorithm similar to Sellers, Proc. Natl. Acad. Sci. USA.
The following parameters are meaningful only for the SEARCH and ALL modes:
- NORMAL
- Indicates that scores should be normalized (default)
- NONORMAL
- Indicates that scores should NOT be normalized
The following parameters are useful only for the PICK mode:
- SCORE=Y
- The actual score produced from the comparison with the individual library sequence and the profile. Normally a user would NEVER use this parameter, because the score of the query profile and the individual library sequence is unknown. This parameter IS useful however when one has created a SCORES file and simply wants to use produce alignments that score above a certain threshold.
- ZSCORE=Y
- The actual Zscore produced from the comparison with the individual library sequence and the profile. Normally a user would NEVER use this parameter, because the score of the query profile and the individual library sequence is unknown. (Similar to the SCORE parameter.)
The SEQUENCE command: How to select sequences
The SEQUENCE command is used to indicate what files the query profile and the library sequences are to be taken from. It is also used to indicate what file name the results should be written to.The following parameters have been incorporated with the SEQUENCE command:
- LIBRARY
- keyword which indicates that the library sequence(s) are to be taken from the named file.
- QUERY
- keyword which indicates that the query profile is to be taken from the named file.
- RESULTS
- keyword which indicates that the program results (the alignment(s)) are to be written to this named file.
- LIST
- keyword which indicates that a list of scores and sequence descriptions are to be written to the desired file. This file is similar to the resultant file of the GCG PROFILESEARCH program.
- SCORES
- Keyword which indicates that a PROFILE-SS input file is to be created from this run. One can then adjust the CUTOFF and ZCUTOFF values in this file to have the program retreive the alignments for only the highest scoring alignments.
The program is designed to search sequence files in the NBRF-PIR, GenBank, EMBL, Swiss-Prot, FASTA(*), and GCG individual sequence formats(*).
(*) Support for the FASTA and GCG formats are new in this release. We do not normally keep the databases in FASTA format, so this routine has not been thoroughly tested. We would appreciate receiving reports and bug-fixes that you may have with these input routines. Please send the reports/fixes to biomed@psc.edu
Usage: "SEQUENCE LIBRARY=<file>, QUERY=<file>, RESULT=<file> LIST=<file>, SCORES=<file>"
The QUIT command
The QUIT command is used to terminate the programs and to return the user to the operating system level. The QUIT command is equivalent to the "END-OF-FILE" marker, and entering the "END-OF-FILE" character will terminate the program as well.
Usage: "QUIT"
Sample program input
The following is sample program input to the PROFILESS program. The first line places a title on the output. The second line tells the program that the alignments should not have more than 80 characters per line. The third line selects the protein alphabet. The fourth line selects the gap and newgap coefficients The Gap extension penality is -8.0 while the Gap open penality is 0.0. The fifth line contains information to limit the amount of output. The program will produce no more than the 2 best sub-alignments between any pair of sequences. All alignments will need to score 5.0 or higher to be produced. The sixth line tells the program that our library of sequences is in the file "LIBRARY" while our query profile is in the file "QUERY". We want the alignments written to the file "RESULT". The next line indicates that we want all of the sequences in the query file matched with all of the sequences in the library file. The last line ends the program.
TITLE Phospolipase profile vs swiss-prot
WIDTH = 80
ALPHABET PROTEIN
SCORE GAP= -8.0, NEWGAP= 0.0
LIMIT CUTOFF= 5.0 NUMBER= 2
SEQUENCE LIBRARY=LIBRARY QUERY=QUERY RESULT=RESULT LIST=LIST
MATCH NONORMAL NOAVERAGE
QUIT
Sample Cray NQS job
# These first few lines set qsub options. # #QSUB -lt 4420 -lT 4870 #QSUB -lm" 6 mw" -lM " 6 mw" #QSUB -r 1PP2R -o 1PP2R -eo -s /bin/sh # #The QSUB commands contains LIMITS for time # (-lt 4420 and -lT 4870) and the # memory (-lm " 6 mw" and -lM " 6 mw ") # (where the upper case limit is for the JOB (-lT & -lM) # and the lower case limit is for each process # (-lt and -lm); thus you want the -lT bigger than -lt # so that the output is returned!), the runtime name # of the job (-r 1PP2R ), the output file from the # shell (-o 1PP2R ), and the shell that the job # should be run in (i.e., for the bourne shell, # this is: '-s /bin/sh'). # # Echos all commands to the log file (the cpr file). set -x # # Start Job Accounting # ja cd $TMP # # Fetch '1pp2r.nrl' # from the front end and copy it to the file # QUERY on the cray. # fetch QUERY -f CB -t '1pp2r.nrl' # # # Fetch from FAR the file # /biomed/db/swiss/swiss # And assign it to LIBRARY on the cray. # far get /biomed/db/swiss/swiss LIBRARY # # # You have chosen a standard database, SWISSPROT # /biomed/db/swiss/swiss # The current database has 22654 sequences with a total # of 7500086 bases. The current version is: # SWISS-PROT Protein Sequence Database. 20.0 (11/91) # # # Run program profiless with th input coming # From the following *HERE* Document. # (NOTE: This is shell specific!) # profiless <<\EOF ECHO TITLE Phospolipase profile vs swiss-prot WIDTH = 80 ALPHABET PROTEIN SCORE GAP= -8.0, NEWGAP= 0.0 LIMIT CUTOFF= 5.0 NUMBER= 2 SEQUENCE LIBRARY=LIBRARY QUERY=QUERY RESULT=RESULT LIST=LIST MATCH NONORMAL NOAVERAGE QUIT EOF # # This is the end of the input file. # # Start the cleanup of the directory by removing the # following files: # QUERY LIBRARY # rm QUERY LIBRARY # # Send RESULT on the Cray back to the VAX as # 'maxsegs.OUT' # dispose RESULT -t 'maxsegs.OUT' # # # Write out the results of the job accounting. ja -cstlh
Sample profile
(Peptide) PROFILEMAKE v4.40 of: unique.msf;3{*} Length: 15
Sequences: 11 MaxScore: 10.68 May 5, 1992 17:38
profilemake/data=pam80.gcg;1/inf=unique.msf;3{*}
Gap: 1.00 Len: 1.00
GapRatio: 0.33 LenRatio: 0.10
unique.msf{A2ap$H} From: 1 To: 15 Weight: 1.00
unique.msf{Cae1$X} From: 1 To: 15 Weight: 1.00
unique.msf{Cckn$P} From: 1 To: 15 Weight: 1.00
unique.msf{Cckn_2p} From: 1 To: 15 Weight: 1.00
unique.msf{Cckn_3p} From: 1 To: 15 Weight: 1.00
unique.msf{Co4$Hu} From: 1 To: 15 Weight: 1.00
unique.msf{Co4$Mo} From: 1 To: 15 Weight: 1.00
unique.msf{Co4_2hu} From: 1 To: 15 Weight: 1.00
unique.msf{Co4_2mo} From: 1 To: 15 Weight: 1.00
unique.msf{Co4_3mo} From: 1 To: 15 Weight: 1.00
unique.msf{Coli$B} From: 1 To: 15 Weight: 1.00
Symbol comparison table: Pam80.Gcg FileCheck: 5110
This is the PAM 80 scoring matrix in gcg format
April 22, 1992 16:56
Relaxed treatment of non-observed characters
Exponential weighting of characters
Cons A B C D E F G H I K ... Y Z Gap Len ..
B -22 5 -75 -20 -13 -18 -64 -63 -51 -67 ... -36 -16 24 24
S -22 -13 -45 -41 -30 -33 -58 -63 -94 -62 ... -86 -24 24 24
E -33 50 -20 49 77 -92 -74 -64 -82 -65 ... 35 -18 24 24
E -12 34 -16 30 56 -68 -53 -84 -81 -89 ... 24 -18 24 24
E -35 54 -52 47 65 -74 -73 -56 -94 -91 ... 52 -23 100 100
E -41 64 -64 74 99 -09 -75 -64 -99 -74 ... 71 -21 100 100
D -34 46 -07 58 41 -56 -59 -50 -21 -88 ... 14 -8 100 100
Y -80 -80 -80 -80 -80 -80 -80 -80 -80 -80 ... 0 -80 100 100
B -35 51 -44 39 34 -82 -31 -97 -96 -88 ... 15 -28 100 100
E -55 45 -31 52 61 -71 -50 -52 -12 -87 ... 23 -14 100 100
! 11
V -64 -47 -79 -44 -25 -85 -77 -14 -48 -96 ... -30 -63 22 22
E -28 8 -84 15 42 -39 -56 -77 -66 -74 ... 6 -26 22 22
B -35 21 -94 -15 -21 -05 -69 -57 -65 -69 ... -31 -29 22 22
B -16 -4 -42 -40 -37 -05 -42 -75 -69 -67 ... -71 -25 22 22
B -36 11 -72 -10 -7 -25 -35 -57 -90 -59 ... -43 -15 22 22
* 26 0 0 54 69 13 16 4 9 6 0 37
Nucleotides
NAME CODE MEANING
Adenine A A
Cytosine C C
Guanine G G
Thymine/Uracil T/U T/U
M A or C
R A or G
W A or T/U
S C or G
Y C or T/U
K G or T/U
V A or C or G
H A or C or T/U
D A or G or T/U
B C or G or T/U
X/N A or C or G or T/U
Amino Acids
NAME 3 LETTER CODE
Alanine Ala A
Cysteine Cys C
Aspartic Acid Asp D
Glutamic Acid Glu E
Phenylalanine Phe F
Glycine Gly G
Histidine His H
Isoleucine Ile I
Lysine Lys K
Leucine Leu L
Methionine Met M
Asparagine Asn N
Proline Pro P
Glutamine Gln Q
Arginine Arg R
Serine Ser S
Threonine Thr T
Valine Val V
Tryptophan Trp W
Tyrosine Tyr Y
Aspartic/Asparagine Asp,Asn B
Glutamic/Glutamine Glu,Gln Z
Unknown Xxx X
Pittsburgh Supercomputing Center National Resource for Biomedical Supercomputing
An NIH Supported Resource Center
300 S. Craig St.
Pittsburgh, PA 15213.
Phone: 412-268-4960,
Email: biomed@psc.edu
Please send suggestions for improving this code and error reports to biomed@psc.edu.
biomed-www@psc.edu (last updated: 2/17/98) © 1998 Pittsburgh Supercomputing Center