NWGAP Users Manual

Globally Optimal Sequence Alignment Program

Authors: Alexander J. Ropelewski and Dr. Hugh B. Nicholas

Copyright (C) 1991, 1992, 1998 Pittsburgh Supercomputing Center

The NWGAP program was developed and enhanced under the National Science Foundation cooperative agreement ASC-8500650 and the National Institutes of Health National Center for Research Resources grant 1 P41 RR06009.

This document refers to PSC NWGAP 2.4

  1. Introduction
  2. General Program Information
  3. Program Commands
  4. Dynamic Programming Algorithm
  5. Implementation on Architectures
  6. Sample Program Input
  7. Appendices

Introduction

NWGAP is an optimal global sequence alignment program that makes use of the dynamic programming algorithm. The dynamic programming algorithm is a mathematical technique used to find the minimum or maximum of a discrete function. It has found use not only in biological sequence comparisons, but also in analyzing the relatedness of bird songs, in matching geological features distorted across faults, gas chromatography, speech recognition, and in general text analysis. In general text analysis, the dynamic programming method can be used to compute the minimum number of characters that need to be changed in order to convert one word to another (or one sentence to another). The implementations of the dynamic programming algorithm which are available in the NWGAP program are flexible enough to be easily adapted to these and similar uses, as NWGAP uses a modified version of the Needleman-Wunsch algorithm.
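The word-to-word example mentioned above can be illustrated with a short Python sketch. It is not taken from NWGAP; the function name edit_distance is hypothetical, and the sketch assumes that a "change" means a single-character insertion, deletion, or substitution, each costing 1.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `source` into `target`."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i characters reaches the empty prefix
        for j, t_char in enumerate(target, start=1):
            substitute = previous[j - 1] + (s_char != t_char)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # prints 3
```

Here a minimum is computed rather than a maximum; sequence alignment uses the same table-filling idea with similarity scores instead of unit costs.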

NWGAP can be used with all types of sequence data, but is particularly useful in the analysis of biological sequences such as DNA, RNA, and protein sequences. In addition, the program allows the user to select many different analysis options, such as defining a unique alphabet and using a different scoring matrix. The user can also limit the amount of output received from the analysis.

About the algorithm as related to biological sequence analysis:

A genetic sequence is a representation of the genetic information in DNA by a character string, an unbroken series of letters. A sequence has a finite length and contains only those characters that belong to a specified alphabet. Biological sequences can usually be categorized into one of three classes, DNA, RNA, or protein. In the case of DNA sequences, the set alphabet used is a four letter alphabet that consists of {A, C, G, T}. Likewise, RNA sequences also use a 4 character alphabet {A, C, G, U}. Proteins, on the other hand, use a 20 character alphabet.

One of the most useful analyses that can be done on sequences is to compare a few newly discovered sequences (called query sequences) with a large library of previously known and well characterized sequences. The results of these comparisons are called alignments. An alignment is a specific juxtaposition of the characters in the two sequences being compared. Alignments produced by the dynamic programming algorithm are useful because they show the maximum amount of similarity in the juxtaposed characters of the two sequences being compared.

Two distinctly different kinds of alignments can be generated using the dynamic programming algorithm. The researcher must determine which of the two will be more appropriate. A GLOBAL alignment (which is done in the NWGAP program) will force all of the characters of both strings into juxtaposition with either a character from the other string or with an inserted blank. (Blanks are inserted because a global alignment requires the aligned strings to be the same length. The dynamic programming algorithm inserts the required blanks in the most favorable positions, which are not necessarily at the ends of a sequence. Blanks inserted into a biological sequence are interpreted as the result of a physical process that caused part of the gene to be lost.) A LOCAL alignment (which is done in the MAXSEGS program) will find the subsequences of the two character strings that are the most similar to each other. Only rarely does this include all of both character strings.
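To make the idea of a global alignment with inserted blanks concrete, here is a minimal Python sketch. It is an illustration only and not the NWGAP code: the function name global_align, the "-" blank character, and the scores (match=1.0, mismatch=-1.0, and a single per-blank penalty of -2.0) are assumptions chosen for the example, and the simpler one-constant gap penalty is used instead of NWGAP's separate gap and newgap constants.

```python
def global_align(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Return (aligned_a, aligned_b, score) for a global alignment."""
    n, m = len(a), len(b)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Borders: a prefix aligned entirely against blanks.
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), S[n][m]


aligned_a, aligned_b, score = global_align("ACGT", "AGT")
print(aligned_a)  # ACGT
print(aligned_b)  # A-GT
print(score)      # 1.0
```

Both output strings have the same length, and the blank falls inside the shorter sequence rather than at its end because that position gives the highest total score.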

Dynamic programming alignments often give the researcher valuable information, either by determining whether or not the query sequence is a fragment of a larger sequence, or by showing which known sequences have similar functions or similar genetic ancestors to the query sequence. Notable successes include the identification of several oncogenes (genes that are involved in the development of cancers) as genes closely related to naturally occurring growth factor genes. These identifications save many months of expensive trial-and-error experimentation.

The Dynamic Programming Algorithm

The dynamic programming algorithm computes a solution to a problem by examining all possible solutions to that particular problem. For aligning character strings, this means that all possible juxtapositions of characters are examined. In the case of local sequence alignments, the computation can be visualized by writing one sequence along the top of a two dimensional table and writing the other sequence down the side of the same two dimensional table. Each cell in the table corresponds to a specific juxtaposition of a character from each sequence. We then fill the table by computing all possible alignments between the two sequences. The best solution is found at the location of the maximum score in the table. Given this table of all possible alignments, S is computed by the following recursive formula:

REG(i) = MAX( REG(i)+C, S(i-1,j)+N+C, 0)
LEG(j) = MAX( LEG(j)+C, S(i,j-1)+N+C, 0) 
S(i,j) = MAX( REG(i), LEG(j), S(i-1,j-1)+VALUE(A(i),B(j)), 0 ) 

Where: S(i,0) and S(0,j) are defined as 0, when i and j >= 0. C = the gap constant. N = the newgap (open gap) constant. A and B = the two sequences being compared.

REG and LEG are used to determine whether it is more beneficial to keep an old gap open or to introduce a new gap.

VALUE( A(i), B(j) ) is a matrix of scores that measure the degree of similarity between pairs of characters (letters) in the alphabet from which the sequences are constructed.

The relative magnitudes of C, N, and the elements in the VALUE matrix determine how often and where blanks will be inserted into the original sequences in constructing the alignment.
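The recurrence described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the NWGAP source: the function names score_table and x2_value are hypothetical, REG and LEG are stored as full two-dimensional tables for readability (an assumption about how the running gap scores are carried from cell to cell), and the 0 term is kept exactly as printed in the formula. The example uses the values of the X2 scoring vector (match=1.0, mismatch=0.0, gap=-1.0, newgap=-2.5).

```python
def score_table(a, b, value, gap, newgap):
    """Fill the score table S for sequences a and b.

    value(x, y) plays the role of VALUE(A(i), B(j)); gap is C and
    newgap is N in the printed formula.
    """
    n, m = len(a), len(b)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]    # S(i,0) = S(0,j) = 0
    reg = [[0.0] * (m + 1) for _ in range(n + 1)]  # gap reaching (i,j) from (i-1,j)
    leg = [[0.0] * (m + 1) for _ in range(n + 1)]  # gap reaching (i,j) from (i,j-1)
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an open gap (cost C) or open a new one (cost N+C).
            reg[i][j] = max(reg[i - 1][j] + gap, S[i - 1][j] + newgap + gap, 0.0)
            leg[i][j] = max(leg[i][j - 1] + gap, S[i][j - 1] + newgap + gap, 0.0)
            diagonal = S[i - 1][j - 1] + value(a[i - 1], b[j - 1])
            S[i][j] = max(reg[i][j], leg[i][j], diagonal, 0.0)
            best = max(best, S[i][j])
    return S, best


def x2_value(x, y):
    """Match/mismatch scores of the X2 vector."""
    return 1.0 if x == y else 0.0


table, best = score_table("ACGT", "ACGT", x2_value, gap=-1.0, newgap=-2.5)
print(best)  # 4.0
```

Making gap or newgap more negative makes blanks less likely to appear in the best-scoring path, which is the trade-off between C, N, and the VALUE entries noted above.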

Implementation on Architectures

There are two different implementations of the code available to users. The vector implementation is designed and optimized to run on vector supercomputers. The standard implementation is the vector code with Cray-specific extensions removed. The standard implementation compiles and runs correctly on most ASCII machines with ANSI FORTRAN-90 compilers, including machines running UNIX and VMS. (Originally the code was written in FORTRAN-77, but it now makes use of FORTRAN-90 extensions for dynamic arrays. Compiling under FORTRAN-77 is possible, but static array sizes must be used.) A C compiler is also required. Please note that the standard version is not optimized for any scalar machine. The current code can be obtained at no cost for academic use. For more information on acquiring source code for your home site, contact the Pittsburgh Supercomputing Center's National Resource for Biomedical Supercomputing at biomed@psc.edu or http://www.nrbsc.org/.

General Program Information

The programs are keyword driven rather than menu-driven. The main reason for this is that the programs were designed to be used in batch modes on the various architectures. It is much easier to read and write an input file containing descriptive keywords than to make certain that the right input is on the correct line.

Program input consists of a keyword followed by zero or more keyword parameters. No individual line of input can exceed 132 characters in width. Input lines take the form:

[COMMAND][SEPARATOR][PARAM_1][SEPARATOR]...[SEPARATOR][PARAM_N]

Where [COMMAND]     is the descriptive keyword 
      [PARAM_N]     is a command parameter or sub-parameter 
      [SEPARATOR]   is either the <SPACE>, <EQUALS> or <COMMA>  
                    character. (All separators are treated the same; 
                    it does not matter which separator you use when 
                    one is required.)

In this manual, the space separator is used between the command and its parameters, the equals separator is used to indicate that the next parameter is a subparameter of the previous parameter, and the comma separator is used between independent parameters.
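The input-line format described above can be mimicked with a small Python sketch. It is not the programs' actual parser; the names split_input_line and MAX_LINE_WIDTH are hypothetical, and the sketch assumes that runs of adjacent separators count as a single separator.

```python
import re

MAX_LINE_WIDTH = 132  # maximum input line width stated in this manual


def split_input_line(line):
    """Split an input line into [COMMAND, PARAM_1, ..., PARAM_N].

    Space, equals, and comma are all treated as the same separator.
    """
    if len(line) > MAX_LINE_WIDTH:
        raise ValueError("input line exceeds 132 characters")
    return [token for token in re.split(r"[ =,]+", line.strip()) if token]


print(split_input_line("SCORE PAM120, GAP=   -8, NEWGAP=    0"))
# ['SCORE', 'PAM120', 'GAP', '-8', 'NEWGAP', '0']
```

A line such as TITLE, whose label keeps its spaces, would need special handling that this sketch does not attempt.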

The programs signal that they are waiting for an input line by displaying the program name as the prompt.

Refer to the on-line manual pages or help file for the command needed to run any of the programs installed at your site.

Program Commands

The "ALPHABET" and "SCORE" commands must precede the "MATCH" command. The following are brief descriptions of the commands available in the programs:

  • Alphabet allows the user to select one of several pre-defined alphabets or to define a custom alphabet. An alphabet must be defined for any of the programs to work correctly.
  • Echo toggles command line echoing (useful in batch mode). Echoing is initially turned off.
  • Help displays helpful information for anyone who enters the program accidentally. This keyword does not access an internal help facility.
  • Limit allows the user to limit the amount of output produced. (It is used to define how many alignments are to be presented for inspection.)
  • Match executes the alignment procedure once all required parameters are set.
  • Quit "END-OF-INPUT" keyword used to exit from the program, and to terminate the custom alphabet defining mode.
  • Score defines the scoring method used to obtain the alignment(s).
  • Sequence defines the input files where sequences are to be read and aligned. Also used to define an output file containing the alignments.
  • Title defines a descriptive label written on the program's output.
  • Width allows the user to change the program's output width (useful if you are using a 132-column printer or terminal).
  • ZZ No operation. Allows the user to comment the input file.

The Help Command

The HELP command is intended to provide someone who has entered the program accidentally with enough information to allow them to exit gracefully. The help command DOES NOT access a built-in help facility.

USAGE: "HELP"

The Score Command: How to choose a scoring method

The SCORE command is used to select a scoring method for the two sequences. (This is used to select the VALUE table discussed earlier.) There are four different types of scoring methods that the user can choose from:

Vector method       - Choose among X1, X2 and MATCH.
             
Table method        - Choose among PAM 40, PAM 80, PAM120, 
                      PAM200, PAM250, PAM 320, PROPERTIES, 
                      PET250, MUTMTX and STRUCTURE; you have 
                      the option of changing the default gap 
                      and newgap penalties if you select these 
                      tables.
              
User-defined vector - A scoring table where all matches 
                      are given an equal score, and all
                      mismatches are given an equal penalty.
             
User-defined matrix - The  user  inputs  the  entire  scoring
                      matrix for the alphabet selected. 
    
     **** You MUST select a scoring method! ****

Users who are comparing protein sequences will probably want to use the table scoring method: either the Dayhoff PAM matrices, the PROPERTIES matrix, or the STRUCTURE matrix. If PAM matrices are chosen, the general rule is that the lower PAM matrices will bring out strong but short matches, while the higher PAM matrices, such as the PAM 320 matrix, will bring out long but weak matches. A good analysis of the appropriate Dayhoff PAM matrix for each individual case can be found in the article "Amino Acid Substitution Matrices from an Information Theoretic Perspective" by Stephen Altschul ("Journal of Molecular Biology", 1991, vol. 219, pp. 555-565). He suggests that for database searching problems, users should choose the PAM 120 matrix if limited to a single search. If three searches are permitted, Altschul suggests using the PAM 40, PAM 120, and PAM 250 matrices. For pairwise alignments, he suggests that the PAM 200 matrix be used. If two matrices are to be used, he suggests using either the PAM 80 in conjunction with the PAM 250 or the PAM 120 paired with the PAM 320 matrix.

Vector Method

To choose a default scoring vector, simply enter the SCORE command followed by the default vector name. Currently the following scoring vectors are available:

* X1     -  scoring table with match=1.0,
            mismatch=-0.9, gap=-2.0, newgap=0.0,
            and cutoff=5.9.
                
* X2     -  scoring table with match=1.0,
            mismatch=0.0, gap=-1.0, newgap=-2.5,
            and cutoff=5.9.
                
* MATCH  -  scoring table with match=1.0,
            mismatch=-1000.0, gap=-1000.0,
            newgap=-0.0, and cutoff=1.0.
 

The Table Method

To choose a default scoring table, simply enter the SCORE command followed by the default table name. You can also select your own GAP and NEWGAP (open gap) penalties by entering them after the table name (for example, PAM 250, gap=-5.0, newgap=-5.0). Currently, the following scoring tables are available:

*PAM40     - Dayhoff PAM-40 based similarity scoring 
             matrix (the PAM 40 matrix is shown in 
             the appendix) The default value for 
             gap=-13.0, newgap=0.0
        
*PAM80     - Dayhoff PAM-80 based similarity scoring 
             matrix (the PAM 80 matrix is shown in 
             the appendix) The default value for gap=-9.0, 
             newgap=0.0
                
*PAM120    - Dayhoff PAM-120 based similarity scoring 
             matrix (the PAM 120 matrix is shown in 
             the appendix)  The default value for gap=-7.0, 
             newgap=0.0
                
*PAM200    - Dayhoff PAM-200 based similarity scoring 
             matrix (the PAM 200 matrix is shown in 
             the appendix)  The default value for gap=-5.0, 
             newgap=0.0
 
*PAM250    - Dayhoff PAM-250 based similarity scoring 
             matrix (the PAM 250 matrix is shown in 
             the appendix)   The default value for gap=-8.0, 
             newgap=0.0
 
*PAM320    - Dayhoff PAM-320 based similarity scoring 
             matrix (the PAM 320 matrix is shown in 
             the appendix)   The default value for gap=-3.0, 
             newgap=0.0 

*STRUCTURE - Structure-Genetic similarity scoring 
             matrix. (The Structure-Genetic matrix is 
             shown in the appendix).  The default value for 
             gap=-2.0, newgap=0.0 
 
*PROPERTIES- Properties similarity scoring matrix. The 
             default value for gap=-3.0, newgap=0.0  
             (The Properties matrix is shown in the appendix).

*MUTMTX    - The Gonnet-Cohen-Benner mutation matrix. The 
             default value for gap=-16.0, newgap=-206.0
             (This matrix is shown in the appendix).

*PET250    - The 1991 Pairwise Exchange Table (PET) matrix 
             (250 PAM)  The default value for gap= -4.0 
             newgap=0.0 (This matrix is shown in the appendix).

The User Defined Vector

To select a user-defined vector scoring scheme, simply enter the vector parameters (and parameter values) after the SCORE command. The vector scoring parameters are:

* MATCH    - assign this score when two letters match.

* MISMATCH - assign this score when two letters do not 
             match.
              
* GAP      - assign this score when a gap is extended 
             in length.
 
* NEWGAP   - assign this score (in addition to the gap 
             score) when a new gap is opened.
 
Note: The values associated with these parameters should
      be entered as REAL (floating point) numbers.

The User Defined Matrix

To use a user-defined scoring matrix, simply enter the word "MATRIX" after the SCORE keyword. The user should enter the GAP and NEWGAP penalties, followed by only the lower triangular portion of the desired scoring matrix. The upper triangular portion of the matrix will be automatically generated by symmetry. You may insert as many spaces as needed between numbers to line up the columns of the scoring matrix. An example of a scoring matrix for the alphabet {A},{C},{G},{T},{N} is given below.

MAXSEGS> SCORE MATRIX
*****    Using "MATRIX" scoring scheme.
Please enter the GAP penality: -4.0
Please enter the NEWGAP penality: -40.0
Please enter the matrix for the appropriate pair. 
Only enter the lower triangular part of the matrix. 
The Upper portion of the matrix is filled in 
automatically.
        A      C      G      T      N
A     10.0
C     -1.0   20.0
G     -2.0   -5.0   30.0
T     -3.0   -6.0   -8.0   40.0
N     -4.0   -7.0   -9.0  -10.0   50.0
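How a full scoring matrix follows from its lower triangular part can be shown with a brief Python sketch. It only illustrates the symmetry rule and is not the program's own code; the function name fill_symmetric and the dictionary representation are assumptions. The numbers are the ones entered in the example session above.

```python
def fill_symmetric(letters, lower_rows):
    """Build a full (row, column) -> score lookup from lower-triangular rows."""
    scores = {}
    for i, row in enumerate(lower_rows):
        if len(row) != i + 1:
            raise ValueError("row %d must contain %d values" % (i + 1, i + 1))
        for j, value in enumerate(row):
            scores[(letters[i], letters[j])] = value
            scores[(letters[j], letters[i])] = value  # mirrored upper entry
    return scores


letters = "ACGTN"
lower = [
    [10.0],
    [-1.0, 20.0],
    [-2.0, -5.0, 30.0],
    [-3.0, -6.0, -8.0, 40.0],
    [-4.0, -7.0, -9.0, -10.0, 50.0],
]
value = fill_symmetric(letters, lower)
print(value[("T", "A")], value[("A", "T")])  # -3.0 -3.0
```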
 
Usage: "SCORE <vector_name>"
       "SCORE <table_name>"
       "SCORE <table_name>,GAP=<REAL>,NEWGAP=<REAL>"
       "SCORE GAP=<REAL>,NEWGAP=<REAL>,MATCH=<REAL>,MISMATCH=<REAL>"
       "SCORE MATRIX"

The Echo Command

The ECHO command toggles command line echoing. This command is most useful when the program is running in batch mode.

Usage: "ECHO"

The Title Command: Labeling your Output

The TITLE command allows the user to write a title to the output stream. This command is used primarily for the user's identification purposes. To label your output, simply enter the TITLE command followed by the label. The label should be less than 125 characters. Spaces are kept, and the title does not have to be in quotes.

Usage: "TITLE <descriptive_title>"

The Alphabet Command: How to select the proper Alphabet

The ALPHABET command is used to set the sequence alphabet to either one of several default alphabets or a user-defined "set" alphabet. You must choose an alphabet before you issue the "MATCH" command. The default alphabets available are:

- PROTEIN   - Alphabet suitable for protein sequences 
              (23 characters)
 
- NUCLEIC   - Alphabet suitable for nucleic acid sequences 
              (5 characters)
 
- AMBIGUOUS - Alphabet suitable for nucleic acid sequences 
              (15 characters)

Default Alphabets

The protein alphabet is defined as: (where the notation "{ }" is used as set notation)

{A,a}, {B,b}, {C,c}, {D,d}, {E,e}, {F,f}, {G,g},
{H,h}, {I,i}, {K,k}, {L,l}, {M,m}, {N,n}, {P,p},
{Q,q}, {R,r}, {S,s}, {T,t}, {V,v}, {W,w}, {X,x},
{Y,y}, {Z,z}

This notation means that every letter between the "{}" has the same meaning. For example, {A,a} means that the lowercase "a" and the uppercase "A" are treated as equivalent letters.

The nucleic alphabet is defined as:
     {A,a},  {C,c},  {G,g},  {N,X,n,x}, {T,U,t,u}
 
The ambiguous alphabet is defined as:
     {A,a}, {B,b}, {C,c}, {D,d}, {G,g}, {H,h},  {K,k},
     {M,m}, {N,X,n,x}, {R,r}, {S,s}, {T,U,t,u}, {V,v},
     {W,w}, {Y,y} 

User Defined Alphabets

In addition to being able to select among several default alphabets, the user can also define a custom alphabet. The user can define several letters which appear in the sequence data as equivalent and represented in the output as a single letter. If the user enters the ALPHABET command without a parameter, he or she must specify the alphabet followed by the keyword QUIT. For example, if one would want to define a purine/pyrimidine alphabet R,Y over the letters A,G,C,T,U one would enter the following:

MAXSEGS> ALPHABET
Please enter the alphabet (form A=B,C,D).
Use only one "=" per line, enter QUIT as the only
word on the line to end the alphabet-entering mode.
R=A,G
Y=C,T,U
QUIT
MAXSEGS>

The letters A and G are now equivalent to "R". C, T, and U are now equivalent to "Y".
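The effect of such a definition can be pictured with a short Python sketch. It is an illustration only, not the program's implementation; the function names build_alphabet and encode are hypothetical, and treating uppercase and lowercase input letters alike is an assumption carried over from the default alphabets.

```python
def build_alphabet(definition_lines):
    """Map each input letter to its representative letter.

    Each line has the form 'A=B,C,D'; a line containing only QUIT ends
    the definition.
    """
    mapping = {}
    for line in definition_lines:
        if line.strip().upper() == "QUIT":
            break
        representative, members = line.split("=", 1)
        representative = representative.strip().upper()
        for member in members.split(","):
            member = member.strip()
            mapping[member.upper()] = representative
            mapping[member.lower()] = representative
    return mapping


def encode(sequence, mapping):
    """Rewrite a sequence using the representative letters."""
    return "".join(mapping[letter] for letter in sequence)


purine_pyrimidine = build_alphabet(["R=A,G", "Y=C,T,U", "QUIT"])
print(encode("ACGU", purine_pyrimidine))  # RYRY
```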

The following is an example of how to define a neutral-polar (P), neutral-nonpolar (N), acidic (A), and basic (B) alphabet.

MAXSEGS> ALPHABET
Please enter the alphabet (form A=B,C,D).
Use only one "=" per line, enter QUIT as the only 
word on the line to end the alphabet-entering mode.
P=S,T,Y,W,N,Q,C
N=G,A,V,I,L,F,P,M
A=D,E
B=K,R,H
QUIT
MAXSEGS>
 
           
Usage: "ALPHABET <pre-defined-name>"
       "ALPHABET"

The Width Command

The WIDTH command is used to change the output width. By default, the output width for sequence alignments is set to 80 characters. The WIDTH command can be used to set the output width between 40 and 132 characters. To change the output width, enter the WIDTH command followed by the output width.

Usage: "WIDTH=<INTEGER>"

The Limit Command: Limiting the amount of Program Output

The LIMIT command is used to define the number of sequence alignments to be retrieved and displayed. The LIMIT command takes up to three parameters:

* CUTOFF=X  - Causes the program to display only  
              alignments that have a similarity
              score of "X" or greater.
 
* NUMBER=Y  - Causes the program not to display 
              any more than "Y" best alignments 
              for any pair of sequences, regardless
              of how high the additional 
              best sub-alignments scored.
 
* QUALITY=Z - Causes the program to eliminate any 
              alignments with a quality less than Z.
              Quality is defined as the total alignment 
              score divided by the alignment length.
              Quality can be thought of as alignment
              density.

Note: If you use any limit parameter, alignments will be printed until any of the above limits is reached.

Usage: "LIMIT CUTOFF=<INTEGER>,NUMBER=<INTEGER>,QUALITY=<REAL>"
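One way to read the three LIMIT parameters together is sketched below in Python. This is an interpretation, not the program's code: the function name apply_limits, the (score, length) tuple representation, and the assumption that alignments are considered in order of decreasing score are all hypothetical.

```python
def apply_limits(alignments, cutoff=None, number=None, quality=None):
    """Return the alignments that would be displayed.

    alignments is a list of (score, length) pairs with length > 0.
    """
    kept = []
    for score, length in sorted(alignments, reverse=True):
        if number is not None and len(kept) >= number:
            break      # NUMBER: never show more than this many
        if cutoff is not None and score < cutoff:
            break      # CUTOFF: every remaining score is lower still
        if quality is not None and score / length < quality:
            continue   # QUALITY: score per aligned position is too low
        kept.append((score, length))
    return kept


candidates = [(60, 30), (55, 100), (40, 10), (90, 45)]
print(apply_limits(candidates, cutoff=50, number=2))
# [(90, 45), (60, 30)]
```

In this reading, the alignment scoring 55 over a length of 100 has a quality of 0.55, so a setting such as QUALITY=1.5 would drop it even though it passes the cutoff of 50.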

The Match Command: Finding the Optimal Alignments

The MATCH command is used only after an alphabet and a scoring routine have been defined. The MATCH command causes the program to compute the similarity scores for the given sequences and to display the alignments (up to the limits imposed by the LIMIT command, or by the limits associated with the scoring method selected.) The MATCH command has two distinctive modes: One allowing for the selection of individual sequences from the sequence files (PICK mode - PICK mode is unavailable in the CM-2 implementation) and one allowing the user to compare all sequences (ALL mode).

The "ALL" mode can take an optional parameter:
 
     * ALL  -  Parameter used if the user wishes
               to align all of the query sequences 
               with all of the library sequences.  
               This is the default mode.
 
     * PAIRWISE - Parameter used if the library
                  file and the query file have exactly
                  the same sequences in them. This reduces 
                  the number of comparisons that need to 
                  be done by a factor of 2.
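The saving offered by PAIRWISE can be seen by counting comparisons in a small Python sketch. The function name comparisons and the sequence names are hypothetical, and whether the program also compares a sequence with itself in PAIRWISE mode is not stated here; the sketch assumes it does not.

```python
from itertools import combinations, product


def comparisons(names, pairwise=False):
    """List the (query, library) pairs that would be compared when the
    query file and the library file hold the same sequences."""
    if pairwise:
        # Each unordered pair once: A vs B is not repeated as B vs A.
        return list(combinations(names, 2))
    # ALL mode: every query against every library sequence.
    return list(product(names, repeat=2))


names = ["seq1", "seq2", "seq3", "seq4"]
print(len(comparisons(names)))                 # 16
print(len(comparisons(names, pairwise=True)))  # 6
```

For n sequences this is n*n comparisons against n*(n-1)/2, which approaches a factor of 2 as n grows.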

The "PICK" mode must utilize at least two parameters:
 
     * LIBRARY=<INDEX> -Select the sequence in the 
                        library file referred to by <INDEX>. The 
                        index is either the Locus name (for a 
                        GenBank sequence), Sequence identifier 
                        (for an EMBL or SWISS-PROT sequence) or 
                        the sequence name (for NBRF sequences). 
                        This parameter is required to use the 
                        pick mode.
 
     * QUERY=<INDEX>  - Select the sequence in the 
                        query file referred to by <INDEX>. The 
                        index is either the Locus name (for a 
                        GenBank sequence), Sequence identifier 
                        (for an EMBL or SWISS-PROT sequence) or 
                        the sequence name (for NBRF sequences). 
                        This parameter is required to use the 
                        pick mode.
 
     * SCORE=<REAL>  -  Optional parameter used to 
                        indicate what the score of the alignment 
                        is.  This parameter normally should NEVER 
                        be used by the user, because the program 
                        assumes that the comparison was already
                        done, and the user simply wishes to do 
                        final analysis. This parameter should 
                        only be used if you are using the 
                        "distributed" MAXSEGS program (in that 
                        case the maxsegs program will fill in
                        these values for you automatically.)   
 
Usage: "MATCH ALL"
       "MATCH PAIRWISE"
       "MATCH LIBRARY=<INDEX>, QUERY=<INDEX>"
 

The Sequence Command: How to select Sequences

The SEQUENCE command is used to indicate what files the query and library sequences are to be taken from. It is also used to indicate what file name the results should be written to. (On the CM-2 the "results" file is a Cray MAXSEGS input file, while on the Cray the "results" file contains sequence alignments.) The following parameters have been incorporated with the SEQUENCE command:

* LIBRARY - keyword which indicates that the library 
            sequence(s) are to be taken from the named file.

* QUERY  -  keyword  which indicates that the query 
            sequence(s) are to be taken from the named file.

* RESULTS - keyword which indicates that the program
            results (the alignment(s) on the Cray and 
            the input file on the CM-2) are to be written 
            to this named file.
 
Usage: "SEQUENCE LIBRARY=<file>, QUERY=<file>, RESULT=<file>" 

The Quit Command

The QUIT command is used to terminate the programs and to return the user to the operating system level. The QUIT command is equivalent to the "END-OF-FILE" marker, and entering the "END-OF-FILE" character will terminate the program as well.

Usage: "QUIT"

Sample Program Input

The following is sample program input to the MAXSEGS program. The first line places a title on the output. The second line tells the program that the alignments should not have more than 80 characters per line. The third line selects the protein alphabet. The fourth line selects the scoring method. In this case, the Dayhoff PAM120 matrix has been selected. The gap extension penalty is -8 while the gap open penalty is 0. The fifth line contains information to limit the amount of output. The program will produce no more than the 2 best sub-alignments between any pair of sequences. All alignments will need to score 50 or higher to be produced. The sixth line tells the program that our library of sequences is in the file "LIBRARY" while our query sequence is in the file "QUERY". We want the alignments written to the file "RESULT". The next line indicates that we want all of the sequences in the query file matched with all of the sequences in the library file. The last line ends the program.

TITLE Rattle snake sequence (PDB code=1PP2R) vs swiss-prot    
WIDTH =  80                                                     
ALPHABET PROTEIN                                             
SCORE PAM120, GAP=   -8, NEWGAP=    0                        
LIMIT CUTOFF= 50  NUMBER=  2                                 
SEQUENCE LIBRARY=LIBRARY QUERY=QUERY RESULT=RESULT                
MATCH ALL                                                         
QUIT 

Appendices

  1. Dayhoff PAM 40 Scoring Matrix
  2. Dayhoff PAM 80 Scoring Matrix
  3. Dayhoff PAM 120 Scoring Matrix
  4. Dayhoff PAM 200 Scoring Matrix
  5. Dayhoff PAM 250 Scoring Matrix
  6. Dayhoff PAM 320 Scoring Matrix
  7. Structure-Genetic Scoring Matrix
  8. Properties Scoring Matrix
  9. 1991 Pairwise Exchange Table (PET) at 250 PAM
  10. Gonnet Mutation Matrix
  11. Nucleotides
  12. Amino Acids

Dayhoff PAM 40 Scoring Matrix

     A    R    N    D    C    Q    E    G    H    I    L   

A    6
R   -6    8
N   -3   -5    7
D   -3   -9    2    7
C   -6   -7   -9  -12   10
Q   -4   -1   -3   -2  -12    8
E   -2   -8   -1    3  -12    2    7
G   -1   -9   -2   -2  -10   -6   -3    6
H   -6   -1    1   -3   -6    1   -4   -9    9
I   -4   -5   -4   -6   -6   -6   -5   -9   -9    8
L   -5   -7   -7  -11  -13   -4   -8   -8   -6   -1    7
K   -6    1    0   -4  -12   -2   -4   -6   -5   -5   -7   
M   -4   -4   -7  -10  -12   -3   -8   -9   -9    0    1 
F   -7   -7   -7  -12  -11  -11  -12   -8   -5   -2   -2  
P   -1   -3   -5   -6   -7   -2   -4   -5   -3   -8   -6  
S    0   -2    0   -3   -2   -4   -3   -1   -5   -6   -8    
T    0   -5   -1   -4   -6   -5   -5   -5   -6   -5   -6  
W  -12   -1   -9  -14  -14  -12  -15  -13   -9  -12  -10 
Y   -7  -11   -4  -10   -3  -10   -7  -12   -3   -6   -6  
V   -2   -6   -6   -7   -5   -6   -6   -5   -6    2   -2  
B   -2   -5    7    7  -10   -1    3   -1    0   -4   -7    
Z   -1   -2   -1    3  -11    7    7   -3    1   -4   -4    
X    1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1     
	
     K    M    F    P    S    T    W    Y    V    B    Z    X
K    6
M   -1   11
F  -12   -4    9
P   -5   -8  -10    8
S   -3   -5   -6   -1    6
T   -2   -3   -7   -3    1    7
W  -10  -12   -3  -12   -4   11   13
Y  -10  -10   -2  -12   -6   -6   -3    9
V   -7   -1   -6   -5   -5   -2  -15   -6    7  
B   -1   -2   -3   -4    0   -1  -10   -4   -6    8
Z   -2   -4  -11   -2   -3   -4  -12   -7   -5    2    8
X   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1

Dayhoff PAM 80 Scoring Matrix

    A    R    N    D    C    Q    E    G    H    I    L 
A   4
R  -4    7    
N  -1   -2    6   
D  -1   -5    3    6
C  -4   -5   -6   -9    9  
Q  -2    0   -1    0   -9    7
E  -1   -4    0    4   -9    2    6
G   0   -6   -1   -1   -7   -4   -2    6
H  -4    0    2   -1   -5    2   -2   -6    8
I  -2   -3   -3   -4   -4   -4   -3   -6   -6    7
L  -4   -5   -5   -7   -9   -3   -5   -6   -4    1    6
K  -4    2    0   -2   -9   -1   -2   -4   -3   -3   -5   
M  -3   -2   -4   -6   -8   -2   -5   -6   -6    1    2   
F  -5   -5   -5   -8   -8   -8   -8   -6   -3    0    0  
P   0   -2   -3   -4   -4   -1   -2   -3   -2   -5   -4    
S   1   -1    1   -1   -1   -3   -2    0   -3   -4   -5      
T   1   -3    0   -2   -4   -3   -3   -2   -4   -1   -4    
W  -8    0   -6  -10  -10   -8  -11  -10   -6   -9   -7     
Y  -5   -8   -2   -7   -2   -7   -5   -8   -1   -4   -4              
V   0   -4   -4   -5   -3   -4   -4   -3   -4    3    0   
B   0   -3    5    6   -6    1    3    0    1   -3   -5       
Z   0    0    1    3   -8    6    6   -2    2   -3   -3      
X  -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1 
                                                                      
    K    M    F    P    S    T    W    Y    V    B    Z    X        
K   6
M   0    9
F  -8   -2    8
P  -3   -5   -7    7
S  -2   -3   -4    0    4
T  -1   -2   -5   -1    2    5 
W  -7   -8   -2   -8   -3   -8   13
Y  -7   -6    4   -8   -8   -4   -2    9
V  -5    1   -4   -3   -3   -1  -11   -4    6
B   1   -4   -5   -2    1    0   -7   -3   -3    7
Z   0   -2   -7   -1   -1   -2   -8   -5   -3    3    7 
X  -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1

Dayhoff PAM 120 Scoring Matrix

     A    R    N    D    C    Q    E    G    H    I    L   
A    3
R   -3    6
N    0   -1    4
D    0   -3    2    5
C   -3   -4   -5   -7    9
Q   -1    1    0    1   -7    6
E    0   -3    1    3   -7    2    5
G    1   -4    0    0   -5   -3   -1    5
H   -3    1    2    0   -4    3   -1   -4    7  
I   -1   -2   -2   -3   -3   -3   -3   -4   -4    6
L   -3   -4   -4   -5   -7   -2   -4   -5   -3    1    5  
K   -2    2    1   -1   -7    0   -1   -3   -2   -2   -4  
M   -2   -2   -3   -4   -6   -1   -4   -4   -4    1    3 
F   -4   -4   -4   -7   -6   -6   -6   -5   -2    0    0 
P    1   -1   -2   -2   -3    0   -1   -2   -1   -3   -3 
S    1   -1    1    0   -1   -2   -1    1   -2   -2   -4   
T    1   -2    0   -1   -3   -2   -2   -1   -3    0   -3  
W   -7    1   -5   -8   -8   -6   -8   -8   -5   -5   -5
Y   -4   -6   -2   -5   -1   -5   -4   -6   -1   -2   -3  
V    0   -3   -3   -3   -2   -3   -3   -2   -3    3    1  
B    0   -2    3    3   -6    0    2    0    1   -2   -4   
Z    0   -1    0    2   -7    4    3   -2    1   -3   -3  
X   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1
                
     K    M    F    P    S    T    W    Y    V    B    Z    X
K    5
M    0    8
F   -6   -1    8
P   -2   -3   -5    6
S   -1   -2   -3    1    3
T   -1   -1   -2   -4    2    4
W   -5   -7   -1   -7   -2   -6   12
Y   -6   -4    4   -6   -3   -3   -1    8
V   -4    1   -3   -2   -2    0   -8   -3    5
B    0   -3   -5   -2    0    0   -6   -3   -3    3
Z    0   -2   -2    0   -1   -2   -7   -4   -3    1    3    
X   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1

Dayhoff PAM 200 Scoring Matrix

     A    R    N    D    C    Q    E    G    H    I    L     
A    2
R   -2    5
N    0    0    2 
D    0   -1    2    3
C   -2   -3   -3   -4    8
Q   -1    1    0    1   -4    4
E    0   -1    1    3   -4    2    3
G    1   -3    0    0   -3   -1    0    4 
H   -1    1    1    0   -3    2    0   -2    5
I   -1   -2   -1   -2   -2   -2   -2   -2   -2    4
L   -2   -2   -2   -3   -5   -1   -3   -3   -2    2    4
K   -1    2    1    0   -4    0    0   -2    0   -1   -2  
M   -1    0   -2   -2   -4   -1   -2   -3   -2    2    3  
F   -3   -3   -2   -4   -4   -4   -4   -3   -1    1    1 
P    1    0    0   -1   -2    0    0   -1    0   -2   -2 
S    1    0    1    0    0   -1    0    1   -1   -1   -2 
T    1   -1    0    0   -2   -1    0    0   -1    0   -1  
W   -5    1   -4   -6   -6   -4   -6   -5   -3   -5   -4   
Y   -3   -4   -1   -3    0   -3   -3   -4    0   -1   -1  
V    0   -2   -2   -2   -2   -2   -2   -1   -2    3    1  
B    1    0    3    4   -3    2    3    1    2   -1   -2  
Z    1    1    2    3   -3    4    4    0    2   -1   -1   
X   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1 

     K    M    F    P    S    T    W    Y    V    B    Z    X
K    4
M    1    5
F   -4    0    7
P   -1   -2   -4    5
S    0   -1   -2    1    2
T    0    0   -2    0    1    3
W   -3   -4    0   -5   -2   -4   12
Y   -4   -2    5   -4   -2   -2    0    7
V   -2    1   -2   -2   -2    0   -6   -2    4
B    1   -1   -2    0    1    1   -3   -1   -1    4
Z    1    0   -3    1    1    0   -4   -2   -1    4    5
X   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1

Dayhoff PAM 250 Scoring Matrix

      A    R    N    D    C    Q    E    G    H    I    L
A     2
R    -2    6
N     0    0    2
D     0   -1    2    4
C    -2   -4   -4   -5   12
Q     0    1    1    2   -5    4
E     0   -1    1    3   -5    2    4
G     1   -3    0    1   -3   -1    0    5
H    -1    2    2    1   -3    3    1   -2    6
I    -1   -2   -2   -2   -2   -2   -2   -3   -2    5
L    -2   -3   -3   -4   -6   -2   -3   -4   -2    2    6
K    -1    3    1    0   -5    1    0   -2    0   -2   -3
M    -1    0   -2   -3   -5   -1   -2   -3   -2    2    4
F    -4   -4   -4   -6   -4   -5   -5   -5   -2    1    2
P     1    0   -1   -1   -3    0   -1   -1    0   -2   -3
S     1    0    1    0    0   -1    0    1   -1   -1   -3
T     1   -1    0    0   -2   -1    0    0   -1    0   -2
W    -6    2   -4   -7   -8   -5   -7   -7   -3   -5   -2
Y    -3   -5   -2   -4    0   -4   -4   -5    0   -1   -1
V     0   -2   -2   -2   -2   -2   -2   -1   -2    4    2   
B     0   -1    2    3   -4    1    2    0    1   -2   -3
Z     0    0    1    3   -5    3    3   -1    2   -2   -3
X     0   -1    0   -1   -3   -1   -1   -1   -1   -1   -1

      K    M    F    P    S    T    W    Y    V    B    Z    X
K     5
M     0    6
F    -5    0    9
P    -1   -2   -5    6
S     0   -2   -3    1    2
T     0   -1   -3    0    1    3
W    -3   -4    0   -6   -2   -5   17
Y    -4   -2    7   -5   -3   -3    0   10
V    -2    2   -1   -1   -1    0   -6   -2    4
B     1   -2   -5   -1    0    0   -5   -3   -2    2
Z     0   -2   -5    0    0   -1   -6   -4   -2    2    3
X    -1   -1   -2   -1    0    0   -4   -2   -1    0   -1    1
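
The matrices in this appendix are printed as lower triangles, split into two column blocks. Because the scores are symmetric, the score for a pair of residues can be read from either order. The short Python sketch below illustrates that lookup. It is not part of NWGAP: the names ORDER, ROWS and score are hypothetical, and only the first four rows of the PAM 250 table above are copied in.

```python
# Illustrative sketch only; not NWGAP code. All names here are hypothetical.
# First four rows of the PAM 250 lower triangle, copied from the table above.
ORDER = "ARND"
ROWS = {
    "A": [2],
    "R": [-2, 6],
    "N": [0, 0, 2],
    "D": [0, -1, 2, 4],
}

def score(a, b):
    """Return the score for residues a and b from a lower-triangular table."""
    i, j = ORDER.index(a), ORDER.index(b)
    if i < j:
        # The table stores only entries whose row is at or below the column,
        # so swap the indices when the pair is given in the other order.
        i, j = j, i
    return ROWS[ORDER[i]][j]

# Both orders read the same stored entry (row D, column R).
print(score("D", "R"), score("R", "D"))
```

A full table would extend ORDER to all 23 symbols and join the two printed column blocks of each row into one list.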

Dayhoff PAM 320 Scoring Matrix

      A    R    N    D    C    Q    E    G    H    I    L
A     1
R    -1    3
N     0    0    1
D     0    0    1    2
C    -1   -2   -2   -3    7
Q     0    1    1    1   -3    2
E     0    0    1    2   -3    1    2
G     1   -1    0    1   -2    0    0    2
H    -1    1    1    0   -2    1    0   -1    3
I     0   -1   -1   -1   -1   -1   -1   -1   -1    2
L    -1   -1   -1   -2   -3   -1   -2   -2   -1    1    3
K     0    2    1    0   -3    1    0   -1    0   -1   -1   
M    -1    0   -1   -1   -3    0   -1   -1   -1    1    2  
F    -2   -2   -1   -3   -2   -2   -2   -2   -1    1    1 
P     1    0    0    0   -1    0    0    0    0   -1   -1  
S     1    0    0    0    0    0    0    1    0   -1   -1
T     1    0    0    0   -1    0    0    0   -1    0   -1 
W    -3    1   -2   -4   -4   -3   -4   -4   -2   -3   -2
Y    -2   -2   -1   -2    0   -2   -2   -3    0    0    0 
V     0   -1   -1   -1   -1   -1   -1   -1   -1    2    1 
B     1    1    2    2   -1    2    2    1    2    0   -1  
Z     1    1    2    2   -2    2    3    1    2    0    0
X    -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1

      K    M    F    P    S    T    W    Y    V    B    Z    X
K     2
M     0    3
F    -2    0    5
P     0   -1   -2    3
S     0   -1   -2    1    1
T     0    0   -1    0    1    1
W    -2   -3    1   -3   -1   -3   11
Y    -3   -1    4   -2   -2   -1    1    6
V    -1    1    0   -1    0    0   -4   -1    2
B     1    0   -1    1    1    1   -2   -1    0    3
Z     1    0   -1    1    1    1   -2   -1    0    3    4
X    -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1   -1

Structure-Genetic Scoring Matrix

      A    R    N    D    C    Q    E    G    H    I    L
A     4
R     0    4
N     1    0    4
D     2    0    3    4
C     0    0    0   -1    4
Q     1    1    1    2   -1    4
E     2    0    1    3   -2    2    4
G     3    1    1    2    1    0    2    4
H     0    2    2    1    0    2    0   -1    4
I     0    0    0   -1    0   -1   -1    0   -1    4
L     0    0   -1   -1    0    0   -1    0    1    3    4
K     1    3    2    1   -2    2    2    0    1    0    0
M     0    0   -1   -2    0    0   -1   -1   -1    2    3
F     0   -1   -1   -1    1   -1   -2   -1    0    2    2
P     3    1    0    0    0    1    1    1    1    0    1
S     3    1    3    1    2    1    1    3    1    0    0
T     3    1    2    0    0    1    1    0    0    1    0
W     0    0   -2   -2    1   -1   -1    1   -1    0    2
Y     0    2    1    0    1    0   -1    0    1    1    1
V     3    0    0    1    0    0    2    2   -1    3    3
B     2    0    3    3    0    2    2    2    2    0   -1
Z     2    1    1    3   -2    3    3    1    1   -1    0
X     1    1    1    1    0    1    1    1    1    1    1
            
      K    M    F    P    S    T    W    Y    V    B    Z    X
K     4
M     0    4
F    -2   -1    4
P     0    0    0    4
S     1   -1    1    2    4
T     2    1   -1    2    3    4
W    -1    1    1    0    0   -1    4
Y    -1    0    3    0    1    0    1    4
V     1    2    2    1    0    1    1    1    4
B     2   -2   -1    0    2    1   -2    1    1    3
Z     0    0   -2    1    1    1   -1   -1    1    2    3
X     1    0    0    1    1    1    0    0    1    1    1   -2

Properties Scoring Matrix

      A    R    N    D    C    Q    E    G    H    I    L
A     5
R    -2    5
N     1    1    5
D     0    1    2    5
C     3   -1    2    0    5
Q     0    2    3    1    1    5
E    -2    2    1    3   -1    2    5
G     4   -2    1   -1    0    0   -2    5
H    -2    2   -1   -1   -1    0    0   -2    5
I     1   -1    0   -2    2    1   -1    1   -1    5
L     1   -1    0   -2    2    1   -1    1    1    4    5
K    -1    3    0    0    0    1    1   -1    3    0    0
M     2    0    1   -1    3    2    0    2    0    3    3
F     1   -1    0   -2    2    1   -1    1    1    2    2
P     1   -1    2    0    2    1   -1    1   -3    0    0
S     2    0    3    1    1    2    0    2   -2   -1   -1
T     2    0    3    1    3    2    0    2    0    1    1
W     0    0    1   -1    1    2    0    0    2    1    1
Y     0    0    1   -1    1    2    0    0    2    1    1
V     2   -2    1   -1    3    0   -2    2   -2    3    3
B     1    1    4    4    1    2    2    0   -1   -1   -1
Z    -1    2    2    2    0    4    4   -1    0    0    0
X     1    0    2    0    1    2    0    1    0    1    1

      K    M    F    P    S    T    W    Y    V    B    Z    X
K     5
M     1    5
F     0    3    5
P    -2    1    0    5
S    -1    0   -1    1    5
T     1    2    1    1    2    5
W     1    2    3   -1    0    0    5
Y     1    2    3   -1    0    0    4    5
V    -1    2    1    1    0    2    0    0    5
B     0    0   -1    1    2    2    0    0    0    2
Z     1    1    0    0    1    1    1    1   -1    2    2
X     1    2    1    0    1    2    1    1    1    1    1   -3
 

1991 Pairwise Exchange Table (PET) at 250 PAM

     A    R    N    D    C    Q    E    G    H    I    L 
A    2 
R   -1    5 
N    0    0    3 
D    0   -1    2    5 
C   -1   -1   -1   -3   11 
Q   -1    2    0    1   -3    5 
E   -1    0    1    4   -4    2    5 
G    1    0    0    1   -1   -1    0    5 
H   -2    2    1    0    0    2    0   -2    6 
I    0   -3   -2   -3   -2   -3   -3   -3   -3    4 
L   -1   -3   -3   -4   -3   -2   -4   -4   -2    2    5 
K   -1    4    1    0   -3    2    1   -1    1   -3   -3 
M   -1   -2   -2   -3   -2   -2   -3   -3   -2    3    3 
F   -3   -4   -3   -5    0   -4   -5   -5    0    0    2 
P    1   -1   -1   -2   -2    0   -2   -1    0   -2    0 
S    1   -1    1    0    1   -1   -1    1   -1   -1   -2 
T    2   -1    1   -1   -1   -1   -1   -1   -1    1   -1 
W   -4    0   -5   -5    1   -3   -5   -2   -3   -4   -2 
Y   -3   -2   -1   -2    2   -2   -4   -4    4   -2   -1 
V    1   -3   -2   -2   -2   -3   -2   -2   -3    4    2 
B    0   -1    3    4   -2    1    3    1    1   -3   -4 
Z   -1    1    1    3   -2    4    4   -1    1   -3   -3 
X   -1   -1   -1   -1   -1   -1   -1   -1    0   -1   -1 
                                                                   
     K    M    F    P    S    T    W    Y    V    B    Z    X  
K    5  
M   -2    6  
F   -5    0    8  
P   -2   -2   -3    6  
S   -1   -1   -2    1    2  
T   -1    0   -2    1    1    2  
W   -3   -3   -1   -4   -3   -4   15  
Y   -3   -2    5   -3   -1   -3    0    9  
V   -3    2    0   -1   -1    0   -3   -3    4  
B    1   -3   -4   -2    1    0   -5   -2   -2    3  
Z    2   -3   -5   -1   -1   -1   -4   -3   -3    3    3  
X   -1   -1   -1   -1   -1   -1   -2   -1   -1   -1   -1   -1  

Gonnet Mutation Matrix

      A     R     N     D     C     Q     E     G     H     I     L 
A    24   
R    -6    47 
N    -3     3    38 
D    -3    -3    22    47 
C     5   -22   -18   -32   115 
Q    -2    15     7     9   -24    27 
E    -0     4     9    27   -30    17    36 
G     5   -10     4     1   -20   -10    -8    66 
H    -8     6    12     4   -13    12     4   -14    60 
I    -8   -24   -28   -38   -11   -19   -27   -45   -22    40 
L   -12   -22   -30   -40   -15   -16   -28   -44   -19    28    40 
K    -4    27     8     5   -28    15    12   -11     6   -21   -21 
M    -7   -17   -22   -30    -9   -10   -20   -35   -13    25    28 
F   -23   -32   -31   -45    -8   -26   -39   -52    -1    10    20 
P     3    -9    -9    -7   -31    -2    -5   -16   -11   -26   -23 
S    11    -2     9     5     1     2     2     4    -2   -18   -21 
T     6    -2     5    -0    -5    -0    -1   -11    -3    -6   -13 
W   -36   -16   -36   -52   -10   -27   -43   -40    -8   -18    -7 
Y   -22   -18   -14   -28    -5   -17   -27   -40    22    -7    -0 
V     1   -20   -22   -29    -0   -15   -19   -33   -20    31    18 
B    -3    -0    30    35   -25     8    18     3     8   -33   -35 
Z    -1    10     8    18   -27    22    27    -9     8   -23   -22 
X    -4    -5    -5    -9    -8    -3    -7   -15    -0    -9    -9 
      
      K     M     F     P     S     T     W     Y     V     B     Z     X  
K    32  
M   -14    43
F   -33    16    70 
P    -6   -24   -38    76 
S     1   -14   -28     4    22 
T     1    -6   -22     1    15    25  
W   -35   -10    36   -50   -33   -35   142  
Y   -21    -2    51   -31   -19   -19    41    78  
V   -17    16     1   -18   -10    -0   -26   -11    34 
B     7   -26   -38    -8     7     3   -44   -21   -26    32 
Z    14   -15   -33    -4     2    -1   -35   -22   -17    57    24 
X    -5    -5   -12   -11    -4    -4   -13    -4    -7   -14   -10    -7 

Nucleotides

NAME                    CODE      MEANING

Adenine                 A         A
Cytosine                C         C
Guanine                 G         G
Thymine/Uracil          T/U       T/U
                        M         A or C
                        R         A or G
                        W         A or T/U
                        S         C or G
                        Y         C or T/U
                        K         G or T/U
                        V         A or C or G
                        H         A or C or T/U
                        D         A or G or T/U
                        B         C or G or T/U
                        X/N       A or C or G or T/U
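
The ambiguity codes above can be read as sets of bases. The Python sketch below shows one way to hold the table and to test whether two codes can stand for a common base. It is only an illustration of the table, with hypothetical names (AMBIGUITY, compatible); it does not describe how NWGAP scores ambiguous symbols, which depends on the scoring matrix in use. T and U are treated as the same base, as in the table.

```python
# Illustrative sketch only; not NWGAP code. T stands for T/U throughout.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}

def compatible(x, y):
    """Return True if codes x and y can stand for at least one common base."""
    return bool(set(AMBIGUITY[x.upper()]) & set(AMBIGUITY[y.upper()]))

print(compatible("R", "Y"))  # purine (A or G) against pyrimidine (C or T/U)
print(compatible("M", "W"))  # both codes include A
```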

Amino Acids

NAME                    3 LETTER  CODE

Alanine                 Ala       A
Cysteine                Cys       C
Aspartic Acid           Asp       D
Glutamic Acid           Glu       E
Phenylalanine           Phe       F
Glycine                 Gly       G
Histidine               His       H
Isoleucine              Ile       I
Lysine                  Lys       K
Leucine                 Leu       L
Methionine              Met       M
Asparagine              Asn       N
Proline                 Pro       P
Glutamine               Gln       Q
Arginine                Arg       R
Serine                  Ser       S
Threonine               Thr       T
Valine                  Val       V
Tryptophan              Trp       W
Tyrosine                Tyr       Y
Aspartic/Asparagine     Asp,Asn   B
Glutamic/Glutamine      Glu,Gln   Z
Unknown                 Xxx       X
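
As an illustration of the table above, the following Python sketch converts a list of three-letter codes to the one-letter codes used in the scoring matrices. The names THREE_TO_ONE and to_one_letter are hypothetical and the sketch is not part of NWGAP. B and Z are left out of the mapping because the table gives them no single three-letter code; mapping unrecognized names to X is an assumption made here for the example.

```python
# Illustrative sketch only; not NWGAP code. Names here are hypothetical.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
    "Xxx": "X",
}

def to_one_letter(residues):
    """Convert three-letter codes to one-letter codes; unknown names become X."""
    return "".join(THREE_TO_ONE.get(r.capitalize(), "X") for r in residues)

print(to_one_letter(["Met", "Lys", "TRP", "Gly"]))
```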

Pittsburgh Supercomputing Center, National Resource for Biomedical Supercomputing
An NIH Supported Resource Center
300 S. Craig St., Pittsburgh, PA 15213. Phone: 412-268-4960, Email: biomed@psc.edu

Please send suggestions for improving this code and error reports to biomed@psc.edu.

biomed-www@psc.edu (last updated: 2/17/98)
© 1998 Pittsburgh Supercomputing Center