MakePSSM User Manual

PSC's Position Specific Scoring Matrix (Profile) Maker
Version 1.1.1

(C) 1998 Pittsburgh Supercomputing Center - All Rights Reserved.
Written by David W. Deerfield II (deerfiel@psc.edu)

Table of Contents

  1. Theory
  2. Gribskov's Method
  3. Henikoff's Method
  4. References
  5. Options
  6. Known Bugs
  7. Usage
  8. Summary of Options

THEORY

One would ideally like to create a true log-odds matrix score for each position in a multiple sequence alignment. One needs a minimum of five counts for each amino acid to get a reasonable statistical distribution. Assuming a normal composition, you would need about 200 sequences to properly describe a highly mutable position. The required number of sequences increases to perhaps 500 sequences for a highly conserved position to have confidence in the resulting log-odds matrix. Very seldom will one have this many aligned sequences, so how do you develop a log-odds representation that represents the known data and yet has a reasonable underlying model to account for the missing data? Implemented within this program are an implicit (Gribskov's method) and explicit (Henikoff's method) approach towards providing the missing data.

GRIBSKOV METHOD (1):

Best described as a weighted average of the similarity scores. Thus, for Residue "i" in column "j":

          20
S(i,j) = SUM FractionResidue(k) * SM(i,k)
         k=1

SM(i,k) is the Similarity Matrix score for residue "i" replacing "k".

Where (for linear weighing):

                       CountsResidue(k)
FractionResidue(k)  =  ----------------
                         TotalCounts

and:

               20
TotalCounts = SUM SequenceWeight(m)
              m=1

That is, the TotalCounts is NOT the SUM of all CountsResidue(m); but rather, also includes the counts for gaps. This reduces the absolute value of the minimum and maximum in columns with gaps.

It should be noted that the Gribskov method is IMPLICITLY based on pseudocounts to compute the log-odds score.

HENIKOFF'S METHOD (2):

Explicitly based on the log-odds score computation, where:

                                    Number of "i" in environment "j"
                 P(i:j)           ----------------------------------
                                    Total number of residues in "j"
Log-Odd(i) = Log(--------) = Log(---------------------------------------)
                                  Total of Residue "i" in Reference Set
                 P(i)           -------------------------------------
                                   Total Residues in Reference Set


Where:  
P(i:j) =  probability of finding residue "i" in column "j"

P(i)   =  probability of finding residue "i" in the 
                  "reference set."

We also computed the entropy (amount of information) for each score by:

            20                    P(i:j)
Entropy =  SUM     P(i:j) *  Log(--------)
           i=1                     P(i)

P(i) is the probability of finding residue "i" in the "reference set." But which "reference set"? The first possibility would be to use the probabilities found in a database (e.g., PIR), which is appropriate if you are building the profile for a database search. The second (-u) uses the probabilities computed from the other columns in the profile as the background counts, which is appropriate if you were to use the profile to help clean up alignments with distantly related proteins. We offer two choices - use the raw counts ("-v") in the other columns to compute P(i) or use pseudo-counts as described below to smooth out the data (default).

The Henikoffs' extension to the log-odds score is reflected in the P(i:j) term. In a straight log-odds score, this would include only the real counts. The Henikoffs' proposed to use the replacement frequencies underlying the blosum set of matrices (you could also use PAM...) to compute a number of pseudo-counts to provide pseudo-probabilities to try to account for missing data.

P(i:j)(final) = Wgt(real)*P(i:j)(real) + Wgt(pseudo)*B(i:j)(pseudo)

Thus, P(i:j)(final) is a weighted average of the P(i:j)(real) from the observed counts and the B(i:j) term which is the probability based on the pseudo-counts. The key is in the weights, where:

Wgt(real) = TotalCounts(real) / ( TotalCounts(real) + TotalCounts(pseudo) )
Wgt(pseudo) = 1.0 - Wgt(real)

In the Henikoffs' approach (which is the one implemented here), the TotalCounts(pseudo) is the product of a "fudge factor" (-h number) and the number of unique amino acids (UNIQAA) in the specific row. At the limits: The more diversity in the position, the more diversity will be forced by using a larger fraction of pseudocounts. Thus, the influence of the pseudocounts decreases as the number of sequences increases.

FOR BOTH METHODS, The ambiguous amino acids and bases are created as the weighted average of the final probabilities.

REFERENCES:

  1. ) M. Gribskov, R. Luethy, D. Eisenberg, 1990. Profile Analysis. Methods in Enzymology 183:146-159.

  2. ) J.G. Henikoff, S. Henikoff, 1996. Using substitution probabilities to improve position-specific scoring matrices. CABIOS 12:135-143.

OPTIONS

A flag can be set whether this is a protein (-p) or nucleic acid (-n) sequence. If neither flag is set, then if the sequences are in a MSF file the type is taken from the header information. For other files without additional information, it is assumed that these are protein sequences.

What approach should be used to compute position-specific scoring matrix? The choices are: Gribskov's (1) or Henikoff's (2) Method. This is defined with the -a option: "-a Gribskov" or "-a Henikoff"

    If you choose "-a Gribskov", there are the following options:
    -d    perform data averaging (default is to NOT perform data averaging)
    -e    use exponential weighting (default is to use linear weighing)

    If you choose "-a Henikoff", there are the following options:

    -h    can change the multiplicative factor for computing the number of pseudo counts per amino acid used (default is to use a factor of 5.0).
    -u    use the probabilities for the occurance of each amino acid from the counts in the other rows (default is to use the probabilities from PIR).

Following matrices (-m MATRIX) can be used for a protein sequence:

    With Gribskov's method, you can use:

    Blosum30, Blosum35, Blosum40, Blosum45, Blosum50, Blosum55, Blosum60, Blosum62, Blosum65, Blosum70, Blosum75, Blosum80, Blosum85, Blosum90, Blosum95, Blosum100
    Pam40, Pam80, Pam120, Pam160, Pam200, Pam250, Pam320
    Gribskov's Matrix (PEPCMP)

    With Henikoff's method, the substitution frequencies for:

    Blosum30, Blosum40, Blosum50, Blosum60, Blosum70, Blosum80, Blosum90

The Nucleic Acid matrices (-m MATRIX) for either method are:

    1. "DNAPAM"    straight DNA PAM
    2. "DNATT"    DNA with Transistions/Transversions
    3. "PSEUDO"    based on DNA pseudogenes

    at the following distances:

      1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150

    An example would be: -m dnatt60

    You can also use Gribskov's matrix with his Approach.

Which gap penalty model should be used:

    -g Gribskov
    use the Gribskov gap length dependent model
    For columns with gaps, this is of the form:
    gap_multiplier = Gmax / (1.0 + Ginc * Lgap )
    where:
      Gmax is the maximum multiplier for where there are gaps
      Ginc determines the rate at which the gap multiplier decreases with increasing gap penalty
      Lgap is the length of the gap
    This gap model does not consider the number of gaps in each column.

    -g Linear
    use a linear weighing based on the number of gaps
    this approach assigns a different value for open gap and extend gap.

    -g MinMax
    assigns a gap penalty based on log-odds scores. Specifically, the multipler is 80% of biggest or twice the absolute value of the smallest logodds score, whichever is larger.

The region needed for the profile:

    -r start end
    the portion of the multiple sequence alignment that you want the profile from.

Lastly, define the input, output and weight files:

-i InFile
-o OutFile
-w WeightFile

The InFile is required. If the OutFile is NOT defined, then the program will create an OutFile name as the root of the InFile with the extension '.prf'.

    ( NucleicAcid.msf -> NucleicAcid.prf )

The default approach is Henikoff.

The defaults for the Henikoff Approach are:

    If Protein:
      Matrix: Blosum60
      Gap: Linear
    If Nucleic Acid:
      Matrix: DNATT50
      Gap: Linear

The defaults for the Gribskov Approach are:

    If Protein or Nucleic Acid:
      Matrix: Gribskov
      Gap: Gribskov
      Weighing: Linear

KNOWN BUGS: EXPONENTIAL WEIGHING FOR GRIBSKOV APPROACH

A difference between this and the GCG package is in the computation of the ambiguous amino and nucleic acids. This program uses the weighted average of the appropriate similarity score for the Gribskov method and the probability (P(i:j)) for the Henikoff method.

USAGE:

MakePSSM -i Protein.msf -a Henikoff -m Blosum60  \
   -g Linear

SUMMARY OF OPTIONS FOR MakePSSM:

-a approach    "Gribskov" or Henikoff"
-d    with "-a Gribskov", data averaging
-e    with "-a Gribskov", exponential weighing
-g gap_model    "Gribskov", "Linear", "MinMax"
-h multiplier    with "-a Henikoff", for weighing pseudo-counts
-i InFile   
-m matrix    substitution or frequency matrix
-n    nucleic acid sequence
-o OutFile   
-p    protein sequence
-r start end    region to create profile from
-u    with "-a Henikoff", compute P(i) from the columns in the profile
-v    with "-a Henikoff -u", do NOT use pseudocounts to smooth out P(i) computed from profile
-w WeightFile    *NOT IMPLEMENTED YET*