MakePSSM User Manual
PSC's Position Specific Scoring Matrix (Profile) Maker
Version 1.1.1
(C) 1998 Pittsburgh Supercomputing Center - All Rights Reserved.
Written by David W. Deerfield II (deerfiel@psc.edu)
Table of Contents
One would ideally like to create a true log-odds matrix score for each position in a multiple sequence alignment. One needs a minimum of five counts for each amino acid to get a reasonable statistical distribution. Assuming a normal composition, you would need about 200 sequences to properly describe a highly mutable position. The required number of sequences increases to perhaps 500 sequences for a highly conserved position to have confidence in the resulting log-odds matrix. Very seldom will one have this many aligned sequences, so how do you develop a log-odds representation that represents the known data and yet has a reasonable underlying model to account for the missing data? Implemented within this program are an implicit (Gribskov's method) and explicit (Henikoff's method) approach towards providing the missing data.
GRIBSKOV METHOD (1):
Best described as a weighted average of the similarity scores. Thus, for Residue "i" in column "j":
20
S(i,j) = SUM FractionResidue(k) * SM(i,k)
k=1
SM(i,k) is the Similarity Matrix score for residue "i" replacing "k".
Where (for linear weighing):
CountsResidue(k)
FractionResidue(k) = ----------------
TotalCounts
and:
20
TotalCounts = SUM SequenceWeight(m)
m=1
That is, the TotalCounts is NOT the SUM of all CountsResidue(m); but rather, also includes the counts for gaps. This reduces the absolute value of the minimum and maximum in columns with gaps.
It should be noted that the Gribskov method is IMPLICITLY based on pseudocounts to compute the log-odds score.
HENIKOFF'S METHOD (2):
Explicitly based on the log-odds score computation, where:
Number of "i" in environment "j"
P(i:j) ----------------------------------
Total number of residues in "j"
Log-Odd(i) = Log(--------) = Log(---------------------------------------)
Total of Residue "i" in Reference Set
P(i) -------------------------------------
Total Residues in Reference Set
Where:
P(i:j) = probability of finding residue "i" in column "j"
P(i) = probability of finding residue "i" in the
"reference set."
We also computed the entropy (amount of information) for each score by:
20 P(i:j)
Entropy = SUM P(i:j) * Log(--------)
i=1 P(i)
P(i) is the probability of finding residue "i" in the "reference set." But which "reference set"? The first possibility would be to use the probabilities found in a database (e.g., PIR), which is appropriate if you are building the profile for a database search. The second (-u) uses the probabilities computed from the other columns in the profile as the background counts, which is appropriate if you were to use the profile to help clean up alignments with distantly related proteins. We offer two choices - use the raw counts ("-v") in the other columns to compute P(i) or use pseudo-counts as described below to smooth out the data (default).
The Henikoffs' extension to the log-odds score is reflected in the P(i:j) term. In a straight log-odds score, this would include only the real counts. The Henikoffs' proposed to use the replacement frequencies underlying the blosum set of matrices (you could also use PAM...) to compute a number of pseudo-counts to provide pseudo-probabilities to try to account for missing data.
P(i:j)(final) = Wgt(real)*P(i:j)(real) + Wgt(pseudo)*B(i:j)(pseudo)
Thus, P(i:j)(final) is a weighted average of the P(i:j)(real) from the observed counts and the B(i:j) term which is the probability based on the pseudo-counts. The key is in the weights, where:
Wgt(real) = TotalCounts(real) / ( TotalCounts(real) + TotalCounts(pseudo) ) Wgt(pseudo) = 1.0 - Wgt(real)
In the Henikoffs' approach (which is the one implemented here), the TotalCounts(pseudo) is the product of a "fudge factor" (-h number) and the number of unique amino acids (UNIQAA) in the specific row. At the limits: The more diversity in the position, the more diversity will be forced by using a larger fraction of pseudocounts. Thus, the influence of the pseudocounts decreases as the number of sequences increases.
FOR BOTH METHODS, The ambiguous amino acids and bases are created as the weighted average of the final probabilities.
REFERENCES:
- ) M. Gribskov, R. Luethy, D. Eisenberg, 1990. Profile Analysis. Methods in Enzymology 183:146-159.
- ) J.G. Henikoff, S. Henikoff, 1996. Using substitution probabilities to improve position-specific scoring matrices. CABIOS 12:135-143.
OPTIONS
A flag can be set whether this is a protein (-p) or nucleic acid (-n) sequence. If neither flag is set, then if the sequences are in a MSF file the type is taken from the header information. For other files without additional information, it is assumed that these are protein sequences.
What approach should be used to compute position-specific scoring matrix? The choices are: Gribskov's (1) or Henikoff's (2) Method. This is defined with the -a option: "-a Gribskov" or "-a Henikoff"
-
If you choose "-a Gribskov", there are the following options:
| -d | perform data averaging (default is to NOT perform data averaging) |
| -e | use exponential weighting (default is to use linear weighing) |
If you choose "-a Henikoff", there are the following options:
| -h | can change the multiplicative factor for computing the number of pseudo counts per amino acid used (default is to use a factor of 5.0). |
| -u | use the probabilities for the occurance of each amino acid from the counts in the other rows (default is to use the probabilities from PIR). |
Following matrices (-m MATRIX) can be used for a protein sequence:
-
With Gribskov's method, you can use:
Blosum30, Blosum35, Blosum40, Blosum45, Blosum50, Blosum55,
Blosum60, Blosum62, Blosum65, Blosum70, Blosum75, Blosum80,
Blosum85, Blosum90, Blosum95, Blosum100
Pam40, Pam80, Pam120, Pam160, Pam200, Pam250, Pam320
Gribskov's Matrix (PEPCMP)
With Henikoff's method, the substitution frequencies for:
Blosum30, Blosum40, Blosum50, Blosum60, Blosum70, Blosum80, Blosum90
The Nucleic Acid matrices (-m MATRIX) for either method are:
| 1. | "DNAPAM" | straight DNA PAM |
| 2. | "DNATT" | DNA with Transistions/Transversions |
| 3. | "PSEUDO" | based on DNA pseudogenes |
at the following distances:
- 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150
An example would be: -m dnatt60
You can also use Gribskov's matrix with his Approach.
Which gap penalty model should be used:
- -g Gribskov
- use the Gribskov gap length dependent model
For columns with gaps, this is of the form:
gap_multiplier = Gmax / (1.0 + Ginc * Lgap )
where:- Gmax is the maximum multiplier for where
there are gaps
Ginc determines the rate at which the gap multiplier decreases with increasing gap penalty
Lgap is the length of the gap
- -g Linear
- use a linear weighing based on the number of
gaps
this approach assigns a different value for open gap and extend gap. - -g MinMax
- assigns a gap penalty based on log-odds scores. Specifically, the multipler is 80% of biggest or twice the absolute value of the smallest logodds score, whichever is larger.
The region needed for the profile:
- -r start end
- the portion of the multiple sequence alignment that you want the profile from.
Lastly, define the input, output and weight files:
| -i | InFile |
| -o | OutFile |
| -w | WeightFile |
The InFile is required. If the OutFile is NOT defined, then the program will create an OutFile name as the root of the InFile with the extension '.prf'.
- ( NucleicAcid.msf -> NucleicAcid.prf )
The default approach is Henikoff.
The defaults for the Henikoff Approach are:
-
If Protein:
-
Matrix: Blosum60
Gap: Linear
-
Matrix: DNATT50
Gap: Linear
The defaults for the Gribskov Approach are:
-
If Protein or Nucleic Acid:
-
Matrix: Gribskov
Gap: Gribskov
Weighing: Linear
KNOWN BUGS: EXPONENTIAL WEIGHING FOR GRIBSKOV APPROACH
A difference between this and the GCG package is in the computation of the ambiguous amino and nucleic acids. This program uses the weighted average of the appropriate similarity score for the Gribskov method and the probability (P(i:j)) for the Henikoff method.
USAGE:
MakePSSM -i Protein.msf -a Henikoff -m Blosum60 \ -g Linear
SUMMARY OF OPTIONS FOR MakePSSM:
| -a approach | "Gribskov" or Henikoff" | |
| -d | with "-a Gribskov", data averaging | |
| -e | with "-a Gribskov", exponential weighing | |
| -g gap_model | "Gribskov", "Linear", "MinMax" | |
| -h multiplier | with "-a Henikoff", for weighing pseudo-counts | |
| -i InFile | ||
| -m matrix | substitution or frequency matrix | |
| -n | nucleic acid sequence | |
| -o OutFile | ||
| -p | protein sequence | |
| -r start end | region to create profile from | |
| -u | with "-a Henikoff", compute P(i) from the columns in the profile | |
| -v | with "-a Henikoff -u", do NOT use pseudocounts to smooth out P(i) computed from profile | |
| -w WeightFile | *NOT IMPLEMENTED YET* |