Sequence Analysis: Which scoring method should I use?

We have modified many of the PSC's sequence analysis programs so that a wide variety of scoring methods can be selected as an easily chosen user option. These programs include the MaxSegs and NWGAP programs, the SP and ST programs, and the MSA program. The SP and ST programs are fast database searching programs, similar to the FASTA algorithm, that run on the PSC Cray C-90 computer. With these modifications, users must now select the scoring method appropriate to their data rather than rely on an arbitrary method that may not be appropriate. In most cases user-defined scoring is also an easily chosen option.

This section describes many of the choices available and recommends scoring methods for particular cases. Recent research has placed the interpretation of sequence alignment scores on a firmer theoretical footing and provided a sound basis for preferring scores based on scoring tables (matrices) derived similarly to the PAM matrices originally studied by M.O. Dayhoff.

Unitary Scoring Matrices

Early sequence alignment programs used a unitary scoring matrix. A unitary matrix scores all matches the same and penalizes all mismatches the same. Although this scoring is sometimes appropriate for DNA and RNA comparisons, using a unitary matrix for protein alignments amounts to proclaiming ignorance about protein evolution and structure. Thirty years of research in aligning protein sequences have shown that the different matches and mismatches among the 400 ordered amino acid pairs found in alignments require different scores.
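As a minimal sketch of the idea, a unitary matrix can be written as a lookup table in which every match gets one score and every mismatch another. The function names, the default scores of 1 and 0, and the two example peptides below are illustrative assumptions, not part of any PSC program.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def unitary_matrix(match=1, mismatch=0, alphabet=AMINO_ACIDS):
    """Return a dict that maps every ordered residue pair to a score."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}

def score_ungapped(seq1, seq2, matrix):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))

matrix = unitary_matrix()
print(len(matrix))                                         # 20 x 20 ordered pairs
print(score_ungapped("HEAGAWGHEE", "HEAPAWGHEK", matrix))  # counts identical positions
```

With these default scores the alignment score is simply the number of identical positions, which shows how little biological information the unitary matrix carries.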

Many alternatives to the unitary scoring matrix have been suggested. One of the earliest suggestions was a scoring matrix based on the minimum number of bases that must be changed to convert a codon for one amino acid into a codon for a second amino acid. This matrix, known as the minimum mutation distance matrix, has succeeded in identifying more distant relationships among protein sequences than the unitary matrix approach. The minimum mutation distance matrix is an improvement because it incorporates knowledge about the process that generates mutations from one amino acid into another. However, it still ignores the processes of selection that determine which mutations will survive in a population.
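The minimum mutation distance can be computed directly from the genetic code. The sketch below assumes the standard genetic code and uses hypothetical names (CODON_TABLE, min_mutation_distance). It compares every codon of one amino acid with every codon of the other and keeps the smallest number of differing bases.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), CODE)}

def min_mutation_distance(aa1, aa2):
    """Fewest base changes needed to turn a codon for aa1 into a codon for aa2."""
    codons1 = [c for c, aa in CODON_TABLE.items() if aa == aa1]
    codons2 = [c for c, aa in CODON_TABLE.items() if aa == aa2]
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in codons1 for c2 in codons2)

print(min_mutation_distance("F", "L"))  # TTT -> TTA needs one change
print(min_mutation_distance("W", "N"))  # TGG shares no position with AAT or AAC
```

A scoring matrix would then assign higher scores to smaller distances; how the distances are converted to scores is a separate choice that this sketch does not make.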

Another improvement over the unitary matrix is a scoring matrix based on selected physical, chemical, or structural properties that are shared or not shared by the 190 distinct pairs of amino acids. Specific instances of this approach work well for some sequences, but not so well for others. The approach works best if the matrix is based on properties that have been strongly conserved during the evolution of your sequences. This is because a property-based matrix attempts to specify the criteria that determine whether or not a mutation can survive and be fixed in a population.

Evolutionary Distances and log-odds scores

The best improvement achieved over the unitary matrix was based on evolutionary distances. Margaret Dayhoff pioneered this approach in the 1970s. She made an extensive study of the frequencies with which amino acids substituted for each other during evolution. The studies involved carefully aligning all of the proteins in several families of proteins and then constructing phylogenetic trees for each family. Each phylogenetic tree was examined for the substitutions found on each branch. This led to a table of the relative frequencies with which amino acids replace each other over a short evolutionary period.

This table and the relative frequency of occurrence of the amino acids in the proteins studied were combined in computing the PAM (Point Accepted Mutations) family of scoring matrices. Each PAM matrix has a number associated with it. The number is the number of accepted point mutations per 100 amino acids. The traditional PAM matrix, the PAM250 matrix, often referred to as the Dayhoff Matrix, assumes the occurrence of 250 point mutations per 100 amino acids, or per 300 nucleotides in the gene. A particular PAM matrix is most efficient for aligning sequences, or for finding them in a database, when they have diverged to the extent indicated by the PAM number of the matrix.
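A PAM number above 100 can look paradoxical, since it implies more mutations than residues. In Dayhoff's model the matrix for a larger distance is obtained by multiplying the one-PAM mutation probability matrix by itself, so repeated and reverting substitutions at the same site are counted. The sketch below illustrates this with a deliberately artificial four-letter alphabet and a made-up one-PAM matrix (TOY_PAM1); it is not the real 20 x 20 Dayhoff data, and the helper names are hypothetical.

```python
def mat_mult(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mult(result, base)
        base = mat_mult(base, base)
        n >>= 1
    return result

# Toy assumption: 4 equally frequent residues; at 1 PAM each residue has a
# 1% chance of being replaced, spread evenly over the 3 alternatives.
STATES = 4
TOY_PAM1 = [[0.99 if i == j else 0.01 / (STATES - 1) for j in range(STATES)]
            for i in range(STATES)]

def fraction_unchanged(pam_distance):
    """Average probability that a site shows its original residue."""
    m = mat_power(TOY_PAM1, pam_distance)
    return sum(m[i][i] for i in range(STATES)) / STATES

for distance in (1, 40, 120, 250):
    print(distance, round(fraction_unchanged(distance), 3))
```

Even in this toy model a substantial fraction of sites still shows the original residue at a distance of 250, because many of the 250 mutations per 100 sites fall on sites that have already changed.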

The PAM matrices have theoretical advantages over alternative methods of scoring alignments. From a biological point of view, PAM matrices are based on observed mutations. Thus they contain information about the processes that generate mutations as well as the criteria that are important in selection and in fixing a mutation within a population. From a statistical point of view, PAM matrices and other log-odds matrices are the most accurate description, given the data used in creating the matrices, of the changes in amino acid composition that are expected after a given number of mutations. Thus the highest scoring alignment is the one that is statistically most likely to have been generated by evolution rather than by chance. The statistical argument applies strictly only to local alignments (alignments of subregions rather than entire sequences) without gaps, but it is probably a useful guide for the local alignments with gaps produced by many popular database searching programs. It is not clear to what extent the statistical arguments extend to global alignments.

The PAM matrices and other substitution matrices discussed below are generally presented as log-odds matrices. That is, each score in the matrix is the logarithm of an odds ratio. The odds ratio used is the number of times residue "A" is observed to replace residue "B" divided by the number of times residue "A" would be expected to replace residue "B" if replacements occurred at random. Thus positive scores in the matrix designate pairs of residues that replace each other more often than expected by chance. This is evidence in favor of the aligned sequences being homologous (that is, related to each other through a common ancestral gene). Negative scores in the matrix designate pairs of residues that replace each other less often than would be expected by chance and are evidence against the sequences being homologous.
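The log-odds calculation described above can be sketched in a few lines. The function name and the replacement counts are hypothetical; the scaling of 10 times the base-10 logarithm, rounded to an integer, is the convention commonly associated with the Dayhoff matrices and is treated here as an assumption that other matrices need not share.

```python
import math

def log_odds_score(observed, expected, scale=10.0):
    """Return scale * log10(observed / expected), rounded to an integer.

    observed: how often the residue pair is seen replacing each other
    expected: how often it would be seen if replacements occurred at random
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("counts must be positive")
    return round(scale * math.log10(observed / expected))

# Hypothetical counts, for illustration only.
print(log_odds_score(200, 100))  # seen twice as often as chance: positive score
print(log_odds_score(100, 100))  # seen exactly as often as chance: zero
print(log_odds_score(50, 100))   # seen half as often as chance: negative score
```

Because the scores are logarithms, adding them along an alignment corresponds to multiplying the odds ratios of the individual aligned pairs.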

You can use the above interpretation of positive and negative scores in log-odds matrices to objectively select groups of amino acids that represent conservative substitutions in proteins. Conservative substitutions are generally defined as amino acid replacements that preserve the structure and functional properties of proteins. The log-odds matrix summarizes the observed replacements that have taken place while conserving the essential properties of many families of proteins. Thus the log-odds matrix provides an empirical, experimental determination of conservative replacements rather than the idiosyncratic definitions commonly used.

Every mutation in a sequence destroys information about the evolutionary relatedness of the sequences. This means that a PAM matrix with a lower number yields equal support for the conclusion that two sequences are homologous from shorter aligned segments. Keep in mind, however, that the best alignment, and therefore the most powerful evidence for homology, results from using the PAM matrix that corresponds to the number of mutations required to generate the two sequences from their common ancestor. Thus, to generate an alignment for publication, you should use a PAM matrix that corresponds to the degree of identity observed between the two sequences.

So which PAM matrix should you use for a protein database search? The PAM120 matrix is probably the most useful if you are only going to use one matrix. A more comprehensive and effective strategy uses several matrices. Complete coverage results from using the PAM40, PAM120 and PAM250 matrices, while the PAM80 and PAM200 matrices give good coverage with two matrices.

If your interest is not in database searching, but rather in comparing sequences that you already know are related, different PAM matrices yield better results. If you are limited to a single analysis, use the PAM200 matrix. Alternatively, if you make two runs, either the PAM80 and PAM250 matrices or the PAM120 and PAM320 matrices will give the best results. Ultimately the best alignment should result from using the PAM matrix corresponding to the actual degree of divergence of the pair of sequences.
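The recommendations in the last two paragraphs can be collected into a small lookup table. The table keys and the function name below are hypothetical conveniences; the matrix choices themselves are the ones stated in the text.

```python
# Keys are (task, number of runs); each value lists the alternative
# sets of PAM matrices recommended in the text for that situation.
RECOMMENDED_PAM = {
    ("database_search", 1): [("PAM120",)],
    ("database_search", 2): [("PAM80", "PAM200")],
    ("database_search", 3): [("PAM40", "PAM120", "PAM250")],
    ("pairwise_comparison", 1): [("PAM200",)],
    ("pairwise_comparison", 2): [("PAM80", "PAM250"), ("PAM120", "PAM320")],
}

def recommended_pam(task, runs):
    """Return the recommended matrix sets for a task and a number of runs."""
    try:
        return RECOMMENDED_PAM[(task, runs)]
    except KeyError:
        raise ValueError(f"no recommendation for {task!r} with {runs} run(s)") from None

print(recommended_pam("database_search", 3))
print(recommended_pam("pairwise_comparison", 2))
```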

Amino Acid Substitutions

There have been several recent attempts to construct more modern scoring matrices based on observed amino acid substitutions. The authors of these scoring matrices all report improved results compared to the original PAM matrices created by M.O. Dayhoff. There is no doubt that there is room to improve on the original Dayhoff matrices, both by including more data and by improving the underlying evolutionary model and mathematical methods. However, we have accumulated only a limited amount of experience with many of these new scoring matrices and are unsure how much of an improvement they actually represent.

The Blosum family of matrices developed by Steven and Jorja Henikoff is one of these newly developed sets of log-odds scoring matrices, and one with which we have appreciable experience. This experience comes both from widespread practical use and from systematic comparisons of the effectiveness of the matrices. These studies suggest that the Blosum matrices are an improvement over the Dayhoff PAM matrices.

The improved performance of the Blosum matrices probably derives from two main factors. The first is that many more protein sequences were known when the Blosum matrices were first derived, and thus they incorporate many more observed amino acid substitutions. The second is that the observed substitutions used in constructing the Blosum matrices are restricted to those found within well-conserved blocks in a multiple sequence alignment.

Limiting the included substitutions to well conserved blocks yields at least two benefits. First the alignments are most reliable in these blocks and the proportion of false substitutions should be reduced. Perhaps equally important is that these well conserved blocks are the regions most likely to be found in database searches and thus the Blosum matrices represent the most appropriate substitution pattern.

One potential limitation of the Blosum matrices relative to the PAM matrices is that the substitutions are counted within the columns of a multiple sequence alignment rather than along the branches of an evolutionary tree. This could bias the results by overcounting some substitutions and undercounting others. This has been taken into account by producing a family of matrices. Within each protein family, sequences that are more than N% identical are grouped together and counted as a single sequence. The resulting matrix is called the Blosum N matrix. Given the effectiveness of these matrices, this approach seems to have worked well. As with the PAM matrices, it is best to use the matrix that corresponds to the percent identity of the sequences with which you are working. A more detailed discussion can be found in the papers cited below.
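The grouping step can be sketched as follows. This is an illustrative single-linkage clustering of the segments in one block, not the Henikoffs' actual program. The function names and the four toy segments are assumptions, and the threshold test follows the wording above (strictly more than N% identical).

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length segments."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_block(segments, threshold):
    """Group segment indices so that any two segments more than `threshold`
    percent identical end up in the same cluster (single linkage)."""
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if percent_identity(segments[i], segments[j]) > threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(segments)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Toy block of four aligned segments (made up for illustration).
BLOCK = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHLRV", "MNPQRSTVWY"]

print(cluster_block(BLOCK, 80))  # only the two closest segments merge
print(cluster_block(BLOCK, 62))  # a lower threshold merges three segments
```

When substitutions are counted, each cluster would then contribute as one sequence, for example by giving every member a weight of one over the cluster size. Lower values of N merge more sequences, so the resulting matrix reflects more divergent comparisons.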

Equivalent PAM and Blosum matrices based on relative entropy

  PAM100  ==>    Blosum90
  PAM120  ==>    Blosum80
  PAM160  ==>    Blosum60
  PAM200  ==>    Blosum52
  PAM250  ==>    Blosum45
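The relative entropy used to pair the matrices above is, in the information-theoretic treatment of Altschul (1991), the average score per aligned residue pair, in bits, when the pairs occur with the frequencies the matrix was built from. The sketch below computes it for a made-up two-letter alphabet; the function name and all frequencies are illustrative assumptions.

```python
import math

def relative_entropy(pair_freqs, background):
    """Relative entropy in bits of a set of target pair frequencies.

    pair_freqs: dict mapping (a, b) to the frequency of that aligned pair
                in true alignments (values sum to 1)
    background: dict mapping each residue to its overall frequency
    """
    return sum(q * math.log2(q / (background[a] * background[b]))
               for (a, b), q in pair_freqs.items() if q > 0)

background = {"X": 0.5, "Y": 0.5}

# Pairs as frequent as chance predicts: the matrix carries no information.
unrelated = {("X", "X"): 0.25, ("X", "Y"): 0.25,
             ("Y", "X"): 0.25, ("Y", "Y"): 0.25}

# Matches enriched over chance: each aligned pair carries some information.
related = {("X", "X"): 0.4, ("X", "Y"): 0.1,
           ("Y", "X"): 0.1, ("Y", "Y"): 0.4}

print(relative_entropy(unrelated, background))
print(round(relative_entropy(related, background), 3))
```

Matrices tuned to closely related sequences have higher relative entropy, which is why a low PAM number pairs with a high Blosum number in the table.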

Another recent advance is that the PAM approach has been applied to nucleic acid sequences. This has led to two simple suggestions for nucleic acid sequences. First, if your sequence is a coding region, it is usually better to translate it to an amino acid sequence for database searching. Second, if you must use nucleotides, a PAM distance of 47 is a generally appropriate choice of scores. This corresponds to a "match" score of 5 and a "mismatch" score of -4. Suggestions for scoring alignments when transitions and transversions are scored differently are included in the article by States et al. (1991).
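Under the log-odds interpretation, a match/mismatch pair of scores implies a particular fraction of identical positions in the alignments it is best suited to find. The sketch below assumes equal frequencies for the four bases (an assumption that real sequences need not satisfy). It solves numerically for the scale factor that makes the implied pair frequencies sum to one, then reports the implied fraction of matches. All names are hypothetical.

```python
import math

MATCH, MISMATCH = 5, -4
P_MATCH_RANDOM = 0.25       # chance that two random bases match (uniform bases)
P_MISMATCH_RANDOM = 0.75    # chance that two random bases differ

def excess(lam):
    """Sum of implied pair frequencies minus 1 for scale factor lam."""
    return (P_MATCH_RANDOM * math.exp(lam * MATCH)
            + P_MISMATCH_RANDOM * math.exp(lam * MISMATCH) - 1.0)

def solve_lambda(lo=1e-6, hi=1.0, iterations=100):
    """Find the positive root of excess() by bisection."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

lam = solve_lambda()
target_identity = P_MATCH_RANDOM * math.exp(lam * MATCH)
print(round(lam, 4), round(target_identity, 3))
```

Under these assumptions the scores 5 and -4 are tuned to alignments in which roughly two thirds of the positions are identical, so more divergent nucleotide sequences call for different scores.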

Selecting the appropriate scoring method is often problem dependent. For information on selecting a PAM matrix for a particular case, consult Altschul (1991).

Scoring Matrices References

General discussion of properties of widely used matrices including both PAM and Blosum matrices.

"Amino acid substitution matrices from an information theoretic perspective." Altschul, S.F. 1991 Journal of Molecular Biology 219: 555-565.

PAM matrices for proteins.

"A model of evolutionary change in proteins." Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978) In "Atlas of Protein Sequence and Structure" 5(3) M.O. Dayhoff (ed.), 345 - 352, National Biomedical Research Foundation, Washington.

PAM matrices for nucleic acids.

"Improved Sensitivity of Nucleic Acid Database Search Using Application-Specific Scoring Matrices" States, D.J., Gish, W., Altschul, S.F. 1991 Methods: A companion to Methods in Enzymology 3(1): 66 - 77.

Blosum (Block sums) matrices.

"Amino acid substitution matrices from protein blocks." Steven Henikoff and Jorja G. Henikoff. 1992 Proc. Natl. Acad. Sci. USA. 89(biochemistry): 10915 - 10919.

Comparison of Amino Acid substitution matrices with visual representation of the differences.

"A Structural Basis of Sequence Comparisons: An evaluation of scoring methodologies." 1993 M.S. Johnson and J.P. Overington. Journal of Molecular Biology. 233: 716 - 738.

"Performance Evaluation of Amino Acid Substitution Matrices." 1993 Steven Henikoff and Jorja G. Henikoff. Proteins: Structure, Function, and Genetics. 17: 49 - 61.