MSA - Multiple Sequence Alignment

A Global Optimal Multiple Sequence Alignment Program

Version 2.1 (PSC revision b)

This document was written by the Pittsburgh Supercomputing Center



  1. Introduction
  2. Program Usage
  3. Optional Parameters
  4. Interpreting Results
  5. Common Problems
  6. References

Introduction

MSA is a global optimal multiple sequence alignment program originally written by John Kececioglu, Stephen Altschul, David Lipman & Robert Miner. Credit for improvements in release 2.0 of the code belong to SK Gupta and AA Schaffer with some guidance from JD Kececioglu.

MSA utilizes a variant of a multi-dimensional dynamic programming to produce an optimal global alignment between several sequences. MSA utilizes advanced projective geometry techniques to greatly reduce the amount of space required to solve the problem in an optimal way. Generally speaking, MSA will produce better alignments than most multiple sequence alignment programs such as Clustal or Pileup. The drawback with using MSA is that it requires an enormous amount of both computer time and memory to align more than a few distantly related sequences. However, we have been able to use MSA to optimally align 20 Phospholipase A2 sequences (approximately 130 residues), 14 Cytochrome C sequences (approximately 110 residues), 6 Aspartal proteases (approximately 350 residues), and 8 Lipid binding proteins (approximately 480 residues) on our supercomputers. All of these problems approached the limits of the problems that can be solved optimally by the MSA program. The size of the problems solved by MSA are directly related to the sequence lengths, the number of sequences, and the amount of sequence diversity. At the PSC we have three versions of MSA compiled for problems of different sizes:

  • msa_50_150 - Will align no more than 50 sequences. Each sequence is less than 150 residues.
  • msa_25_500 - Will align no more than 25 sequences. Each sequence is less than 500 residues.
  • msa_10_1000 - Will align no more than 10 sequences. Each sequence is less than 1000 residues.

It is important that you understand how the MSA program space reduction algorithm works for you to select an appropriate problem and to reasonably set the program parameters. First, lets look at aligning two sequences using dynamic programming. In a straightforward implimentation, this problem would require memory proportional to the lengths of the sequences. Thus if sequence A had length L and sequence B had length M this problem could be solved using N x M cells of memory. For example if sequence A was 8 residues long and sequence B was 5 residues long, 40 memory cells would be needed:


                             Sequence  A
                     1   2   3   4   5   6   7   8           
            S       -------------------------------
            e    1 |   |   |   |   |   |   |   |   |
            q      |---|---|---|---|---|---|---|---|
            u    2 |   |   |   |   |   |   |   |   |
            e      |---|---|---|---|---|---|---|---|
            n    3 |   |   |   |   |   |   |   |   |
            c      |---|---|---|---|---|---|---|---|
            e    4 |   |   |   |   |   |   |   |   |
                   |---|---|---|---|---|---|---|---|
            B    5 |   |   |   |   |   |   |   |   |
                    -------------------------------

If we were to align 3 residue third sequence, sequence C, with the original two sequences we would need 8x5x3=120 memory cells:

                                     Sequence  A
                           1   2   3   4   5   6   7   8    
                          -------------------------------      
                      3 /   /   /   /   /   /   /   /   /| 
                       /---/---/---/---/---/---/---/---/ |     
       Sequence C   2 /   /   /   /   /   /   /   /   /|/|
                     /---/---/---/---/---/---/---/---/ | |    
                  1 /   /   /   /   /   /   /   /   /|/|/| 
            S      |-------------------------------| | | |
            E    1 |   |   |   |   |   |   |   |   |/|/|/|
            Q      |---|---|---|---|---|---|---|---| | | |
            U    2 |   |   |   |   |   |   |   |   |/|/|/|
            E      |---|---|---|---|---|---|---|---| | | |
            N    3 |   |   |   |   |   |   |   |   |/|/|/
            C      |---|---|---|---|---|---|---|---| | | 
            E    4 |   |   |   |   |   |   |   |   |/|/ 
                   |---|---|---|---|---|---|---|---| |
            B    5 |   |   |   |   |   |   |   |   |/
                    ------------------------------- 

If we had a good global alignment between two related sequences, the memory cells that contain this alignment would be near the center diagonal of the two dimensional grid. Likewise if we had a good global alignment between three related sequences we would expect its alignment path to make use of the memory cells near the center of the cube.

MSA restricts the amount of memory by computing bounds that approximate the center of a multi-dimensional hypercube. The first bound is producing by computing pairwise alignments between the set of sequences. Weights are usually applied to this value to produce the lower bound used by the program. Next a heuristic alignment is produced for the sequences. Weights are usually applied to this value to produce the upper bound used by the program. A delta value is then computed to be the difference between these two values. The epsilon values shown by the program is the computed delta value broke down per pairwise alignment. To produce good optimal alignments, epsilon and delta are the two most important parameters that you need to pay attention to. The delta and epsilon values are preliminary measures of the divergence between the set of sequences. Thus, closely related sequences will have low epsilons and deltas while distantly related sequences will have high epsilons and deltas.

For best results running the program, first produce only a heuristic alignment with your set of sequences. Examine the alignment and the epsilons. If you are only using a few short sequences and the computed epsilons are relatively low (epsilons < 50) then the sequences are closely related. Continue on to produce an optimal alignment. If you are using a more complicated set of sequences, a ramping strategy is suggested. First, align three of your sequences optimally, than four, then five ... etc., until you exceed 32Mw of memory. Then double, triple, or quadruple the memory requirements and add one last sequence. You may or may not be able to get this last sequence to align. If you are dealing with long sequences, you may want to divide your sequences into two or three subregions, and align the subregions sepparately.

Program Usage

msa [-m] [-g] [-b] [-o] [-d<int>] [-a<int>] [-e <filename>] [-H <filename>][-O <filename>] [-c <filename>] [-f <filename>] <input filename>

Optional Parameters

Many options can be specified with the MSA program. Below is a list of the options availiable to the MSA code installed at the PSC.

-m
Turns off the optimal multiple alignment segment of the program. This allows the user to see the heuristic alignment and other data produced by the program before producing an optimal multiple alignment.

-g
Charges terminal gaps the same as internal gaps. As a default, no charge is made for the existance of a terminal gap.

-b
The cost of a multiple alignment is taken to be the unweighted sum of all the pairwise alignments. In the absense of this flag, the program estimates an evolutionary tree and uses it to assign weights to each pairwise alignment using either rationale-1 or rational-2 as described in Altschul et al., J. Molec. Biol. 208 (1989). Which of these rationales is used depends on a flag set in "defs.h" at compilation time. (At the PSC rationale-1 weights are used.)

-o
Suppresses output to standard error concerning the progress of the program.

-d<delta>
Specifies the maximum delta value explained in the introduction. The default delta value is computed as the difference between the (weighted) heuristic alignment cost and the (weighted) pairwise alignment costs. This is specified by an integer value. Note that there is not a space between the -d and the value.

-a<epsilon increment>
Specifies the epsilon increment. (See the introduction for more information on the epsilons). Use of this option will add a constant integer value to all of the epsilons computed by the program. Note that there is not a space between the -a and the increment value. (PSC ADDED OPTION)

-e <epsilon file>
User specified epsilons for each pairwise alignment. As a default, the program calculates the epsilons. Frequently the "optimal multiple alignment" will be found to have observed epsilons exceeding those supplied or calculated. When this is the case, it is advisable to rerun the program using suitably augmented epsilons. The file named here should have integers separated by spaces or newlines or both, with one integer for each pair of sequences in the order 1-2, 1-3, ... , 1-N, 2-3, ... , (N-1)-N.

-c <cost file>
Allows the user to specify the cost for a gap, as well as the cost for aligning any pair of letters or a letter with a null. The default is PAM-250 costs for protein sequences, using the one-letter code. The format of this file is an integer, followed by all possible pairs of alligned symbols followed by their cost. For example, the file might begin as follows:

	3
	- - 0
	- A 1
	A C 2
 

This would specify a cost of 0 for aligning a null symbol with another null symbol, a cost of 1 for aligning an A with a null symbol, etc., and an additional cost of 3 for the existence of a gap. The program assumes the costs are symmetric, so that there is no need to have a line for C A as well as for A C. All costs must be non-negative integers.

-H <msf file>
This option will cause the program to write the heuristic alignment into the GCG MSF file format. (PSC ADDED OPTION)

-O <msf file>
This option will cause the program to write the optimal alignment into the GCG MSF file format. (PSC ADDED OPTION)

-f <fixer file>
Allows the user to force the alignment of certain residues. The file referred to must have one or more lines of the following format:

seqs.| "S" precedes block start | "L" precedes block length

For example, the line:

 
	2 3 5  	S 22 21 25 	L 10  	S 35 36 41 	L 1

would force positions 22 to 31 of sequence 2 to be aligned with positions 21 to 30 of sequence 3 and positions 25 to 34 of sequence 5; it would alse force position 35 of sequence 2 to be aligned with position 36 of sequence 3 and position 41 of sequence 5. Needless to say, all positions forced into alignment must be mutually consistent.

<filename>
This is a file containing the sequences to be aligned. Sequences may be in the GenBank, EMBL/Swiss-Prot, NBRF-PIR or Fasta file formats. (PSC ENHANCED OPTION)

Interpreting Results

Below is a sample run, computing an optimal alignment along with an explanation of the results:

MSA release 2.1  (PSC revision b) started on Thu Jun 19 14:55:31 1997
Sequence file format is Fasta.
Calculating pairwise alignments.
..........
**********
Calculating weights.

[EXPLANATION]

---------------- Tree given from ancestor ----------------
On the left:   Internal Node  Distance to parent =  278.83
On the left:   Internal Node  Distance to parent =   23.63
On the left:   Internal Node  Distance to parent =  118.62
On the left:   SEQ#01         Distance to parent =  230.50
On the right:  SEQ#04         Distance to parent =  205.50
On the right:  SEQ#05         Distance to parent =  238.37
On the right:  SEQ#02         Distance to parent =  256.17
On the right:  SEQ#03         Distance to parent =    0.00

Calculating epsilons.

[EXPLANATION]

Sequence     ID       Description
   1       SEQ#01     P1;1POA Phospholipase a2 (EC 3.1.1.4) - Chinese cobra
   2       SEQ#02     P1;1POD Phospholipase a2 (EC 3.1.1.4) - human
   3       SEQ#03     P1;1PPA Phospholipase a2 (EC 3.1.1.4) lys 49 variant 
   4       SEQ#04     P1;1BPQ phospholipase A2 (EC 3.1.1.4) mutant (K56M) -
   5       SEQ#05     P1;1PP2R phospholipase A2 (EC 3.1.1.4) (calcium-free)

[EXPLANATION]


                 ***  Heuristic Multiple Alignment  ***

**********23541      ********************************35214 **35214      ***
NLYQFKNMIQCTVPSR-SWWDFADYGCYCGRGGSGTPVDDLDRCCQVHDNCYNEAEKISGC-----WPYFKTYSY
NLVNFHRMIK-LTTGKEAALSYGFYGCHCGVGGRGSPKDATDRCCVTHDCCYKRLEK-RGC-----GTKFLSYKF
SVLELGKMIL-QETGKNAITSYGSYGCNCGWGHRGQPKDATDRCCFVHKCCYKKLT---DC-----NHKTDRYSY
ALWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVDDLDRCCQTHDNCYKQAMKLDSCKVLVDNPYTNNYSY
SLVQFETLIM-KIAGRSGLLWYSAYGCYCGWGGHGLPQDATDRCCFVHDCCYGKAT---DC-----NPKTVSYTY

*********35214 *****************14325                     
ECSQGTLTCKGGNNACAAAVCDCDRLAAICFAG--APYNDNDYNINLKARC-------
SNSGSRITC-AKQDSCRSQLCECDKAAATCFARNKTTYNKKYQYYS-NKHCRGSTPRC
SWKNKAIIC-EEKNPCLKEMCECDKAVAICLRENLDTYNKKYKAYF-KLKCKKPDT-C
SCSNNEITCSSENNACEAFICNCDRNAAICFSK--VPYNKEHKNLD-KKNC-------
SEENGEIIC-GGDDPCGTQICECDKAAAICFRDNIPSYDNKYWLFP-PKDCREEPEPC

Calculating pairwise projection costs.
..........
**********

Calculating multiple alignment.
....1....2....3....4....5....6....7....8....9....0
**************************************************

[EXPLANATION]

                  ***  Optimal Multiple Alignment  ***

NLYQFKNMIQCTVPSR-SWWDFADYGCYCGRGGSGTPVDDLDRCCQVHDNCYNEAEKISGC-----WPYFKTYSY
NLVNFHRMIK-LTTGKEAALSYGFYGCHCGVGGRGSPKDATDRCCVTHDCCYKRLEK-RGC-----GTKFLSYKF
SVLELGKMIL-QETGKNAITSYGSYGCNCGWGHRGQPKDATDRCCFVHKCCYKKL---TDC-----NHKTDRYSY
ALWQFNGMIKCKIPSSEPLLDFNNYGCYCGLGGSGTPVDDLDRCCQTHDNCYKQAMKLDSCKVLVDNPYTNNYSY
SLVQFETLIM-KIAGRSGLLWYSAYGCYCGWGGHGLPQDATDRCCFVHDCCYGKA---TDC-----NPKTVSYTY

ECSQGTLTCKGGNNACAAAVCDCDRLAAICFAG--APYNDNDYNINLKARC-------
SNSGSRITC-AKQDSCRSQLCECDKAAATCFARNKTTYNKKYQYYS-NKHCRGSTPRC
SWKNKAIIC-EEKNPCLKEMCECDKAVAICLRENLDTYNKKYKAYF-KLKCK-KPDTC
SCSNNEITCSSENNACEAFICNCDRNAAICFSK--VPYNKEHKNLD-KKNC-------
SEENGEIIC-GGDDPCGTQICECDKAAAICFRDNIPSYDNKYWLFP-PKDCREEPEPC

[EXPLANATION]

End gaps not penalized.
Costfile:                   pam250
Alignment cost:   35132     Lower bound:   34945
Delta:              187     Max. Delta:      285

Sequences  Proj. Cost  Pair. Cost  Epsilon  Max. Epsi.  Weight  Weight*Cost
  1   2        1864        1825       39        39         1        1864
  1   3        1891        1843       48        57         1        1891
  1   4        1654        1653        1         5         4        6616
  1   5        1814        1787       27        28         2        3628
  2   3        1735        1733        2         8         4        6940
  2   4        1876        1866       10        10         1        1876
  2   5        1713        1712        1         8         2        3426
  3   4        1901        1889       12        21         1        1901
  3   5        1648        1648        0        11         2        3296
  4   5        1847        1842        5         6         2        3694
Elapsed time =   0.895
Tree
A tree is given for the heuristic alignment. The tree illustrated corresponds to:

                             SEQ#03
                              /
                             /
                        278-/
                           /
                          /
                         *
                     23-/ \
                       *   \
                  118-/ \   \-256
                     /   \   \ 
                    * 238-\   \
                   / \     \ SEQ#02  
              230-/   \   SEQ#05
                 / 205-\
                /       \
             SEQ#01    SEQ#04

Sequences
These are the sequences being aligned. The order presented here is the same order presented in the alignments.

Heuristic Alignment
This is the heuristic alignment computed by the program. The heuristic alignment is produced by using the consistant positions with all of the the pairwise alignments in conjunction with a progressive pairwise alignment technique across the inconsistant positions. The * at the top of a column indicates the column is consistant with the pairwise alignments at that particular column. Numbers and spaces indicate inconsistant positions.

Optimal Alignment
This is the optimal alignment computed by the program. This alignment is usually (but not always) different from the heuristic alignment.

Parameters and Values
The parameters and values computed by the program are shown here. When an epsilon value exceeds the maximum epsilon value, the program should be rerun using larger epsilons.

Common Problems

As the MSA program is a complicated program, many problems can occour. These errors are usually reported in your cray log file (sometimes called a .CPR file.) Please examine your log file for every MSA run! Some common errors are descrbed below along with their not-so-obvious causes and solutions:

Memory Errors

not enough space
You did not request enough memory to start the MSA program. Increase the memory parameters

Cannot allocate memory.
You did not request enough memory to run the MSA program. Before you do anything, take a look at your sequences:

  • Are you aligning long sequences? Sequences that are greater than 300 residues will require a larger amount of memory to align.

  • Are you aligning many sequences? Keep in mind that the memory requirements can grow dramatically (double, triple, quadruple, etc.) when more sequences are added to the equation.

  • Are you performing an optimal alignment? This memory error is common when performing an optimal alignment and rare when only performing a heuristic alignment. If you get this error, first get a sucessful heuristic alignment and examine the computed epsilons.

  • Are your sequences VERY diverse? Your sequences may be too diverse to be aligned optimally. If this is the cause, first get a sucessful heuristic alignment and examine the computed epsilons.

After you have identified the cause of the problem, there may be a way that you can cope with this error. The quickest 'fix' may be to simply increase the memory parameters. However if you are already using over 32Mw of memory and are still getting this error, a different strategy is indicated. First, examine the epsilons. If a single sequence is causing high epsilons to be computed, try rerunning the problem without this sequence. If the smaller problem can be aligned, than you may need to double, triple, or quadruple the memory requirements to get the entire set of sequences to align. On the other hand, the problem may be complicated enough to require more resources than are availiable on our supercomputers. If a single sequence is not the cause, then a ramping strategy is suggested. First, align three of your sequences optimally, than four, then five ... etc., until you exceed 32Mw of memory. Then double, triple, or quadruple the memory requirements and add one last sequence. You may or may not be able to get this last sequence to align. If you are dealing with long sequences, you may want to divide your sequences into two or three subregions, and align the subregions separately.

Alignment Errors

Heuristic alignment segment is longer than XXX
Your sequences are very diverse!. See if you can eliminate the most diverse sequence and try again.

Multiple alignment within bound does not exist.
This is a common error. It is produced because the optimal alignment can not be found within the space allocated. A larger search space is needed to find the optimal alignment. Remember that two parameters, the epsilons and the delta are used to increase the search space. You may want to try increasing the epsilons using the epsilon increment option. or you may want to re-run the program using a larger delta. A third option is to compute a heuristic alignment, examine the computed epsilons, and enter your own epsilons into the program. Keep in mind that increasing the epsilons and the delta will cause the program to use more memory.

No alignment is produced
Usually no alignment is produced when you exeed the job or process CPU time limits.

Miscellaneous Errors

CPU time limit exceeded
The solution to this problem usually is to increase the time limit. However, if you are computing an optimal alignment, first look at the optimal alignment tick line. The asterics indicate approximately how hong the program will need to be run. If you made it past 5 then double the time requirements. If you only made it past 3.5 then triple the time requirements. Please use caution! You may want to first follow the strategies indicated under Cannot allocate memory.

Cannot open XXX
This message means that a filename could not be opened.

Block starts do not equal sequence number
This message means that the program did not read in the fixer file correctly. Sometimes the program misinterprets spaces found at the end of the fixer line as a block start point.

Epsilon must be positive
The epsilon value that was read into the program was negative. Epsilons must be positive.

Cannot exceed XXX sequences.
The number of sequences that can be aligned is a compile-time option. If you need to have the program readjusted, please contact the PSC hotline.

Cannot exceed XXX characters per sequence.
The number of residues per sequence is a compile-time option. If you need to have the program readjusted, please contact the PSC hotline.

Sequences are in an unknown file format
The file containing the sequences must be in either GenBank, EMBL/Swiss-Prot, NBRF-PIR or Fasta file formats. Please check your sequence file.

References

MSA Release 1.x

John Kececioglu, Stephen Altschul, David Lipman & Robert Miner

Please send correspondence to: altschul@tech.nlm.nih.gov or
Stephen F. Altschul
National Center for Biotechnology Information
National Library of Medicine
Bldg. 38A, Room 8N-811
Bethesda, MD 20894
(301) 496-2475

MSA Release 2.x

S. Gupta, J. Kececioglu & A. Schaffer

The primary distribution site for MSA is:
fastlink.nih.gov

The MSA mailing list is now officially:
msa-list@fastlink.nih.gov

Alejandro Schaffer
schaffer@helix.nih.gov

  • Carrillo & Lipman, "The Multiple Sequence Alignment Problem in Biology", SIAM J. Appl. Math. 48 (1988) 1073-1082;

  • Altschul & Lipman, "Trees, Stars, and Multiple Biological Sequence Alignment", SIAM J. Appl. Math. 49 (1989) 197-209;

  • Altschul, "Gap Costs for Multiple Sequence Alignment", J. Theor. Biol. 138 (1989) 297-309;

  • Altschul, Carroll & Lipman, "Weights for Data Related by a Tree", J. Molec. Biol. 207 (1989) 647-653;

  • Altschul, "Leaf Pairs and Tree Dissections", SIAM J. Discrete Math. 2 (1989) 293-299;

  • Lipman, Altschul & Kececioglu, "A Tool for Multiple Sequence Alignment", Proc. Natl. Acad. Sci. USA 86 (1989) 4412-4415.

  • Gupta, Kececioglu & Schaffer, "Improving the Practical Time and Space Efficiency of the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment", J. Computational Biology 2(1995) 459-472.



Pittsburgh Supercomputing Center National Resource for Biomedical Supercomputing
300 S. Craig St. Pittsburgh, PA 15213. Phone: 412-268-4960, Email: biomed@psc.edu

Please send suggestions for improving this code and error reports to biomed@psc.edu.

biomed-www@psc.edu (last updated: 10/22/97) © 1997 Pittsburgh Supercomputing Center