BIOMEDICAL INITIATIVE 		PITTSBURGH SUPERCOMPUTING CENTER 		                
4 -  1


Clustal - Multiple Sequence Alignments.
Documentation  --  Usage
Des Higgins 	

European Molecular Biology Laboratory 

Postfach 10.2209 D-6900 Heidelberg  Germany.
higgins@EMBL-Heidelberg.DE

Note: This documentation has been modified at the Pittsburgh
Supercomputing Center by removing the material on installing the
Clustal software on different computers.  Those instructions are
available if you need them.  It is best if you obtain a complete copy
of of the most recent version of the program and documentation from
its authors if you want to install it on your own computer.

Contents.
 	1		Overview
	2		Interactive usage
	3		Command-line interface
	4		Algorithms and references


1.  Overview

This document describes how to install and use ClustalV on various
machines. ClustalV is a complete upgrade and rewrite of the Clustal
package of multiple alignment programs (Higgins and Sharp, 1988 and
1989).  The original programs were written in Fortran for
microcomputers running MSDOS.  You carried out a complete alignment by
running 3 programs in succession.  Later, these were merged into a
single menu driven program with on-line help, for VAX/VMS.  ClustalV
was written in C and has all of the features of the old programs plus
many new ones.  It has been compiled and tested using VAX/VMS C,
Decstation ULTRIX C, Gnu C for Sun workstations, Turbo C for IBM PC's
and Think C for Apple Mac's.  The original Clustal was written by Des
Higgins while he was a Post-Doc in the lab of Paul Sharp in the
Genetics Department, Trinity College, Dublin 2, Ireland.

The main feature of the old package was the ability to carry out
reliable multiple alignments of many sequences.  The sensitivity of
the program is as good as from any other program we have tried, with
the exception of the programs of Vingron and Argos (1991), while it
works in reasonable time on a microcomputer.  The programs of Vingron
and Argos are specialised for finding distant similarities between
proteins but require mainframes or workstations and are more difficult
to use.

The main new features are: profile alignments (alignments of old
alignments); phylogenetic trees (Neighbor Joining trees calculated
after multiple alignment with a bootstrapping option); better sequence
input (automatically recognise and read NBRF/PIR, Pearson (Fasta) or
EMBL/SwissProt formats); flexible alignment output (choose one of: old
Clustal format, NBRF/PIR, GCG msf format or Phylip format); full
command line interface (everything that you can do interactively can
be specified on the command line).

In version 7 of the GCG package, there is a program called PILEUP
which uses a very similar algorithm to the one in ClustalV.  There are
2 main differences between the programs: 1) the metric used to compare
the sequences for the initial "guide tree" uses a full global, optimal
alignment in PILEUP instead of the fast, approximate ones in ClustalV.
This makes PILEUP much slower for the comparison of long sequences.
In principle, the distances calculated from PILEUP will be more
sensitive than ours, but in practice it will not make much difference,
except in difficult cases.  2) During the multiple alignment, terminal
gaps are penalised in ClustalV but not in PILEUP. This will make the
PILEUP alignments better when the sequences are of very different
lengths (has no effect if there are no large terminal gaps).
 
This software may be distributed and used freely, provided that you do
not modify it or this documentation in any way without the permission
of the authors.

If you wish to refer to ClustalV, please cite: Higgins,D.G.
Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL V: improved software for
multiple sequence alignment. CABIOS, vol .8, 189-191.

The overall multiple alignment algorithm was described in:
Higgins,D.G. and Sharp,P.M. (1989).  Fast and sensitive multiple
sequence alignments on a microcomputer.  CABIOS, vol. 5, 151-153.
 
ACKNOWLEDGEMENTS.

D.H. would particularly like to thank Paul Sharp, in whose lab. this
work originated.  We also thank Manolo Gouy, Gene Myers, Peter Rice
and Martin Vingron for suggestions, bug-fixes and help.  Des Higgins
and Rainer Fuchs, EMBL Data Library, Heidelberg, Germany.  Alan
Bleasby, Daresbury, UK.
JUNE 1991 


2.  Interactive usage.
     Running Clustal at the Pittsburgh Supercomputing Center.
         I.  On the VAX/VMS systems - Interactive use

             A.  Enter the command:  Setup sequence

                    This allows you access to the Clustal W program
along with several other programs useful for sequence analysis.  It
also allows on-line access to the Clustal documentation.  This access
is described in the message that is displayed in response to this
command.

             B.  Enter the command:  Clustalw

                    This starts the program in interactive mode and
displays the first menu on your screen.  This and other menus are
described below.
 
Interactive usage of Clustal V is completely menu driven.  On-line
help is provided, defaults are offered for all parameters and file
names.  With a little effort it should be completely self explanatory.
The main menu, which appears when you run the programs is shown below.
Each item brings you to a sub menu.
  
Main menu for Clustal V:
 
     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile Alignments
     4. Phylogenetic trees
     S. Execute a system command
     H. HELP
     X. EXIT (leave program)
 
Your choice:
 
The options S and H appear on all the main menus.  H will provide help
and if you type S you will be asked to enter a command, such as DIR or
LS, which will be sent to the system (does not work on Mac's).  Before
carrying out an alignment, you must use option 1 (sequence input); the
format for sequences is explained below.  Under menu item 2 you will
be able to automatically align your sequences to each other.  Menu
item 3 allows you to do profile alignments.  These are alignments of
old alignments.  This allows you to build up a multiple alignment in
stages or add a new sequence to an old alignment.  You can calculate
phylogenetic trees from alignments using menu item 4.
 
 
 
SEQUENCE INPUT 

All sequences should be in 1 file.  Three formats are automatically
recognised and used: NBRF/PIR, EMBL/SwissProt and FASTA (Pearson and
Lipman (1988) format).

*** Users of the Wisconsin GCG package should use the command TONBRF
(recently changed to TOPIR) to reformat their sequences before use.
***

Sequences can be in upper or lower case.  For proteins, the only
symbols recognised are: A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y and
for DNA/RNA use: A,C,G and T (or U).  Any other letters of the
alphabet will be treated as X (proteins) or N (DNA/RNA) for unknown.
All other symbols (blanks, digits etc.) will be ignored EXCEPT for the
hyphen "-" which can be used to specify a gap.  This last point is
especially useful for 2 reasons: 1) you can fix the positions of some
gaps in advance; 2) the alignment output from this program can be
written out in NBRF format using "-"'s to specify gaps; these
alignments can be used again as input, either for profile alignments
or for phylogenetic trees.

If you are using an editor to create sequence files, use the FASTA
format as it is by far the simplest (see below).  If you have access
to utility programs for generating/converting the NBRF/PIR format then
use it in preference.
 
 
FASTA (PEARSON AND LIPMAN, 1988) FORMAT: The sequences are delimited
by an angle bracket ">" in column 1.  The text immediately after the
">" is used as a title.  Everything on the following line until the
next ">" or the end of the file is one sequence.

e.g.
> RABSTOUT   rabbit Guinness receptor      
   LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS    
   ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC 
> MUSNOSE   mouse nose drying factor     
   mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt     
   fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfdv 
> HSHEAVEN    human Guinness receptor repeat  
 mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt   
 fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv
 mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt   
 fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv
 
 
NBRF/PIR FORMAT is similar to FASTA format but immediately after the
">", you find the characters "P1;" if the sequences are protein or
"DL;" if they are nucleic acid.  Clustalv looks for the ";" character
as the third character after the ">".  If it finds one it assumes that
the format is NBRF if not, FASTA format is assumed.  The text after
the ";" is treated as a sequence name while the entire next line is
treated as a title.  The sequence is terminated by a star "*" and the
next sequence can then begin (with a >P1; etc ).  This is just the
basic format description (there are other variations and rules).

ANY files/sequences in GCG format can be converted to this format
using the TONBRF command (now TOPIR) of the Wisconsin GCG package.

e.g.
>P1;RABSTOUT rabbit Guinness receptor
LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC* 

>P1;MUSNOSE mouse nose drying factor
mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfd * 

>P1;HSHEAVEN human Guinness receptor repeat protein.
mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv

mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv*
 
EMBL/SWISSPROT FORMAT: Do not try to create files with this format
unless you have utilities to help.  If you are just using an editor,
use one of the above formats.  If you do use this format, the program
will ignore everything between the ID line (line beginning with the
characters "ID") and the SQ line.  The sequence is then read from
between the SQ line and the "//" characters.
 
It is critically important for the program to know whether or not it
is aligning DNA or protein sequences.  The input routines attempt to
guess which type of sequence is being used by counting the number of
A,C,G,T or U's in the sequences.  If the total is more than 85% of the
sequence length then DNA is assumed.  If you use very bizarre
sequences (proteins with really strange aa compositions or DNA
sequences with loads of strange ambiguity codes) you might confuse the
program.  It is difficult to do but be careful.

MULTIPLE ALIGNMENT MENU.

The multiple alignment menu is shown below.  Before explaining how to
use it, you must be introduced briefly to the alignment strategy.  If
you do not follow this, try using option 1 anyway; the entire process
will be carried out automatically.

To do a complete multiple alignment, we need to know the approximate
relationships of the sequences to each other (which ones are most
similar to each other).  We do this by calculating a crude
phylogenetic tree which we call a dendrogram (to distinguish it from
the more sensitive trees available under the phylogenetic tree menu).
This dendrogram is used as a guide to align bigger and bigger groups
of sequences during the multiple alignment.  

The dendrogram is calculated in 2 stages:

1) all pairs of sequence are compared using the fast/approximate
method of Wilbur and Lipman (1983); the result of each comparison is a
similarity score.

2) the similarity scores are used to construct the dendrogram using
the UPGMA cluster analysis method of Sneath and Sokal (1973).

The construction of the dendrogram can be very time consuming if you
wish to align many sequences (e.g.  for 100 sequences you need to
carry out 100x99/2 sequence comparisons =3D 4950). During every multiple
alignment, a dendrogram is constructed and saved to a file
(something.dnd).  These can be reused later.
 
******Multiple*Alignment*Menu******
 
     1.  Do complete multiple alignment now
     2.  Produce dendrogram file only
     3.  Use old dendrogram file
     4.  Pairwise alignment parameters
     5.  Multiple alignment parameters
     6.  Output format options
     S.  Execute a system command
     H.  HELP     or press [RETURN] to go back to main menu
 
Your choice:
 
So, if in doubt, and you have already loaded some sequences from the
main menu, just try option 1 and press the  key in response to
any questions.  You will be prompted for 2 file names e.g. if the
sequence input file was called DRINK.PEP, you will be offered
DRINK.ALN as the file to contain the alignment and DRINK.DND for the
dendrogram.

If you wish to repeat a multiple alignment (e.g. to experiment with
different gap penalties) but do not wish to make a dendrogram all over
again use menu item 3 (providing you are using the same sequences).
Similarly, menu item 2 allows you to produce the dendrogram file only.
 
PAIRWISE ALIGNMENT PARAMETERS:

The parameters that control the initial fast/approximate comparisons
can be set from menu item 4 which looks like:
 
 ********* WILBUR/LIPMAN PAIRWISE ALIGNMENT PARAMETERS *********
 
      1. Toggle Scoring Method  :Percentage
      2. Gap Penalty            :3
      3. K-tuple                :1
      4. No. of top diagonals   :5
      5. Window size            :5
      H. HELP
 
Enter number (or [RETURN] to exit):
 
 
The similarity scores are calculated from fast alignments generated by
the method of Wilbur and Lipman (1983).  These are 'hash' or 'word' or
'k-tuple' alignments carried out in 3 stages.

First you mark the positions of every fragment of sequence, K-tuple
long (for proteins, the default length is 1 residue, for DNA it is 2
bases) in both sequences.  Then you locate all k-tuple matches between
the 2 sequences.  At this stage you have to imagine a dot- matrix plot
between the 2 sequences with each k-tuple match as a dot.  You find
those diagonals in the plot with most matches (you take the "No. of
top diagonals" best ones) and mark all diagonals within "Window size"
of each top diagonal.  This process will define diagonal bands in the
plot where you hope the most likely regions of similarity will lie.

The final alignment stage is to find that head to tail arrangement of
k-tuple matches from these diagonal regions that will give the highest
score.  The score is calculated as the number of exactly matching
residues in this alignment minus a "gap penalty" for every gap that
was introduced.  When you toggle "Scoring method" you choose between
expressing these similarity scores as raw scores or expressed as a
percentage of the shorter sequence length.

K-TUPLE SIZE: Can be 1 or 2 for proteins; 1 to 4 for DNA.  Increase
this to increase speed; decrease to improve sensitivity.

GAP PENALTY: The number of matching residues that must be found in
order to introduce a gap.  This should be larger than K-Tuple Size.
This has little effect on speed or sensitivity.

NO. OF TOP DIAGONALS: The number of best diagonals in the imaginary
dot-matrix plot that are considered.  Decrease (must be greater than
zero) to increase speed; increase to improve sensitivity.

WINDOW SIZE: The number of diagonals around each "top" diagonal that
are considered.  Decrease for speed; increase for greater sensitivity.

SCORING METHOD: The similarity scores may be expressed as raw scores
(number of identical residues minus a "gap penalty" for each gap) or
as percentage scores.  If the sequences are of very different lengths,
percentage scores make more sense.
 
 
CHANGING THE PAIRWISE ALIGNMENT PARAMETERS

The main reason for wanting to change the above parameters is SPEED
(especially on microcomputers), NOT SENSITIVITY.  The dendrograms that
are produced can only show the relationships between the sequences
APPROXIMATELY because the similarity scores are calculated from
seperate pairwise alignments; not from a multiple alignment (that is
what we eventually hope to produce). If the groupings of the sequences
are "obvious", the above method should work well; if the relationships
are obscure or weakly represented by the data, it will not make much
difference playing with the parameters.  The main factor influencing
speed is the K-TUPLE SIZE followed by the WINDOW SIZE.

The alignments are carried out in a small amount of memory.
Occasionally (it is hard to predict), you will run out of memory while
doing these alignments; when this happens, it will say on the screen:
"Sequences (a,b) partially aligned" (instead of "Sequences (a,b)
aligned").  This means that the alignment score for these sequences
will be approximate; it is not a problem unless many of the alignments
do this.  It can be fixed by using less sensitive parameters or
increasing parameter FSIZE in clustalv.h .
 
THE DENDROGRAM ITSELF

The similarity scores generated by the fast comparison of all the
sequences are used to construct a dendrogram by the UPGMA method of
Sneath and Sokal (1973).  This is a form of cluster analysis and the
end result produces something that looks like a tree.  It represents
the similarity of the sequences as a hierarchy.  The dendrogram is
written to a file in a machine readable format and is ahown below for
an example with 6 sequences.
 
91.0   0   0   2   012000         ! seq 2 joins seq 3 at 91% ID.
72.0   1   0   3   011200         ! seq 4 joins seqs 2,3 at 72%
71.1   0   0   2   000012         ! seq 5 joins seq 6 at 71%
35.5   0   2   4   122200         ! seq 1 joins seqs 2,3,4
21.7   4   3   6   111122         ! seqs 1,2,3,4 join seqs 5,6

This LOOKS complicated but you do not normally need to care what is in
here. Anyway, each row represents the joining together of 2 or more
sequences.  You progress from the top down, joining more and more
sequences until all are joined together; for N sequences you have N-1
groupings hence there are 5 rows in the above file (there were 6
sequences).  In each row, the first number is the similarity score of
this grouping; ignore the next three columns for the moment; the last
6 digits in the line show which sequences are grouped; there is one
digit for each sequence (the first digit is for the first sequence).
The rule is: in each row, all of the "1"s join all of the "2"s; the
zero's do nothing.

Hence, in the first row, sequence 2 joins sequence 3 at a similarity
level of 91% identity; next, sequence 4 joins the previous grouping of
2 plus 3 at a level of 72% etc.  This is shown diagrammatically below.
Before leaving the dendrogram format, the other 3 columns of numbers
are: a pointer to the row from which the "1" sequences were last
joined (or zero if only one of them); a pointer to the row in which
the "2"s were last joined; the total number of sequences joined in
this line.

 
                       I------ 2
                I------I
                I      I------ 3  Diagram of the sequence similarity
           I----I
           I    I------------- 4  relationships shown in the above
        I--I
        I  I------------------ 1  dendrogram file (branch lengths are
    ----I
        I       I------------- 5  not to scale).
        I-------I
                I------------- 6
 
 
 
MULTIPLE ALIGNMENT PARAMETERS:

Having calculated a dendrogram between a set of sequences, the final
multiple alignment is carried out by a series of alignments of larger
and larger groups of sequences.  The order is determined by the
dendrogram so that the most similar sequences get aligned first.  Any
gaps that are introduced in the early alignments are fixed.  When two
groups of sequences are aligned against each other, a full protein
weight matrix (such as a Dayhoff PAM 250) is used.  Two gap penalties
are offered: a "FIXED" penalty for opening up a gap and a "FLOATING"
penalty for extending a gap.
 
 ********* MULTIPLE ALIGNMENT PARAMETERS *********
 
      1. Fixed Gap Penalty       :10
      2. Floating Gap Penalty    :10
      3. Toggle Transitions (DNA):Weighted
      4. Protein weight matrix   :PAM 250
      H. HELP
 
Enter number (or [RETURN] to exit):
 
FIXED GAP PENALTY: Reduce this to encourage gaps of all sizes;
increase it to discourage them.  Terminal gaps are penalised same as
all others.  BEWARE of making this too small (approx 5 or so); if the
penalty is too small, the program may prefer to align each sequence
opposite one long gap.

FLOATING GAP PENALTY: Reduce this to encourage longer gaps; increase
it to shorten them.  Terminal gaps are penalised same as all others.
BEWARE of making this too small (approx 5 or so); if the penalty is
too small, the program may prefer to align each sequence opposite one
long gap.
 
DNA TRANSITIONS =3D WEIGHTED or UNWEIGHTED: By default, transitions (A
versus G; C versus T) are weighted more strongly than transversions
(an A aligned with a G will be preferred to an A aligned with a C or a
T).  You can make all pairs of nucleotide equally weighted with this
option.

PROTEIN WEIGHT MATRIX: For protein comparisons, a weight matrix is
used to differentially weight different pairs of aligned amino acids.
The default is the well known Dayhoff PAM 250 matrix.  We also offer a
PAM 100 matrix, an identity matrix (all weights are the same for exact
matches) or allow you to give the name of a file with your own matrix.
The weight matrices used by Clustal V are shown in full in the
Algorithms and References section of this documentation.

If you input a matrix from a file, it must be in the following format.
Use a 20x20 matrix only (entries for the 20 "normal" amino acids only;
no ambiguity codes etc.).  Input the lower left triangle of the
matrix, INCLUDING the diagonal.  The order of the amino acids (rows
and columns) must be: CSTPAGNDEQHRKMILVFYW.  The values can be in free
format separated by spaces (not commas).  The PAM 250 matrix is shown
below in this format.

   12
    0  2
   -2  1  3
   -3  1  0  6
   -2  1  1  1  2
   -3  1  0 -1  1  5
   -4  1  0 -1  0  0  2
   -5  0  0 -1  0  1  2  4
   -5  0  0 -1  0  0  1  3  4
   -5 -1 -1  0  0 -1  1  2  2  4
   -3 -1 -1  0 -1 -2  2  1  1  3  6
   -4  0 -1  0 -2 -3  0 -1 -1  1  2  6
   -5  0  0 -1 -1 -2  1  0  0  1  0  3  5
   -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6
   -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5
   -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6
   -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4
   -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9
    0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10
   -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17

Values must be integers and can be all positive or positive and
negative as above.  These are SIMILARITY values.

ALIGNMENT OUTPUT OPTIONS:

By default, the alignment goes to a file in a self explanatory
"blocked" alignment format.  This format is fine for displaying the
results but requires heavy editing if you wish to use the alignment
with other software.  To help, we provide 3 other formats which can be
turned on or off.  If you have a sequence data set or alignment in
memory, you can also ask for output files in whatever formats are
turned on, NOW.  The menu you use to choose format is shown below.

*** We draw your attention to NBRF/PIR format in particular.  This
format is EXACTLY the same as one of the input formats.  Therefore,
alignments written in this format can be used again as input (to the
profile alignments or phylogenetic trees).  ***
 
 ********* Format of Alignment Output *********
 
      1. Toggle CLUSTAL format output   =3D  ON
      2. Toggle NBRF/PIR format output  =3D  OFF
      3. Toggle GCG format output       =3D  OFF
      4. Toggle PHYLIP format output    =3D  OFF
      5. Create alignment output file(s) now?

      H. HELP
 
Enter number (or [RETURN] to exit):
 
CLUSTAL FORMAT: This is a self explanatory alignment.  The alignment
is written out in blocks.  Identities are highlighted and (if you use
a PAM 250 matrix) positions in the alignment where all of the residues
are "similar" to each other (PAM 250 score of 8 or more) are
indicated.

NBRF/PIR FORMAT: This is the usual NBRF/PIR format with gaps indicated
by hyphens ("-"). AS we have stressed before, this format is EXACTLY
compatible with the sequence input format.  Therefore you can read in
these alignments again for profile alignments or for calculating
phylogenetic trees.

GCG FORMAT: In version 7 of the Wisconsin GCG package, a new multiple
sequence format was introduced.  This is the MSF (Multiple Sequence
Format) format.  It can be used as input to the GCG sequence editor or
any of the GCG programs that make use of multiple alignments.  THIS
FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG PACKAGE OR LATER.

PHYLIP FORMAT: This format can be used by the Phylip package of Joe
Felsenstein (see the references/algorithms section for details of how
to get it).  Phylip allows you to do a huge range of phylogenetic
analyses (we just offer one method in this program) and is probably
the most widely used set of programs for drawing trees.  It also works
on just about every computer you can think of, providing you have a
decent Pascal compiler.

PROFILE ALIGNMENT MENU.

This menu is for taking two old alignments (or single sequences) and
aligning them with each other.  The result is one bigger alignment.
The menu is very similar to the multiple alignment menu except that
there is no mention of dendrograms here (they are not needed) and you
need to input two sets of sequences.  The menu looks like this:
 
******Profile*Alignment*Menu******
 
     1.  Input 1st. profile/sequence
     2.  Input 2nd. profile/sequence
     3.  Do alignment now
     4.  Alignment parameters
     5.  Output format options

     S.  Execute a system command
     H.  HELP

     or press [RETURN] to go back to main menu
 
Your choice:
 
You must input profile number 1 first.  When both profiles are loaded,
use item 3 (Do alignment now) and the 2 profiles will be aligned.
Items 4 and 5 (parameters and output options) are identical to the
equivalent options on the multiple alignment menu.

The same input routines that are used for general input are used here
i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format,
with gaps indicated by hyphens ("-").  This is why we have continualy
drawn your attention to the NBRF/PIR format as a useful output format.

Either profile can consist of just one sequence.  Therefore, if you
have a favourite alignment of sequences that you are working on and
wish to add a new sequence, you can use this menu, provided the
alignment is in the correct format.

The total number of sequences in the two profiles must be less less
than or equal to the MAXN parameter set in the clustalv.h header file.
 
PHYLOGENETIC TREE MENU.

This menu allows you to input an alignment and calculate a
phylogenetic tree.  You can also calculate a tree if you have just
carried out a multiple alignment and the alignment is still in memory.
THE SEQUENCES MUST BE ALIGNED ALREADY!!!!!!  The tree will look
strange if the sequences are not already aligned.  You can also
"BOOTSTRAP" the tree to show confidence levels for groupings.  This is
SLOW on microcomputers but works fine on workstations or mainframes.
 
******Phylogenetic*tree*Menu******
 
     1.  Input an alignment
     2.  Exclude positions with gaps?        =3D OFF
     3.  Correct for multiple substitutions? =3D OFF
     4.  Draw tree now
     5.  Bootstrap tree

     S.  Execute a system command
     H.  HELP

     or press [RETURN] to go back to main menu
 
Your choice:
 
The same input routine that is used for general input is used here
i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format,
with gaps indicated by hyphens ("-").  This is why we have continualy
drawn your attention to the NBRF/PIR format as a useful output format.

If you have input an alignment, then just use item 4 to draw a tree.
The method used is the Neighbor Joining method of Saitou and Nei
(1987).  This is a "distance method". First, percent divergence
figures are calculated between all pairs of sequence.  These
divergence figures are then used by the NJ method to give the tree.
Example trees will be shown below.

There are two options which can be used to control the way the
distances are calculated.  These are set by options 2 and 3 in the
menu.

EXCLUDE POSITIONS WITH GAPS?  This option allows you to ignore all
alignment positions (columns) where there is a gap in ANY sequence.
This guarantees that "like" is compared with "like" in all distances
i.e. the same positions are used to calculate all distances.  It also
means that the distances will be "metric".  The disadvantage of using
this option is that you throw away much of the data if there are many
gaps.  If the total number of gaps is small, it has little effect.

CORRECT FOR MULTIPLE SUBSTITUTIONS?  As sequences diverge,
substitutions accumulate.  It becomes increasingly likely that more
than one substitution (as a result of a mutation) will have happened
at a site where you observe just one difference now.  This option
allows you to use formulae developed by Motoo Kimura to correct for
this effect.  It has the effect of stretching long branches in tres
while leaving short ones relatively untouched.  The desired effect is
to try and make distances proportional to time since divergence.

The tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the name
of the input, alignment file.  An example is shown below for 6 globin
sequences.
 
 
 DIST   =3D percentage divergence (/100) 
 Length =3D number of sites used in comparison
    1 vs.   2  DIST =3D 0.5683;  length =3D    139
    1 vs.   3  DIST =3D 0.5540;  length =3D    139
    1 vs.   4  DIST =3D 0.5315;  length =3D    111
    1 vs.   5  DIST =3D 0.7447;  length =3D    141
    1 vs.   6  DIST =3D 0.7571;  length =3D    140
    2 vs.   3  DIST =3D 0.0897;  length =3D    145
    2 vs.   4  DIST =3D 0.1391;  length =3D    115
    2 vs.   5  DIST =3D 0.7517;  length =3D    145
    2 vs.   6  DIST =3D 0.7431;  length =3D    144
    3 vs.   4  DIST =3D 0.0957;  length =3D    115
    3 vs.   5  DIST =3D 0.7379;  length =3D    145
    3 vs.   6  DIST =3D 0.7361;  length =3D    144
    4 vs.   5  DIST =3D 0.7304;  length =3D    115
    4 vs.   6  DIST =3D 0.7368;  length =3D    114
    5 vs.   6  DIST =3D 0.2697;  length =3D    152
 
Neighbor-joining Method

 Saitou, N. and Nei, M. (1987) The Neighbor-joining Method: A New
Method for Reconstructing Phylogenetic Trees.  Mol. Biol. Evol., 4(4),
406-425
 
 This is an UNROOTED tree
 Numbers in parentheses are branch lengths
 
 Cycle   1     =3D  SEQ:   5 (  0.13382) joins  SEQ:   6 (  0.13592)
 Cycle   2     =3D  SEQ:   1 (  0.28142) joins Node:   5 (  0.33462)
 Cycle   3     =3D  SEQ:   2 (  0.05879) joins  SEQ:   3 (  0.03086)
 Cycle   4 (Last cycle, trichotomy):
		 Node:   1 (  0.20798) joins
 		 Node:   2 (  0.02341) joins
 		  SEQ:   4 (  0.04915)
 
 
The output file first shows the percent divergence (distance) figures
between each pair of sequence.  Then a description of a NJ tree is
given.  This description shows which sequences (SEQ:) or which groups
of sequences (NODE: , a node is numbered using the lowest sequence
that belongs to it) join at each level of the tree.

This is an unrooted tree!! This means that the direction of evolution
through the tree is not shown.  This can only be inferred in one of
two ways: 1) assume a degree of constancy in the molecular clock and
place the root (bottom of the tree; the point where all the sequences
radiate from) half way along the longest branch.  **OR** 2) use an
"outgroup", a sequence from an organism that you "know" must be
outside of the rest of the sequences i.e. root the tree manually, on
biological grounds.

The above tree can be represented diagramatically as follows:
 
                          SEQ 1       SEQ 4
                            I           I
           13.6             I 28.1      I 4.9          5.9
   SEQ 6 ----------I        I           I          I--------- SEQ 2
                   I        I           I          I
                   I--------I-----------I----------I
           13.4    I  33.5      20.8         2.3   I   3.1
   SEQ 5 ----------I                               I--------- SEQ 3
 
The figures along each branch are percent divergences along that
branch.  If you root the tree by placing the root along the longest
branch (33.5%) then you can draw it again as follows, this time
rooted:
 
 
                         13.6
                 I-------------------- SEQ 6
       I---------I       13.4
       I         I-------------------- SEQ 5
       I 33.5
  -----I                 28.1
       I         I-------------------- SEQ 1
       I         I
       I---------I                4.9
                 I  20.8  I----------- SEQ 4
                 I--------I
                          I       5.9
                          I 2.3 I----- SEQ 2
                          I-----I 3.1
                                I----- SEQ 3
 
 
The longest branch (33.5% between 5,6 and 1,2,3,4) is split between
the 2 bottom branches of the tree.  As it happens in this particular
case, sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4 are
alpha and beta globins, so you could also justify the above rooting on
biological grounds.  If you do not have any particular need or
evidence for the position of the root, then LEAVE THE TREE UNROOTED.
Unrooted trees do not look as pretty as rooted ones but it is usual to
leave them unrooted if you do not have any evidence for the position
of the root.

 
BOTSTRAPPING: Different sets of sequences and different tree drawing
methods may give different topologies (branching orders) for parts of
a tree that are weakly supported by the data.  It is useful to have an
indication of the degree of error in the tree.  There are several ways
of doing this, some of them rather technical.  We provide one general
purpose method in this program, which makes use of a technique called
bootstrapping (see Felsenstein, 1985).

In the case of sequence alignments, bootstrapping involves taking
random samples of positions from the alignment.  If the alignment has
N positions, each bootstrap sample consists of a random sample of N
positions, taken WITH REPLACEMENT i.e. in any given sample, some sites
may be sampled several times, others not at all.  Then, with each
sample of sites, you calculate a distance matrix as usual and draw a
tree.  If the data very strongly support just one tree then the sample
trees will be very similar to each other and to the original tree,
drawn without bootstrapping.  However, if parts of the tree are not
well supported, then the sample trees will vary considerably in how
they represent these parts.

In practice, you should use a very large number of bootstrap
replicates (1000 is recommended, even if it means running the program
for an hour on a slow microcomputer; on a workstation it will be MUCH
faster).  For each grouping on the tree, you record the number of
times this grouping occurs in the sample trees.  For a group to be
considered "significant" at the 95% level (or P <=3D 0.05 in statistical
terms) you expect the grouping to show up in >=3D 95% of the sample
trees.  If this happens, then you can say that the grouping is
significant, given the data set and the method used to draw the tree.

So, when you use the bootstrap option, a NJ tree is drawn as before
and then you are asked to say how many bootstrap samples you want
(1000 is the default) and you are asked to give a seed number for the
random number generator.  If you give the same seed number in future,
you will get the same results (we hope).  Remember to give different
seed numbers if you wish to carry out genuinely different bootstrap
sampling experiments.  Below is the output file from using the same
data for the 6 globin sequences as used before. The output file has
the same name as the input fike with the extension ".njb".

Bootstrap Confidence Limits
 
 Random number generator seed =3D      99
 Number of bootstrap trials   =3D    1000
 
 Diagrammatic representation of the above tree:
 Each row represents 1 tree cycle; defining 2 groups.
 Each column is 1 sequence; the stars in each line show 1 group;  the dots 
  show the other
 Numbers show occurences in bootstrap samples.

****..   1000 
.***..   1000                <- This is the answer!!
*..***    812
122311
 
For an unrooted tree with N sequences, there are actually only N-3
genuinely different groupings that we can test (this is the number of
"internal branches"; each internal branch splits the sequences into 2
groups).  In this example, we have 6 sequences with 3 internal
branches in the reference tree. In the bootstrap resampling, we count
how often each of these internal branches occur.  Here, we find that
the branch which splits 1,2,3 and 4 versus 5 and 6 occurs in all 1000
samples; the branch which splits 2,3 and 4 versus 1,5 and 6 occurs in
1000; the branch which splits 2 and 3 versus 1,4,5 and 6 occurs in
812/1000 samples.  We can put these figures on to the diagrammatic
representation we made earlier of our unrooted NJ tree as follows:
 
 
                          SEQ 1       SEQ 4
                            I           I
                            I           I
   SEQ 6 ----------I        I           I          I--------- SEQ 2
                   I  1000  I   1000    I   812    I
                   I--------I-----------I----------I
                   I                               I
   SEQ 5 ----------I                               I--------- SEQ 3
 
 
You can equally put these confidence figures on the rooted tree (in
fact the interpretation is simpler with rooted trees).  With the
unrooted tree, the grouping of sequence 5 with 6 is significant (as is
the grouping of sequences 1,2,3 and 4).  Equally the grouping of
sequences 1,5 and 6 is significant (the same as saying that 2,3 and 4
group significantly).  However, the grouping of 2 and 3 is not
significant, although it is relatively strongly supported.

Unfortunately, there is a small complication in the interpretation of
these results.  In statistical hypothesis testing, it is not valid to
make multiple simultaneous tests and to treat the result of each test
completely independantly.  In the above case, if you have one
particular test (grouping) that you wish to make in advance, it is
valid to test IT ALONE and to simply show the other bootstrap figures
for reference.  If you do not have any particular test in mind before
you do the bootstrapping, you can just show all of the figures and use
the 95% level as an ARBITRARY cut off to show those groups that are
very strongly supported; but not mention anything about SIGNIFICANCE
testing.  In the literature, it is common practice to simply show the
figures with a tree; they frequently speak for themselves.
 

3.  Command Line Interface.

You can do almost everything that can be done from the menus, using a
command line interface. In this mode, the program will take all of its
instructions as "switches" when you activate it; no questions will be
asked; if there are no errors, the program just does an analysis and
stops.  It does not work so well on the MAC but is still possible.  To
get you started we will show you the 2 simplest uses of the command
line as it looks on VAX/VMS.  On all other machines (except the MAC)
it works in the same way.

$ clustalv /help           **OR**   $ clustalv /check

Both of the above switches give you a one page summary of the command
line on the screen and then the program stops.
 
$ clustalv proteins.seq    **OR**   $ clustalv /infileDproteins.seq

This will read the sequences from the file 'proteins.seq' and do a
complete multiple alignment.  Default parameters will be used, the
program will try to tell whether or not the sequences are DNA or
protein and the output will go to a file called 'proteins.aln' . A
dendrogram file called 'proteins.dnd' will also be created.  Thus the
default action for the program, when it successfully reads in an input
file is to do a full multiple alignment.  Some further examples of
command line usage will be given leter.

Command line switches can be abbreviated but MAKE SURE YOU DO NOT MAKE
THEM AMBIGUOUS.  No attempt will be made to detect ambiguity.  Use
enough characters to distinguish each switch uniquely.
 
The full list of allowed switches is given below:
 
                DATA (sequences) 

/INFILE=3Dfile.ext :input sequences.  If you give an input file and
nothing else as a switch, the default action is to do a complete
multiple alignment.  The input file can also be specified by giving it
as the first command line parameter with no "/" in front of it e.g $
clustalv file.ext .

/PROFILE1=3Dfile.ext	:You use these two switches to give the names
of /PROFILE2=3Dfile.ext	two profiles.  The default action is to align
the two. You must give the names of both profile files.
 
 
                VERBS (do things)

/HELP  			:list the command line parameters on the screen.  
/CHECK

/ALIGN 			:do full multiple alignment.  This is the
default action if no other switches except for input files are given.

/TREE 			:calculate NJ tree.  If this is the only
action specified (e.g. $ clustalv proteins.seq/tree ) it IS ASSUMED
THAT THE SEQUENCES ARE ALREADY ALIGNED.  If the sequences are not
already aligned, you should also give the /ALIGN switch.  This will
align the sequences first, output an alignment file and calculate the
tree in memory.

/BOOTSTRAP(=3Dn)	:bootstrap a NJ tree (n=3D number of bootstraps; 
default 1000).  If this is the only action specified (e.g. $ clustalv
proteins.seq/bootstrap ) it IS ASSUMED THAT THE SEQUENCES ARE ALREADY
ALIGNED.  If the sequences are not already aligned, you should also
give the /ALIGN switch.  This will align the sequences first, output
an alignment file and calculate the bootstraps in memory.  You can set
the number of bootstrap trials here (e.g./bootstrap=3D500).  You can set
the seed number for the random number generator with /seed=3Dn.
 
 
                PARAMETERS (set things)
***Pairwise alignments:***
/KTUP=3Dn      	:word size
/TOPDIAGS=3Dn  	:number of best diagonals
/WINDOW=3Dn    	:window around best diagonals
/PAIRGAP=3Dn   	:gap penalty
 
 
***Multiple alignments:***
/FIXEDGAP=3Dn  	:fixed length gap pen.

/FLOATGAP=3Dn  	:variable length gap pen.


/MATRIX=3D 	:PAM100 or ID or file name. The default weight matrix
for proteins is PAM 250.

/TYPE=3Dp or d 	:type is protein or DNA.  This allows you to
explicitely overide the programs attempt at guessing the type of the
sequence.  It is only useful if you are using sequences with a VERY
strange composition.

/OUTPUT=3D     	:GCG or PHYLIP or PIR.  The default output is Clustal format.

/TRANSIT 	:transitions not weighted.  The default is to weight
transitions as more favourable than other mismatches in DNA
alignments.  This switch makes all nucleotide mismatches equally
weighted.
 
***Trees:***
/KIMURA      	:use Kimura's correction on distances.
/TOSSGAPS    	:ignore positions with a gap in ANY sequence.
/SEED=3Dn      	:seed number for bootstraps.
 
 
 
EXAMPLES:

These examples use the VAX/VMS $ prompt; otherwise, command-line usage
is the same on all machines except the Macintosh.
 
$ clustalv proteins.seq      OR     $ clustalv /infile=3Dproteins.seq

Read whatever sequences are in the file "proteins.seq" and do a full
multiple alignment; output will go to the files: "proteins.dnd"
(dendrogram) and "proteins.aln" (alignment).
 
$ clustalv proteins.seq/ktup=3D2/matrix=3Dpam100/output=3Dpir

Same as last example but use K-Tuple size of 2; use a PAM 100 protein
weight matrix; write the alignment out in NBRF/PIR format (goes to a
file called "proteins.pir").
 
$ clustalv /profile1=3Dproteins.seq/profile2=3Dmore.seq/type=3Dp/fixed=3D11

Take the alignment in "proteins.seq" and align it with "more.seq"
using default values for everything except the fixed gap penalty which
is set to 11.  The sequence type is explicitely set to PROTEIN.
 
$ clustalv proteins.pir/tree/kimura

Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY) and
calculate a phylogenetic tree using Kimura's correction for distances.
 
$ clustalv proteins.pir/align/tree/kimura

Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE FIRST.
 
$ clustalv proteins.seq/align/boot=3D500/seed=3D99/tossgaps/type=3Dp

Take the sequences in proteins.seq; they are explicitely set to be
protein; align them; bootstrap a tree using 500 samples and a seed
number of 99.
 

4.  Algorithms and references.


In this section, we will try to BRIEFLY describe the algorithms used
in ClustalV and give references.  The topics covered are:
 
	-Multiple alignments
	-Profile alignments
	-Protein weight matrices
	-Phylogenetic trees
		-distances
		-NJ method
		-Bootstrapping
		-Phylip
	-References
 
 
MULTIPLE ALIGNMENTS.

The approach used in ClustalV is a modified version of the method of
Feng and Doolittle (1987) who aligned the sequences in larger and
larger groups according to the branching order in an initial
phylogenetic tree.  This approach allows a very useful combination of
computational tractability and sensitivity.

The positions of gaps that are generated in early alignments remain
through later stages.  This can be justified because gaps that arise
from the comparison of closely related sequences should not be moved
because of later alignment with more distantly related sequences.  At
each alignment stage, you align two groups of already aligned
sequences.  This is done using a dynamic programming algorithm where
one allows the residues that occur in every sequence at each alignment
position to contribute to the alignment score.  A Dayhoff (1978) PAM
matrix is used in protein comparisons.

The details of the algorithm used in ClustalV have been published in
Higgins and Sharp (1989).  This was an improved version of an earlier
algorithm published in Higgins and Sharp (1988).  First, you calculate
a crude similarity measure between every pair of sequence.  This is
done using the fast, approximate alignment algorithm of Wilbur and
Lipman (1983).  Then, these scores are used to calculate a "guide
tree" or dendrogram, which will tell the multiple alignment stage in
which order to align the sequences for the final multiple alignment.
This "guide tree" is calculated using the UPGMA method of Sneath and
Sokal (1973).  UPGMA is a fancy name for one type of average linkage
cluster analysis, invented by Sokal and Michener (1958).

Having calculated the dendrogram, the sequences are aligned in larger
and larger groups.  At each alignment stage, we use the algorithm of
Myers and Miller (1988) for the optimal alignments.  This algorithm is
a very memory efficient variation of Gotoh's algorithm (Gotoh, 1982).
It is because of this algorithm that ClustalV can work on
microcomputers.  Each of these alignments consists of aligning 2
alignments, using what we call "profile alignments".
 
PROFILE ALIGNMENTS.

We use the term "profile alignment" to describe the alignment of 2
alignments.  We use this term because the method is a simple extension
of the profile method of Gribskov, et al. (1987) for aligning 1
sequence with an alignment.  Normally, with a 2 sequence alignment,
you use a weight matrix (e.g. a PAM 250 matrix) to give a score
between the pairs of aligned residues.  The alignment is considered
"optimal" if it gives the best total score for aligned residues minus
penalties for any gaps (insertions or deletions) that must be
introduced.

Profile alignments are a simple extension of 2 sequence alignments in
that you can treat each of the two input alignments as single
sequences but you calculate the score at aligned positions as the
average weight matrix score of all the residues in one alignment
versus all those in the other e.g. if you have 2 alignments with I and
J sequences respectively; the score at any position is the average of
all the I times J scores of the residues compared seperately.  Any
gaps that are introduced are placed in all of the sequences of an
alignment at the same position.  The profile alignments offered in the
"profile alignment menu" are also calculated in this way.
 
PROTEIN WEIGHT MATRICES.

There are 3 built-in weight matrices used by clustalV.  These are the
PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity matrix.
Each matrix is given as the bottom left half, including the diagonal
of a 20 by 20 matrix.  The order of the rows and columns is
CSTPAGNDEQHRKMILVFYW.
 

PAM 250
C  12 
S   0  2 
T  -2  1  3 
P  -3  1  0  6 
A  -2  1  1  1  2 
G  -3  1  0 -1  1  5 
N  -4  1  0 -1  0  0  2 
D  -5  0  0 -1  0  1  2  4 
E  -5  0  0 -1  0  0  1  3  4 
Q  -5 -1 -1  0  0 -1  1  2  2  4 
H  -3 -1 -1  0 -1 -2  2  1  1  3  6 
R  -4  0 -1  0 -2 -3  0 -1 -1  1  2  6 
K  -5  0  0 -1 -1 -2  1  0  0  1  0  3  5 
M  -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6 
I  -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5 
L  -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6 
V  -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4 
F  -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9 
Y   0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10 
W  -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17 
----------------------------------------------------------------
    C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
 
IDENTITY MATRIX
  10  
  0 10  
  0  0 10  
  0  0  0 10  
  0  0  0  0 10  
  0  0  0  0  0 10  
  0  0  0  0  0  0 10  
  0  0  0  0  0  0  0 10  
  0  0  0  0  0  0  0  0 10  
  0  0  0  0  0  0  0  0  0 10  
  0  0  0  0  0  0  0  0  0  0 10  
  0  0  0  0  0  0  0  0  0  0  0 10  
  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
 
 
 
 

PAM 100
  14
  -1  6
  -5  2   7
  -6  1  -1  10
  -5  2   2   1   6
  -8  1  -3  -3   1   8
  -8  2   0  -3  -1  -1  7
 -11 -1  -2  -4  -1  -1  4   8
 -11 -2  -3  -3   0  -2  1   5   8
 -11 -3  -3  -1  -2  -5 -1   1   4   9
  -6 -4  -5  -2  -5  -7  2  -1  -2   4 11
  -6 -1  -4, -2  -5  -8 -3  -6  -5   1  1 10
 -11 -2  -1  -4  -4  -5  1  -2  -2  -1 -3  3  8
 -11 -4  -2  -6  -3  -8 -5  -8  -6  -2 -7 -2  1 13
  -5 -4  -1  -6  -3  -7 -4  -6  -5  -5 -7 -4  4  2  9
 -12 -7  -5  -5  -5  -8 -6  -9  -7  -3 -5 -7 -6  4  2  9
  -4 -4  -1  -4   0  -4 -5  -6  -5  -5 -6 -6 -6  1  5  1  8
 -10 -5  -6  -9  -7  -8 -6 -11 -11 -10 -4 -7-11 -2  0  0 -5 12
  -2 -6  -6 -11  -6 -11 -3  -9  -7  -9 -1-10-10 -8 -4 -5 -6  6 13
 -13 -4 -10 -11 -11 -13 -8 -13 -14 -11 -7  1 -9-11-12 -7-14 -2 -2 19
 
 
 
PHYLOGENETIC TREES.

There are two COMMONLY used approaches for inferring phylogentic trees
from sequence data: parsimony and distance methods. There are other
approaches which are probably superior in theory but which are yet to
be used widely. This does not mean that they are no use; we (the
authors of this program at any rate) simply do not know enough about
them yet.  You should see the documentation accompanying the Phylip
package and some of the references there for an explanation of the
different methods and what assumptions are implied when you use them.

There is a constant debate in the literature as to the merits of
different methods but unfortunately, a lot of what is said is
incomprehensible or inaccurate.  It is also a field that is prone to
having highly opinionated schools of thought.  This is a pity because
it prevents rational discussion of the pro's and con's of the
different methods.  The approach adopted in ClustalV is to supply just
one method and to produce alignments in a format that can be used by
Phylip.  In simple cases, the trees produced will be as "good"
(reliable, robust) as those from ANY other method.  In more
complicated cases, there is no single magic recipe that we can supply
that will work well in even most situations.

The method we provide is the Neighbor Joining method (NJ) of Saitou
and Nei (1987) which is a distance method.  We use this for three
reasons: it is conceptually and computationally simple; it is fast; it
gives "good" trees in simple cases. It is difficult to prove that one
tree is "better" than another if you do not know the true phylogeny;
the few systematic surveys of methods show it to work more or less as
well as any other method ON AVERAGE. Another reason for using the NJ
method is that it is very commonly used; THIS IS A BAD REASON
SCIENTIFICALLY but at least you will not feel lonely if you use it.

The NJ method works on a matrix of distances (the distance matrix)
between all pairs of sequence to be analysed.  These distances are
related to the degree of divergence between the sequences.  It is
normal to calculate the distances from the sequences after they are
multiply aligned.  If you calculate them from seperate alignments (as
done for the dendrograms in another part of this program), you may
increase the error considerably.
 
DISTANCES

The simplest measure of distance between sequences is percent
divergence (100% minus percent identity).  For two sequences, you
count how many positions differ between them (ignoring all positions
with a gap or an unknown residue) and divide by the number of
positions considered.  It is common practice to also ignore all
positions in the alignment where there is a GAP in ANY of the
sequences (Tossgaps ? option in the menu).  Usually, you express the
percent distance divided by 100 (gives distances between 0.0 and 1.0).

This measure of distance is perfectly adequate (with some further
modification described below) for rRNA sequences. However it treats
all residues identically e.g. all amino acid substitutions are equally
weighted. It also treats all positions identically e.g. it does not
take account of different rates of substitution in different positions
of different codons in protein coding DNA sequences; see Li et al
(1985) for a distance measure that does.  Despite these shortcomings,
these percent identity distances do work well in practice in a wide
variety of situations.

In a simple world, you would like a distance to be proportional to the
time since the sequences diverged.  If this were EXACTLY true, then
the calculation of the tree would be a simple matter of algebra (UPGMA
does this for you) and the branch lengths will be nice and meaningful
(times).  In practice this OBVIOUSLY depends on the existence and
quality of the "molecular clock", a subject of on- going debate.
However, even if there is a good clock, there is a further problem
with estimating divergences.  As sequences diverge, they become
"saturated" with mutations.  Sites can have substitutions more than
once.  Calculated distances will underestimate actual divergence
times; the greater the divergence, the greater the discrepancy.  There
are various methods for dealing with this and we provide two commonly
used ones, both due to Motoo Kimura; one for proteins and one for DNA.
 
For distance K (percent divergence /100 ) ...
Correction for Protein distances:  (Kimura, 1983).
       Corrected K =3D -ln(1.0 - K - (K * k/5.0))
 
 
Correction for nucleotide distances: Kimura's 2-parameter method 
(Kimura, 1980).

       Corrected K =3D 0.5*ln(a) + 0.25*ln(b)
       where     a =3D 1/(1 - 2*P - Q)        and       b =3D 1/(1 - 2*Q)
       P and Q are the proportions of transitions (A<-->G, C<-->T) and 
transversions occuring between the sequences.
 
One paradoxical effect of these corrections, is that distances can be
corrected to have more than 100% divergence.  That is because, for
very highly diverged sequences of length N, you can estimate that more
than N substitutions have occured by correcting the observed distance
in the above ways.  Don't panic!
 
 
NEIGHBOR JOINING TREES.

VERY briefly, the NJ method works as follows.  You start by placing
the sequences in a star topology (no internal branches).  You then
find that internal branch (take 2 sequences; join them; connect them
to the rest by the internal branch) which when added to the tree will
minimise the total branch length. The two joined sequences
(neighbours) are merged into a single sequence and the process is
repeated.  For an unrooted tree with N sequences, there are N-3
internal branches.  The above process is repeated N-3 times to give
the final tree.  The full details are given in Saitou and Nei (1987).

As explained elsewhere in the documentation, you can only root the
tree by one of two methods:

1) assume a degree of constancy in the molecular clock and place the
root along the longest branch (internal or external).  Methods that
appear to produce rooted trees automatically are often just doing this
without letting you know; this is true of UPGMA.

2) root the tree on biological grounds.  The usual method is to
include an "outgroup", a sequence that you are certain will branch to
the outside of the tree.
 
 
BOOTSTRAPPING.

Bootstrapping is a general purpose technique that can be used for
placing confidence limits on statistics that you estimate without any
knowledge of the underlying distribution (e.g. a normal or poisson
distribution).  In the case of phylogenetic trees, there are several
analytical methods for placing confidence limits on groupings
(actually on the internal branches) but these are either restricted to
particular tree drawing methods or only work on small trees of 4 or 5
sequences.  Felsenstein (1985) showed how to use bootstrapping to
calculate confidence limits on trees.  His approach is completely
general and can be applied to any tree drawing method.  The main
assumption of the method in this context is that the sites in the
alignment are independant; this will be true of some sequence
alignments (e.g. pseudogenes) but not others (e.g. rRNA's).  What
effect, lack of independance will have on the results is not known.

The method works by taking random samples of data from the complete
data set.  You compute the test statistic (tree in this case) on each
sample.  Variation in the statistic computed from the samples gives a
measure of variation in the statistic which can be used to calculate
confidence intervals.  Each random sample is the same size as the
complete data set and is taken WITH REPLACEMENT i.e. a data point can
be selected more than once (or not at all) in any given sample.

In the case of an alignment N residues long, each random sample is a
random selection of N sites form the alignment.  For each sample, we
calculate a distance matrix and tree in the usual way.  Variation in
the sample trees compared to a tree calculated from the full data set
gives an indication of how well supported the tree is by the data.  If
the sample trees are very similar to each other and to the full tree,
then the tree is "strongly" supported; if the sample trees show great
variation, then the tree will be weakly supported.  In practice, you
usually find some parts of a tree well supported, others weakly.  This
can be seen by counting how often each monophyletic group in the full
tree occurs in the sample trees.

For a particular grouping, one considers it to be significant at the
95% level (P <=3D 0.05) if it occurs in 95% of the bootstrap samples.
If a grouping is significant, it is significant with respect to the
particular data set and method used for drawing the tree.  Biological
"significance" is another matter.
 
PHYLIP.

The Phylip package was written by Joe Felsenstein, University of
Washington, USA.  It provides Pascal source code for a large number of
programs for doing most types of phylogenetic analyses.  The Phylip
format alignments produced by this program can be used by all of the
Phylip programs, version 3.4 or later (March 1991).  It is freely
available from him as follows.
 
 
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D PHYLIP information sheet =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D

    PHYLIP - Phylogeny Inference Package (version 3.3)

This is a FREE package of programs for inferring phylogenies and
carrying out certain related tasks.  At present it contains 28
programs, which carry out different algorithms on different kinds of
data.  The programs in the package are:

Programs for molecular sequence data 

PROTPARS  Protein parsimony 
DNAPARS   Parsimony method for DNA 
DNAMOVE   Interactive DNA parsimony 
DNAPENNY  Branch and bound for DNA 
DNABOOT   Bootstraps DNA parsimony 
DNACOMP   Compatibility for DNA 
DNAINVAR  Phylogenetic invariants 
DNAML     Maximum likelihood method 
DNAMLK    DNAML with molecular clock 
DNADIST   Distances from sequences 
RESTML    ML for restriction sites

Programs for distance matrix data 

FITCH     Fitch-Margoliash and least-squares methods 
KITSCH    Fitch-Margoliash and least squares methods with evolutionary clock

Programs for gene frequencies and continuous characters

CONTML    Maximum likelihood method 
GENDIST   Computes genetic distances

Programs for discrete state data 

MIX       Wagner, Camin-Sokal, and mixed parsimony criteria 
MOVE      Interactive Wagner, C-S, mixed parsimony program 
PENNY     Finds all most parsimonious trees by branch-and-bound 
BOOT      Bootstrap confidence interval on mixed parsimony methods 
DOLLOP, DOLMOVE, DOLPENNY, DOLBOOT   same as preceding four programs, but for 
the Dollo and  polymorphism parsimony  criteria 
CLIQUE    Compatibility method 
FACTOR    recode multistate characters

Programs for plotting trees and consensus trees 

DRAWGRAM  Draws cladograms and phenograms on screens, plotters and printers 

DRAWTREE  Draws unrooted phylogenies on screens, plotters and printers 

CONSENSE Majority-rule and strict consensus trees The package includes
extensive documentation files that provide the information necessary
to use and modify the programs.

COMPATIBILITY: The programs are written in a very standard subset of
Pascal, a language that is available on most computers (including
microcomputers). The programs require only trivial modifications to
run on most machines: for example they work with only minor
modifications with Turbo Pascal, and without modifications on VAX VMS
Pascal. Pascal source code is distributed in the regular version of
PHYLIP: compiled object code is not.  To use that version, you must
have a Pascal compiler.

DISKETTE DISTRIBUTION: The package is distributed in a variety of
microcomputer diskette formats.  You should send FORMATTED diskettes,
which I will return with the package written on them.  Unfortunately,
I cannot write any Apple formats.  See below for how many diskettes to
send.  The programs on the magnetic tape or electronic network
versions may of course also be moved to microcomputers using a
terminal program.

PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS
systems are available from me.  Specify the "PCDOS executable version"
and send the number of extra diskettes indicated below.  An Apple
Macintosh version with precompiled code is available from Willem
Ellis, Instituut voor Taxonomische Zoologie, Zoologisch Museum,
Universiteit van Amsterdam, Plantage Middenlaan 64, 1018DH Amsterdam,
Netherlands, who asks that you send 5 800K diskettes.

HOW MANY DISKETTES TO SEND: The following table shows for different
PCDOS formats how many diskettes to send, and how many extra diskettes
to send for the PCDOS executable version:

Diskette size	Density	  For source code	For executables, send additional
3.5 inch	1.44 Mb		2				1
5.25 inch	1.2 Mb		2				2
3.5 inch	720 Kb		4				2
5.25 inch	360 Kb		7				4

Some other formats are also available. You MUST tell me EXACTLY which
of these formats you need.  The diskettes MUST be formatted by you
before being sent to me. Sending an extra diskette may be helpful.

NETWORK DISTRIBUTION: The package is also available by distribution of
the files directly over electronic networks, and by anonymous ftp from
evolution.genetics.washington.edu.  Contact me by electronic mail for
details.

TAPE DISTRIBUTION: The programs are also distributed on a magnetic
tape provided by you (which should be a small tape and need only be
able to hold two megabytes) in the following format: 9-track, ASCII,
odd parity, unlabelled, 6250 bpi (unless otherwise indicated).
Logical record: 80 bytes, physical record: 3200 bytes (i.e. blocking
factor 40). There are a total of 71 files. The first one describes the
contents of the package.

POLICIES: The package is distributed free.  I do not make it available
or support it in South Africa.  The package will be written on the
diskettes or tape, which will be mailed back.  They can be sent to:

Joe Felsenstein Electronic mail addresses: Department of Genetics
SK-50 Internet: joe@genetics.washington.edu University of Washington
Bitnet/EARN: felsenst@uwavm Seattle, Washington 98195 UUCP:
uw-beaver!evolution.genetics!joe U.S.A.
 
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D End of Phylip Info. Sheet =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=

 
 
 
REFERENCES.

Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of
Protein Sequence and Structure, Vol. 5 supplement 3, Dayhoff, M.O.
(ed.), NBRF, Washington, p. 345.

Felsenstein, J. (1985) Confidence limits on phylogenies: an approach
using the bootstrap.  Evolution 39, 783-791.

Feng, D.-F. and Doolittle, R.F. (1987) Progressive sequence alignment
as a prerequisite to correct phylogenetic trees.  J.Mol.Evol. 25,
351-360.

Gotoh, O. (1982) An improved algorithm for matching biological
sequences.  J.Mol.Biol. 162, 705-708.

Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987) Profile
analysis: detection of distantly related proteins. PNAS USA 84,
4355-4358.

Higgins, D.G. and Sharp, P.M. (1988) CLUSTAL: a package for performing
multiple sequence alignments on a microcomputer.  Gene 73, 237-244.

Higgins, D.G. and Sharp, P.M. (1989) Fast and sensitive multiple
sequence alignments on a microcomputer.  CABIOS 5, 151-153.

Kimura, M. (1980) A simple method for estimating evolutionary rates of
base substitutions through comparative studies of nucleotide
sequences. J. Mol. Evol. 16, 111-120.

Kimura, M. (1983) The Neutral Theory of Molecular Evolution.
Cambridge University Press, Cambridge, England.

Li, W.-H., Wu, C.-I. and Luo, C.-C. (1985) A new method for estimating
synonymous and nonsynonymous rates of nucleotide substitution
considering the relative likelihood of nucleotide and codon changes.
Mol.Biol.Evol. 2, 150-174.

Myers, E.W. and Miller, W. (1988) Optimal alignments in linear space.
CABIOS 4, 11-17.

Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological
sequence comparison.  PNAS USA 85, 2444-2448.

Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new
method for reconstructing phylogenetic trees.  Mol.Biol.Evol. 4,
406-425.

Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy.  Freeman,
San Francisco.

Sokal, R.R. and Michener, C.D. (1958) A statistical method for
evaluating systematic relationships.  Univ.Kansas Sci.Bull. 38,
1409-1438.

Vingron, M. and Argos, P. (1991) Motif recognition and alignment for
many sequences by comparison of dot matrices.  J.Mol.Biol. 218, 33-43.

Wilbur, W.J. and Lipman, D.J. (1983)  Rapid similarity searches of nucleic acid and protein data banks.  PNAS USA 80, 726-730.