GeneDoc User Manual

Running and Using GeneDoc

GeneDoc installs as a regular Windows program, into a GeneDoc program group. After starting GeneDoc, use File/Open to read in an MSF (Multiple Sequence Format) file. GeneDoc saves configuration information for these files in the comment section of the MSF file, so a file that was saved by GeneDoc reopens with the settings it was last saved with.

Reading/Importing Data

You can import non-MSF files. Use the File/New menu, then select File/Import. The import dialog allows import from the clipboard, disk files or manual input. Clustal, FASTA and a few other formats can be read this way.

GeneDoc Web Viewer

GeneDoc supports a command line argument as a file name. This allows GeneDoc to be run from another program, such as a web browser or database program.
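As a minimal sketch of launching GeneDoc from another program, the Python below builds the command line (the executable followed by the file name) and starts it. The install path and the file name are hypothetical placeholders, not values taken from this manual.

```python
import subprocess
from pathlib import Path

# Hypothetical install location; adjust to wherever GeneDoc is installed.
GENEDOC_EXE = Path(r"C:\Program Files\GeneDoc\GeneDoc.exe")


def genedoc_command(alignment_file, exe=GENEDOC_EXE):
    """Build the argument list: the executable followed by the file name."""
    return [str(exe), str(alignment_file)]


def open_in_genedoc(alignment_file, exe=GENEDOC_EXE):
    """Start GeneDoc on the given file without waiting for it to exit."""
    return subprocess.Popen(genedoc_command(alignment_file, exe))
```

Calling open_in_genedoc("globins.msf") would start GeneDoc on that file, assuming the executable exists at the path given.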

DDE support

GeneDoc provides minimal DDE support: File Open and Print. This allows you to set up Windows to have GeneDoc open or print files automatically by double clicking or right clicking on an MSF file icon.

GeneDoc Display Modes

GeneDoc has several view modes. These are accessed through the Windows menu or Project toolbar. The Alignment view, Summary view and Tree view can be opened and used at any time. The Report view and Plot view will be empty or not able to open if you have not selected a report to show or a plot to view.

Alignment View

When GeneDoc is first opened, you will see the Alignment view. The Alignment view should be considered the primary view of GeneDoc. In this view, you can see the properties of each residue in the alignment with the various shading modes, and you can arrange the residues to improve your alignment. You can also score various sections of the alignment and add manual comments and manual shading.

Summary View

A useful view is the Summary view. This view shows you the presence or absence of a residue, but not its value, thus the alignment is viewed in a compressed fashion. GeneDoc will also apply shading to the Summary view. You can control the summary view settings in the Project Configuration dialog. GeneDoc can compress the Summary View up to one dot per column, so on a high resolution printer the amount of compression can be considerable.

Tree View

GeneDoc provides a phylogenetic tree view. This view can be used to construct the phylogenetic tree with a point and click interface, and the Manage Expression dialog will edit, load and save phylogenetic information in Vermont syntax. GeneDoc does not attempt to compute any phylogenetic information for the alignment, though it will use it to score a selected set of columns. The view is used only to build and view phylogenetic trees.

Report View

GeneDoc creates several text reports, and shows them in the Report view. The Report view holds the information, but does not save it on the disk. There is an entry in the Reports menu for saving the report to a text file of your choice.

Plot View

GeneDoc displays a few Cumulative Distribution plots in the Plot view. At its simplest, the plot view can be used to show Percent Identity or Percent Favorable Substitutions for the alignment as a whole. Using the Super Family Groups of GeneDoc, the plot can be used to show that the scoring within groups is significantly better than scoring between different groups, thus demonstrating that your super family groups are well chosen. These functions are controlled through the Plot menu.

Gel View

GeneDoc displays a simple Enzyme Gel Simulation view. This view contains two lists, the list of sequences and the list of loaded enzymes. Select multiple sequences and multiple enzymes and click the ‘Run Gel’ button. The resulting sequence fragments are plotted on the view on a log scale.
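The idea behind such a gel simulation can be sketched in a few lines of Python: cut a sequence at each enzyme recognition site, then place the fragment lengths on a log scale. This is an illustrative sketch only. The function names, the linear-sequence assumption and the lane limits are assumptions, not GeneDoc's own implementation.

```python
import math


def digest(sequence, site, cut_offset):
    """Cut a linear sequence at every occurrence of `site`.

    `cut_offset` is the number of bases of the site left on the upstream
    fragment (1 for EcoRI, which cuts G^AATTC). Returns fragment lengths.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def gel_positions(lengths, shortest=1, longest=10000):
    """Map lengths to 0..1 down the lane on a log scale (short runs far)."""
    span = math.log10(longest) - math.log10(shortest)
    return [(math.log10(longest) - math.log10(n)) / span for n in lengths]


# Made-up sequence with two EcoRI sites.
seq = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TT"
fragments = digest(seq, "GAATTC", 1)
positions = gel_positions(fragments)
```

Shorter fragments get larger positions, which mimics their faster migration through the gel.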

Project Settings

GeneDoc has a rich set of Project Configuration settings. While some of these settings are controlled through menus, all of the settings are found in the Configuration Dialog. This dialog is accessed either through the Project menu or the Project toolbar.

Configuration Dialog

The Configuration Dialog holds ten tabs. Each tab holds related GeneDoc settings, described by the tab title. Tab functions fall into three groups: Project Setup, Print Control and Shading Control. The first tab, Project, controls font size, consensus lines, alignment blocking and other settings that apply to every display. The Print tab controls printer margins, page headers, footers, numbers and the like. The Shade tab mirrors many of the entries found in the Shade menu, with a few other settings for the conserved and quantified shading styles. The Scoring tab allows you to select which Dayhoff or PAM scoring tables and substitution groups you want to use. The rest of the tabs control individual shading modes: Properties, Physiochemical properties, Pattern Search, Log Odds, Identities and Structure. All aspects of these display modes are controlled through these configuration tabs. Here is where you change colors, add, edit and delete patterns or properties, and load data files for display modes. The configuration dialog does not have anything to do with manual sequence arranging, though scoring settings can be controlled here.

Sequence Edit Dialog

The Project menu also holds the Edit Sequences dialog. In this dialog, sequences can be added, imported or deleted. You can Complement, Reverse and Duplicate sequences here. Comments about the sequences can be entered. Sequence weights, which are used by the Log Odds displays, can also be changed.

Project Titling Facility

The Project menu also has the Titling Facility. The titling facility gives you a convenient way to enter comments at the top of the MSF file. These comments are not saved in the usual GeneDoc encoded header, but above it in ASCII text, so anyone or any program will have access to them.

Save and Load User Defaults

Save and Load User Defaults is a way to save the current settings as GeneDoc’s default settings. These defaults apply when you open an MSF file that has not been previously saved by GeneDoc. If you want to apply these settings to a file that already has GeneDoc settings, load the file and then use Load User Defaults. The loaded settings will replace whatever GeneDoc’s current settings are.

Arranging Alignments

Arranging alignments is a primary function of GeneDoc. There are many features provided to help accomplish this task. Both mouse and keyboard operations are supported. GeneDoc’s Grab and Drag arrangement mode allows you to move residues around like beads on a string. You can Slide residues, which only inserts or deletes gaps immediately in front of the selected residue, preserving other gap placements. You can insert and delete a gap with either the mouse or the INS and DEL keys. You can insert gaps into every sequence other than the one clicked on, and insert columns of gaps. GeneDoc also provides the ability to select groups of sequences to work on. See Tips on Arrangement and Alignment View.

Using The Mouse

Using the mouse for arrangements gives you quick adjustment abilities. Grab and Drag and Slide modes work by clicking on the desired residue and while holding the mouse button down moving the mouse back and forth. Other modes, such as insert single gaps or insert gaps into other sequences, use a single mouse click. You can select columns for certain alignment functions by enabling the column select mode then clicking on the start and end columns to be selected. Moving the mouse over the alignment updates the location indicator in the lower right corner of the GeneDoc window. This will show you what sequence and residue number the mouse is positioned over.

Using the Keyboard

The keyboard is very useful for arrangement of alignments. The keyboard is more precise than the mouse, especially if the mouse is dirty or jumpy. The keyboard can be slower than the mouse, but every effort has been made to help keep use of the keyboard efficient. There are a few shortcut keys that are helpful as well. After you become more familiar with GeneDoc, the keyboard is often a faster way to work with GeneDoc than the mouse. See Tips on Arrangement and Alignment View.

Additional Arrangement Functions

In addition to the movement of residues for arrangement purposes, there are a few other helpful functions. You can select a range of columns to be deleted. As mentioned before, you can select a set of sequences to be worked upon. When you select sequences, clicking within the selected set performs work upon the selected set. When you click outside of the selected set, work is performed upon the unselected set.

Copy To/From Other Projects

A rather useful function, for specific cases, is the ability to copy a range of selected columns to another project, edit just that range, then replace them back into the original project. This can be useful for a variety of purposes, from manually arranging a restricted area, to exporting an area of the alignment to be realigned or aligned with a different alignment program.

Editing Functions for DNA Projects

There are a few editing functions specific to DNA projects as well. These include complementing and reversing sequences, copying data between sequences, and creating a FASTA style output of the consensus row. A specific use of the Properties mode is very helpful for DNA projects as well. If the configuration dialog has the project set to a non protein type, then Property Shading Mode level 3 is preconfigured to shade each nucleotide in a unique color.

DNA Ambiguity Support

GeneDoc has built in ambiguity support for DNA projects. This support is used when determining conservation for columns and wherever else is appropriate.
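To illustrate what ambiguity support means when judging column conservation, here is a small Python sketch using the standard IUPAC nucleotide ambiguity codes. The rule used (a column counts as conserved if at least one base is allowed by every code in it) is an assumption made for illustration. This manual does not spell out the exact rule GeneDoc applies.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(a, b):
    """True if the two codes share at least one possible base."""
    return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))


def column_is_conserved(column):
    """True if at least one base is allowed by every code in the column."""
    shared = set("ACGT")
    for code in column:
        shared &= set(IUPAC[code.upper()])
    return bool(shared)
```

For example, a column holding A, R and N is conserved under this rule because all three codes allow A, while a column holding A, R and Y is not.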

DNA/Protein Translation/ReGap

A DNA to Protein translation feature is incorporated into GeneDoc. You can choose from a set of translation tables or modify one to make your own. There are several Frame Options for the translation process. If you make arrangement changes to a Protein alignment, you can ReGap the corresponding DNA project. Using the cycle of Translate/ReGap, you can keep both the DNA and Protein projects in sync.
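The ReGap idea can be sketched as threading codons back through a gapped protein row: each residue takes the next codon, and each protein gap becomes three DNA gaps. The Python below is a sketch under simplifying assumptions (frame 1, no partial codons, "-" as the gap character). The function name is hypothetical, and GeneDoc's own handling of frames and stop codons is not described here.

```python
def regap_dna(gapped_protein, ungapped_dna, gap="-"):
    """Re-insert gaps into a DNA sequence to match a gapped protein row.

    Each residue consumes the next codon and each protein gap becomes
    three DNA gaps. Raises StopIteration if the DNA runs out of codons.
    """
    codons = [ungapped_dna[i:i + 3] for i in range(0, len(ungapped_dna), 3)]
    next_codon = iter(codons)
    out = []
    for residue in gapped_protein:
        if residue == gap:
            out.append(gap * 3)
        else:
            out.append(next(next_codon))
    return "".join(out)
```

For instance, regap_dna("M-K", "ATGAAA") yields "ATG---AAA", so the DNA row stays aligned codon by codon with the rearranged protein row.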

Scoring while Arranging

GeneDoc provides scoring functions for the alignment view. Scores can be calculated for a selected range, presumably a range of columns you are rearranging, and then have the area recalculated conveniently after you have made arrangement changes. This is intended to give you an objective basis for determining whether your alignment changes are actually better or not. Scoring can be done with Sum of Pairs, Phylogenetic Tree, or Log Odds methods. For larger numbers of sequences, phylogenetic scoring can be rather slow.
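As a rough illustration of Sum of Pairs scoring over a selected column range, the Python sketch below adds up a pairwise score for every pair of residues in every selected column. The gap handling (a fixed gap score), the toy identity score and the function names are assumptions for the example. GeneDoc uses its own scoring tables and may treat gaps differently.

```python
from itertools import combinations


def sum_of_pairs(rows, first_col, last_col, score, gap="-", gap_score=0):
    """Sum pairwise scores over columns first_col..last_col (0-based, inclusive)."""
    total = 0
    for col in range(first_col, last_col + 1):
        column = [row[col] for row in rows]
        for a, b in combinations(column, 2):
            if a == gap or b == gap:
                total += gap_score
            else:
                total += score(a, b)
    return total


def identity_score(a, b):
    """Toy scoring function: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


rows = ["ACD-", "ACE-", "GCDK"]
before = sum_of_pairs(rows, 0, 3, identity_score)
```

Recomputing the same range after an arrangement change and comparing the two totals gives the kind of objective check described above.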

Scoring Tables

Several Dayhoff and PAM similarity scoring tables are built into GeneDoc. These tables are used anywhere a score between two residues is needed. For example, while the conserved and quantify display modes depend mostly on the residue counts within a column, ties are broken by determining which of a set scores better against the rest of the column. The scoring tables are also used in the scoring features of GeneDoc.

Similarity Tables

Similarity tables have also been built for each scoring table. These similarity tables represent selections of favorable substitutions of residues for a given scoring table. These tables can be altered and saved in user defaults. They are used in the conserved and quantify display modes. When used, they show conservation of favorable substitutions as well as identities.

Find and Replace Functions

GeneDoc also includes Find and Replace functions. The Find function is the more useful, as the replace is simple and replaces only segments of the same length. The Find function has features for mismatches and insertions and deletions, and you can select which sequences to search. There are also functions for moving the cursor to the found location or shading all found locations.
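A search that tolerates mismatches can be sketched as a sliding window that counts differing positions. The Python below is a simplified illustration that handles mismatches only, not insertions or deletions. The function name is hypothetical, and this is not GeneDoc's search code.

```python
def find_with_mismatches(sequence, pattern, max_mismatches=0):
    """Return 0-based start positions where pattern matches sequence
    with at most max_mismatches differing characters."""
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits
```

Searching "ACGTACCT" for "ACGT" finds only position 0 with no mismatches allowed, and positions 0 and 4 when one mismatch is allowed.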

Manual Editing

You can enter comments, change residues and apply manual shading within your alignment. These changes are saved for you. Manual comments can be entered on any non sequence line, and will overwrite any other characters which may be displayed on those lines, such as column number or consensus. Manual shading is done with the mouse, by selecting a color and clicking and dragging over any residues you want shaded. Comment lines may not be shaded. You can change the values of residues in your alignment as well. These changes will change the data in your MSF file, so you should have good reason for doing this.

Shading Alignments

Shading alignments is the other major function of GeneDoc. There are quite a few shading modes, and with super family group functions, the possibilities are extended quite a bit. The different shading modes each represent application of different algorithms to the alignment. Each shading mode is controlled in the Configuration Dialog, as previously mentioned. Many of the toolbar buttons of the alignment view control the shading modes. You can also apply differences mode to the alignment view. This mode changes the display of residues, but not the shading. It highlights where residues are different from the consensus of residues for the column.

Preconfigured Shading Modes

There are three shading modes that don’t require any setup by the user, though they can be adjusted. These are Conserved, Quantified, and Physiochemical display modes. The Conserved and Quantified modes are based on the identity of residues, the similarity tables, and the score table for breaking ties in identity. These modes show percentage of conservation within a column in the alignment. Physiochemical display mode is based on the physical and chemical properties of amino acids and identifies where those properties are conserved within an alignment. These groups were originally proposed and presented as a Venn diagram by W.R. Taylor (1986), though GeneDoc has changed them somewhat.

User Configured Shading Modes

Two other shading modes used by GeneDoc require some user setup. These are Properties and Identities. Properties mode is actually configured for some obvious chemical properties by default, but it is expected that you would want to set up your own property groups. Properties mode is useful for highlighting sets of amino acids, either conserved or wherever found. For example, you could enter a set of hydrophobic amino acids and have them highlighted. Identities mode compares the identity of one or more sequences to the rest of the alignment, so you have to choose one or more sequences to get any output from this display mode. If more than one sequence is in the chosen set, then the chosen set of sequences must be conserved before they will be compared to the rest of the alignment.

Shading With External Files

There are three shading modes that require external files to be loaded into GeneDoc for them to function. These are Search, Log Odds and Structure modes. Files that need to be loaded would be ReBase or ProSite Files, Meme or similarly formatted Log Odds matrices, and Protein Structure files such as PDB or DSSP files. GeneDoc will reload structure files automatically whenever the alignment is reopened if the file is located in the same disk directory as the MSF file. Search mode is configurable by the user, but you will need to enter search strings in ReBase or ProSite syntax.

Search: ProSite and ReBase files

In Search mode, you load a ReBase or ProSite file and GeneDoc locates enzymes or motifs found in your alignment and highlights them. The configuration dialog then allows you to delete or disable patterns, change coloring, etc. Search mode stores the located patterns in your alignment, so you do not need to keep a copy of the ReBase or ProSite file with the alignment. You can also export a list of found Enzymes in ReBase syntax for your alignment.
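To give a feel for ProSite pattern syntax, the Python sketch below converts a ProSite style pattern into a regular expression and searches a sequence with it. It covers only the common elements (x, [..], {..}, repeat counts, and the < and > anchors) and is an illustrative assumption of how such a search could work, not GeneDoc's own parser. ReBase syntax is not covered.

```python
import re


def prosite_to_regex(pattern):
    """Translate a ProSite style pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        if element.startswith("<"):          # anchored at the N-terminus
            regex.append("^")
            element = element[1:]
        anchor_end = element.endswith(">")   # anchored at the C-terminus
        if anchor_end:
            element = element[:-1]
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) and (n,m) are repeat counts.
        element = element.replace("(", "{").replace(")", "}")
        # Lowercase x means any residue.
        element = element.replace("x", ".")
        regex.append(element)
        if anchor_end:
            regex.append("$")
    return "".join(regex)


motif = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(motif, "AANGSA")
```

Here the pattern N-{P}-[ST]-{P} becomes N[^P][ST][^P], and the search finds the motif starting at position 2 of the example sequence.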

Log Odds: Meme files

In Log Odds mode, motifs described by Log Odds arrays, such as those produced by the Meme program maintained at the San Diego Supercomputer Center (www.sdsc.edu/MEME), are loaded into GeneDoc and used to highlight your alignment. Again, the Configuration Dialog contains several options for controlling usage of the motifs and shading of the alignment. You can then shade the alignment based on the motifs found in the Log Odds file.

Secondary Structure Files

Structure shading mode uses several different external file types. A file type can contain more than one set of information. For example, a PDB file can be read in and its secondary structure information applied to the sequence in the alignment that corresponds to the PDB file. DSSP files from EMBL can be read as well. The output of some EMBL sponsored programs that predict secondary structure can also be read in, and the prediction or other information, such as prediction probability, can be applied to the appropriate sequence.

User Defined Files

While this shading mode is called Structure mode, any information can be used, such as accessibility. GeneDoc reads in PSDB files, which are derived from the PDB information, and contain quite a large set of calculations that may be applied to a sequence in the alignment. More than one file type can be applied to each sequence, and each sequence can have file types applied to it. There is a scheme for allowing the user to define a custom data set and read these user files into GeneDoc for shading. This allows considerable flexibility, though at the expense of some initial effort.

Super Family Group Features

GeneDoc supports Super Family Groups. It does this first by allowing you to define groups of sequences within the alignment as belonging to the same family through the Group Configuration Dialog. In this dialog, you can apply a subset of the shading modes to each of the groups. For example, you could show conserved shading for each group in the alignment. The Shade Group Configuration function switches you between non group and group shading.

Specialized Super Family displays

Additionally, there are a number of shading functions specialized for super family groups. These functions typically contrast scores or identities between the groups, or shade across all groups before shading within groups. These specialized shading modes can be specific to either Protein or DNA alignments as well.

Plots

Some analysis functions are available that plot the results in graphical format. These tell about inter sequence identities or scores in graphical format. This can be convenient for alignments with a lot of sequences, since the equivalent report would be too large to be read or printed easily. GeneDoc will also use its score tables to show you favorable substitution levels as well as identity levels.

The DStat function

You can do a more complicated super family group analysis, using statistical functions to show the validity of your groupings. After the super family groups have been set up, you can select a range and make a DStat plot. You will get two lines on your plot, one for scores within groups and one for scores between groups. If your groups are properly selected, they will show statistical differences.
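One common way to compare two cumulative distributions is the two-sample Kolmogorov-Smirnov D statistic, the largest vertical gap between the two curves. Whether GeneDoc's DStat function computes exactly this quantity is an assumption here. The Python below is only a sketch of the general idea, with a hypothetical function name and made-up scores.

```python
from bisect import bisect_right


def ks_d_statistic(sample_a, sample_b):
    """Largest absolute difference between two empirical cumulative
    distributions (two-sample Kolmogorov-Smirnov D)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Made-up pairwise scores, for illustration only.
within_groups = [1, 2, 3, 4]
between_groups = [3, 4, 5, 6]
d = ks_d_statistic(within_groups, between_groups)
```

A D value near 0 means the within-group and between-group curves nearly coincide, while a value near 1 means they are almost completely separated.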

Reports: Stats, Score, Composition

GeneDoc provides a few reports for your alignment. A common Statistics report is given, though GeneDoc also includes favorable substitutions as determined by the current scoring table in this report. GeneDoc shows a score report, which gives scores between sequences in your alignment. There is a Base Composition report. There is a specialized enzyme report based on the sequence fragments identified by using the ReBase Search mode that sorts fragments lengths, finds unique ones, and identifies unique fragment lengths across super family groups. You can select a range of columns and compute a Log Odds matrix for it.

Exporting and Copying Figures

GeneDoc provides many ways to write shaded (or non shaded) alignments to the clipboard or a file for use in other programs. You must first select a block within your alignment. GeneDoc typically breaks up the alignment into blocks and displays them vertically on the screen for scrolling, and after you select one or more blocks you can then copy or export them. Export types include RTF, PICT, Meta Files, HTML, Bitmap and Text. Every attempt has been made to include all shading and other display features, though of course formats like text won’t support such.

Getting postscript file output

There is no built in support for PostScript. You should find it convenient enough to load a postscript driver and set the output to a file for creating PostScript files.

Toolbars

GeneDoc puts many common functions into Toolbars. There are a couple sets of toolbars. The two most common are the Project Toolbar and the Alignment View Toolbar.

Project Toolbar

The Project Toolbar is the upper toolbar and is always visible regardless of which view you are showing. This toolbar shows functions that are applicable for whatever view, such as file save and print. The Project Toolbar has buttons for the Configuration Dialog, the Edit Sequence List Dialog and the Group Configuration Dialog. The Project Toolbar also has buttons for switching between the various views of GeneDoc.

Alignment Toolbar

The Alignment Toolbar controls features of the Alignment view. The Alignment view is where most of the work gets done in GeneDoc. Mainly, buttons here control the shading and arranging features of the alignment view. This toolbar is only visible when the alignment view is active.

Tree Toolbar

The Tree View Toolbar has a few buttons for use in constructing phylogenetic relationships with the GeneDoc interface. GeneDoc provides a graphical interface for creating and deleting nodes of the tree.

Tips on Using the Toolbar

The toolbar can be used to access the configuration dialog in a quick fashion. If you click on a display mode, such as conserved, then GeneDoc will switch to that shading mode. If you click on the conserved toolbar button again, the configuration dialog will be opened with the conserved setup tab selected. See Toolbar Tips.

RasMol Scripts

GeneDoc has the ability to create simple scripts for the RasMol program. These scripts can be imported into RasMol and used to color the molecule with the shading done in GeneDoc. This is useful for coloring molecules in ways RasMol does not support, or applying the information of an alignment to the molecule for visualization purposes.

GeneDoc’s Similarity Tables

When Similarity Groups are enabled, GeneDoc assigns amino acids to substitution groups, groups of amino acids that are treated as if they are equivalent to each other, for the purpose of measuring the degree of conservation in each column of the alignment. We have attempted to place the selection of members of each equivalence group on an objective and rational basis. Thus the members of each equivalence group are a set of amino acids that have mutually positive scores in the similarity representation of the scoring matrix selected for use with an alignment. The default scoring matrix is the Blosum 62 matrix and thus the default equivalence groups reflect the scores in this matrix. Note that GeneDoc, in order to properly deal with multiple sequence alignments, uses the similarity matrices in their distance representation. The discussion below is in terms of the similarity representation and the matrices below are shown in the similarity form, while those displayed in GeneDoc will be in their distance representation. The two forms can be interconverted with the equation:

Distance(i,j) = Maximum Similarity – Similarity(i,j).

Where (i,j) designates the score for a specific pair of amino acids.
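The conversion can be written in a few lines of Python. This sketch assumes that "Maximum Similarity" means the largest score in the table being converted. The dictionary layout and function name are illustrative, and the four sample scores are taken from the PAM 250 table shown later in this section.

```python
def to_distance(similarity):
    """Convert similarity scores to distances:
    Distance(i,j) = Maximum Similarity - Similarity(i,j)."""
    max_similarity = max(similarity.values())
    return {pair: max_similarity - s for pair, s in similarity.items()}


# A few PAM 250 similarity scores, keyed by residue pair.
pam250_subset = {("C", "C"): 12, ("C", "W"): -8, ("W", "W"): 17, ("F", "Y"): 7}
distances = to_distance(pam250_subset)
```

With these values the maximum similarity is 17 (Trp with Trp), so Trp-Trp has distance 0 and the poorly scoring Cys-Trp pair has distance 25.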

The choice of similarity matrix determines both the pattern and the extent of substitutions that will be considered as favorable in evaluating the alignment.

In order to objectively evaluate an alignment one must have a quantification of whether the substitution of one amino acid for another is likely to conserve the physical and chemical properties necessary to maintain the structure and function of the protein or is more likely to disrupt essential structural and functional features of the protein. Numerous bases have been used in creating similarity tables: explicit or implicit (empirical) evolutionary models, structural properties such as Chou-Fasman propensities, chemical properties such as charge, polarity, and shape, as well as combinations like those used in the Structural-Genetics Matrix. Regardless of the underlying bases, all similarity tables are attempts to quantify whether a mutation preserves or disrupts the function of a protein.

Similarity scores used in GeneDoc are based on observed substitutions of one amino acid (or nucleotide) for another in homologous proteins or genes. Similarity scores organize the observations into scores that contrast the observed pattern of substitutions in homologous proteins with the random pattern of substitutions we would expect to observe in unrelated proteins. Modern similarity scores, computed as log-odds scores, have been shown to be the most efficient way to use the observed substitution data to detect homologous sequences.

If the replacements are favored during evolution (i.e. a conservative replacement) the similarity score will be greater than zero and if there is selection against the replacement (i.e., a nonconservative replacement) the similarity score will be less than zero. Thus similarity scores above zero indicate that two amino acids replace each other more often during evolution than we would expect if the replacements were random. Likewise, similarity scores below zero indicate that amino acids replace each other less often than we would expect if the replacements were random.
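The sign behavior described above follows directly from the log-odds form: the score is the logarithm of the observed pair frequency divided by the frequency expected by chance. The Python below is a minimal sketch of that calculation. The base 2 logarithm, the scale factor and the example frequencies are assumptions chosen for illustration, not values from any published matrix.

```python
import math


def log_odds(observed_pair_freq, freq_a, freq_b, scale=1.0):
    """Log-odds score for an ordered residue pair.

    observed_pair_freq: how often the pair is seen aligned in related sequences.
    freq_a, freq_b: background frequencies of the two residues, so that
    freq_a * freq_b is the pair frequency expected from random replacement.
    """
    expected = freq_a * freq_b
    return scale * math.log2(observed_pair_freq / expected)


# Made-up frequencies, for illustration only.
favored = log_odds(0.02, 0.1, 0.1)      # seen twice as often as expected
disfavored = log_odds(0.005, 0.1, 0.1)  # seen half as often as expected
```

A pair seen more often than chance scores above zero, and a pair seen less often than chance scores below zero.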

Differences in the way replacements are counted is one of the biggest differences between the two most widely used families of similarity matrices, the PAM matrices and the more recently developed Blosum matrices. The PAM matrices use counts derived from an explicitly tree like, branching evolutionary model. The Blosum matrices use counts directly derived from highly conserved blocks within an alignment.

Margaret Dayhoff and her co-workers performed the first careful, systematic study to create the first amino acid similarity matrix, the Point Accepted Mutation (PAM) similarity matrix. In computing the PAM matrices the alignment was created from a limited set of closely related sequences. The alignment was a global alignment, that is, it encompassed the entire length of the sequences. Thus both highly conserved regions and highly variable regions are included in the alignments and used in counting replacements.

The PAM 250 matrix, originally created by Margaret Dayhoff, is shown below. This matrix is appropriate for searching for alignments of sequences that have diverged by 250 PAMs, that is, 250 mutations per 100 amino acids of sequence. Because of back mutations and silent mutations this corresponds to sequences that are about 20 percent identical.

PAM 250 Amino Acid Similarity Matrix

C  12
G  -3   5
P  -3  -1   6
S   0   1   1   1
A  -2   1   1   1   2
T  -2   0   0   1   1   3
D  -5   1  -1   0   0   0   4
E  -5   0  -1   0   0   0   3   4
N  -4   0  -1   1   0   0   2   1   2
Q  -5  -1   0  -1   0  -1   2   2   1   4
H  -3  -2   0  -1  -1  -1   1   1   2   3   6
K  -5  -2  -1   0  -1   0   0   0   1   1   0   5
R  -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V  -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M  -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I  -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L  -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F  -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y   0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W  -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
    C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W

The PAM 250 matrix above has been arranged so that similar amino acids are close to each other. This gives rise to regions along the diagonal of the matrix that contain only positive scores. These regions provide an objective basis for defining conservative substitutions, namely as amino acids that replace each other more frequently than would be expected from random replacements. Note that the amino acids that make up these regions can change at different levels of sequence divergence, that is, different similarity score matrices correspond to different sets of conservative substitutions. The diagonal terms of the matrix vary appreciably. This variation reflects both how often an amino acid is found in protein sequences and how often it is observed to be replaced by other amino acids. Thus rare amino acids which are replaced infrequently have the highest scores.

The Blosum Family of Matrices

There are three principal differences between the Blosum and PAM matrices. The first difference is that the PAM matrices are based on an explicit evolutionary model (that is, replacements are counted on the branches of a phylogenetic tree), whereas the Blosum matrices are based on an implicit rather than explicit model of evolution. The second difference is the sequence variability in the alignments used to count replacements. The PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions. The Blosum matrices are based only on highly conserved regions in a series of alignments forbidden to contain gaps. The last difference is in the method used to count the replacements. The Blosum procedure uses groups of sequences within which not all mutations are counted the same.

Blosum 45 Amino Acid Similarity Matrix

G   7
P  -2   9
D  -1  -1   7
E  -2   0   2   6
N   0  -2   2   0   6
H  -2  -2   0   0   1  10
Q  -2  -1   0   2   0   1   6
K  -2  -1   0   1   0  -1   1   5
R  -2  -2  -1   0   0   0   1   3   7
S   0  -1   0   0   1  -1   0  -1  -1   4
T  -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
A   0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
M  -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
V  -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
I  -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
L  -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
F  -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Y  -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
W  -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
C  -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12
    G   P   D   E   N   H   Q   K   R   S   T   A   M   V   I   L   F   Y   W   C

Summary: PAM and Blosum Matrices

The Blosum and PAM matrices are the most widely used amino acid similarity matrices for database searching and sequence alignment. In empirical tests of the effectiveness of the matrices both generally perform well. However, the Blosum matrices have most often been the better performers. This likely reflects the fact that the Blosum matrices are based on the replacement patterns found in more highly conserved regions of the sequences. This appears to be an advantage because these more highly conserved regions are those discovered in database searches and they serve as anchor points in alignments involving complete sequences. It is reasonable to expect that the replacements that occur in highly conserved regions will be more restricted than those that occur in highly variable regions of the sequence. This is supported by the different pattern of positive and negative scores in the two families of matrices. These different patterns of positive and negative scores reflect different estimates of what constitute conservative and nonconservative substitutions in the evolution of proteins. These differences reflect the differences in constructing the two families of matrices. Some of the difference is also likely to be because the Blosum matrices are based on much more data than the PAM matrices.

The PAM matrices still perform relatively well despite the small amount of data underlying them. The most likely reasons for this are the care used in constructing the alignments and phylogenetic trees used in counting replacements and the fact that they are explicitly based on a simple model of evolution. Thus they still perform better than some of the more modern matrices that are less carefully constructed. Both the PAM and Blosum matrices generally perform better than matrices explicitly based on criteria other than observed replacement frequencies.

We can see the concrete result of these differences in the PAM 250 and Blosum 45 matrices shown above. These two similarity tables are directly comparable and are suitable for alignments of sequences with the same degree of divergence from each other. They have the same amount of information per alignment position (entropy) for determining whether or not sequences are homologous. Thus differences in these scores should reflect differences in the data and model used in counting substitutions rather than any other effects.

One striking difference is among the amino acids with carboxylate and amide side chains. At physiological pH the carboxylate side chains can act only as proton acceptors, while the amide side chains can simultaneously accept and donate a proton. In the PAM 250 table these four amino acids, Asp, Asn, Glu, and Gln (along with His), form a single conservative substitution group; that is, the similarity score for any pair is greater than zero. His has two nitrogen atoms in its side chain and, like the amide side chains of Asn and Gln, can simultaneously donate and accept a hydrogen bond. In the Blosum 45 similarity table this single group must be split into two groups: one with the carboxylate amino acids Asp and Glu, and the other with Asn, Gln, and His.

Another noticeable difference is the relationship between Ser and Thr, the alcoholic side chain amino acids, and other small amino acids. In the PAM 250 table any of the three amino acids Gly, Pro, or Ala can be added to the Ser, Thr pair to form a three member conservative substitution group. In the Blosum 45 table Ser and Thr form a two member conservative substitution group with no other possible members. The last difference we will mention is that only Phe and Tyr are members of an aromatic conservative group in the PAM 250 table while Trp is also a member in the Blosum 45 table.

Which Similarity Scores to Use

Similarity scores for sequence alignments perform much better if the similarity scores are based on replacement patterns that correspond to the degree of divergence of the aligned sequences. More information about similarity scores is available in the references below or in the on-line tutorial on sequence database searching at the PSC (http://www.psc.edu/biomed/TUTORIALS/SEQUENCE/DBSEARCH/tutorial.html).

Altschul, S.F. 1991. “Amino acid substitution matrices from an information theoretic perspective.” Journal of Molecular Biology, 219: 555-565. This paper looks at the PAM and Blosum scoring matrices in the context of information theory and develops guidelines for making effective use of the information encapsulated in scoring matrices.

Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. 1978. “A model of evolutionary change in proteins.” In “Atlas of Protein Sequence and Structure” 5(3) M.O. Dayhoff (ed.), 345 – 352, National Biomedical Research Foundation, Washington. This paper describes the development of the PAM family of protein scoring matrices.

States, D.J., Gish, W., Altschul, S.F. 1991. “Improved Sensitivity of Nucleic Acid Database Search Using Application-Specific Scoring Matrices.” Methods: A companion to Methods in Enzymology 3(1): 66 – 77. This paper describes scoring matrices for nucleic acid sequences that take into account different levels of sequence divergence and different rates of transversions and transitions.

Steven Henikoff and Jorja G. Henikoff. 1992. “Amino acid substitution matrices from protein blocks.” Proc. Natl. Acad. Sci. USA. 89(biochemistry): 10915 – 10919. This paper describes the calculation of the Blosum family of protein scoring matrices.

M.S. Johnson and J.P. Overington. 1993. “A Structural Basis of Sequence Comparisons: An evaluation of scoring methodologies.” Journal of Molecular Biology. 233: 716 – 738. Comparison of amino acid substitution matrices with visual representation of the important features of the matrices for protein similarity and the differences between them.

Steven Henikoff and Jorja G. Henikoff. 1993. “Performance Evaluation of Amino Acid Substitution Matrices.” Proteins: Structure, Function, and Genetics. 17: 49 – 61. Comparison of Amino Acid substitution matrices.

Karlin, S. and Altschul, S.F. 1990. “Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes” Proc. Natl. Acad. Sci. USA. 87: 2264 – 2268.

GeneDoc Percent Identity Plots

The percent identity plots in GeneDoc are an alternative to tables showing the percentage of sequence residues that are identical in every pair of sequences in the alignment. Such tables have often been published as a way to describe the degree of divergence among the sequences in the alignment. This has been a useful guide to the probable amount of information available in the alignment and robustness of the alignment. More divergent sequences make the alignments more informative but also less robust and reliable.

The data in these plots can be either strict identities or counts in which conservative or favorable substitutions are treated as if they were also identical residues. These are the same values that GeneDoc computes for the “Stats” or statistics view, where they are shown in the upper right triangular part of the table.

This type of plot is called a cumulative distribution function (c.d.f.) or sometimes an empirical cumulative distribution function. In this context empirical refers to the fact that the plot is of experimental, or empirical, data rather than derived from a mathematical equation describing a theoretical distribution such as the Gaussian distribution. The plot is created by first sorting the data to be plotted into ascending order. Then for each data point you compute the fraction of data points that have the same or a smaller value. The data is then compressed to eliminate multiple points that have the same data value. The point with the highest value for the fraction of data items with the same or lower value is retained during compression. These compressed data points are then plotted as a step function.

The step function has as its horizontal axis the data values being plotted. The vertical axis is the fraction of data points with as small or smaller a data value. Thus it always ranges from 0.0 to 1.0. Plotting the data as a step function requires drawing two line segments for all of the data points except the first. For the first data point we draw a single vertical line from the horizontal axis up to the first data point. For all subsequent points we draw two line segments. The first line segment is a horizontal segment from the data value of the previous point to the data value of the current point. This line segment is drawn at the fractional value for the previous data point. This indicates the function retains this value until the next observed data value is reached. The second line segment is a vertical segment from the fractional value of the previous data point to the fractional value of the current data point. This gives the plot its characteristic appearance of a set of irregular steps.
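The construction described above can be sketched in a few lines of Python. This is only an illustration of the procedure, not GeneDoc's own code; the function names and the sample percent identities are made up for the example.

```python
from collections import Counter

def empirical_cdf(values):
    """Return sorted (value, fraction) pairs, one per distinct value.

    fraction is the share of data points less than or equal to value,
    so tied values collapse onto the point with the highest fraction.
    """
    n = len(values)
    counts = Counter(values)
    points = []
    running = 0
    for v in sorted(counts):
        running += counts[v]
        points.append((v, running / n))
    return points

def step_segments(points):
    """Turn c.d.f. points into line segments ((x0, y0), (x1, y1))."""
    first_x, first_y = points[0]
    # First point: a single vertical line rising from the horizontal axis.
    segments = [((first_x, 0.0), (first_x, first_y))]
    for (px, py), (cx, cy) in zip(points, points[1:]):
        segments.append(((px, py), (cx, py)))  # horizontal, at the previous fraction
        segments.append(((cx, py), (cx, cy)))  # vertical, up to the current fraction
    return segments

identities = [35, 42, 42, 58, 61, 61, 61, 77]  # made-up percent identities
cdf_points = empirical_cdf(identities)
segments = step_segments(cdf_points)
```

With five distinct values the sketch yields one segment for the first point and two for each of the other four, nine segments in all.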

The plot is related to the more commonly used histogram plot in a straightforward manner. As the name, c.d.f., implies it is the cumulative plot of the same data plotted in the histogram. The c.d.f. can be thought of as the integral of the distribution plotted in the histogram. This leads many users to ask “Well, why can’t I have the histogram, it is easier to look at?” Our response is that c.d.f.s are also easy to look at; you just have not looked at them enough to be comfortable with them yet. More importantly, drawing a histogram of experimental data, an empirical histogram, always requires putting the data into bins, and this binning always results in the loss of some information that is not lost in the c.d.f. One piece of information that is immediately obvious from the c.d.f. of any distribution is its median. With a histogram the approximate location of the median is easily seen only for smooth, symmetrical, unimodal, and unskewed distributions, such as the normal distribution.

Another important reason for using c.d.f.s rather than histograms is the Kolmogorov-Smirnov (K-S) test for the difference between distributions. The K-S test determines whether two distributions are different from each other and is performed on their c.d.f.s. Thus it can easily be presented in terms of the c.d.f.s, but not in terms of the histograms. GeneDoc uses this test in several places. It is used in conjunction with alignment scores to test whether the groups you have defined are statistically distinct. It does this by comparing the distribution of scores between pairs of sequences within groups with the distribution of scores between pairs of sequences in different groups. If the groups are distinct the scores for pairs of sequences in different groups are larger than the scores for pairs of sequences within the same group.

GeneDoc also uses the K-S test to tell you if the sequences in one group are more diverged from each other than are the sequences in a second group. The K-S test allows GeneDoc to compute a significance probability to tell you how likely the observed difference in the c.d.f.s is to have occurred by chance in groups of the same size. The K-S test is a useful general purpose test for differences between two groups because it measures all aspects of the differences between groups. That is, the K-S test will pick up differences in the shape of the c.d.f.s of the two groups as well as differences in the average of two groups.

More commonly used tests such as Student’s t test and the F test detect differences only in the average or the variance of the groups, respectively.

The percent identity plot is also useful for selecting a scoring matrix to use with the alignment. Particularly for protein sequence alignments, the scoring matrix is important both for computing alignment scores and for determining which amino acids are included in the equivalence groups defined and used in some GeneDoc functions. The PAM matrices for both proteins and nucleic acids and the Blosum matrices for proteins are each tuned for use with groups of sequences that have diverged to a particular degree. The right hand column in the table below shows the degree of divergence appropriate for the different PAM matrices shown in the middle column. The left hand column identifies the Blosum matrix that supplies the same amount of entropy per position in an alignment. Matrices that show the same amount of entropy per position in an alignment are suitable for sequences of the same degree of divergence.

Comparable Blosum and PAM Scoring Matrices based on Equivalent Entropy

Blosum Table           Entropy of the   PAM Value      Entropy of the   Percent Sequence Identity
(Degree of Identity)   Blosum Table     of the Table   PAM Table        of the PAM Table
Blosum 90              1.18             PAM 100        1.18             43
Blosum 80              0.99             PAM 120        0.98             38
Blosum 60              0.66             PAM 160        0.70             30
Blosum 52              0.52             PAM 200        0.51             25
Blosum 45              0.38             PAM 250        0.36             20
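
One way to use the table is to look up the row whose percent sequence identity lies closest to a typical value read from the percent identity plot, for example the median. The short Python sketch below does this; the nearest-row rule and the function name are assumptions made for illustration and are not a GeneDoc feature.

```python
# Rows copied from the table above: (Blosum table, PAM table, percent identity).
COMPARABLE_TABLES = [
    ("Blosum 90", "PAM 100", 43),
    ("Blosum 80", "PAM 120", 38),
    ("Blosum 60", "PAM 160", 30),
    ("Blosum 52", "PAM 200", 25),
    ("Blosum 45", "PAM 250", 20),
]

def closest_tables(percent_identity):
    """Return the table row whose percent identity is nearest the given value.

    The nearest-row rule is an assumption of this sketch.
    """
    return min(COMPARABLE_TABLES, key=lambda row: abs(row[2] - percent_identity))

suggestion = closest_tables(28)  # e.g. a median identity of 28 percent
```

For sequences that are more alike than the top row of the table, the sketch simply returns the top row; the table itself gives no guidance beyond its range.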

Group and Secondary Structure Features

Increasingly, families of genes and proteins are organized into superfamilies. Database organizations often use the percent of residue identity as the criterion for distinguishing whether a pair of sequences should be assigned to the same family or different families within a superfamily. Perhaps a more useful criterion is whether or not a gene duplication has taken place in the common evolutionary history of the two sequences. Evolutionary biologists use this classification and refer to homologous sequences that have only speciation events and no gene duplication events in their common evolutionary history as being orthologous. Homologous sequences that have a gene duplication event in their common evolutionary history are referred to as paralogous. Orthologous genes or proteins generally carry out the same biochemical and physiological functions while paralogous proteins generally carry out related but distinct functions. For instance, mammalian myoglobins, which carry oxygen within cells, are an orthologous family. They are part of a superfamily that includes the alpha hemoglobins and the beta hemoglobins, both of which are also orthologous families. These three homologous families are part of the same paralogous superfamily, as are other globin genes and proteins.

Families and superfamilies can be organized around functional criteria as well as evolutionary criteria and sequence divergence. Although all sixty plus transfer RNA sequences in E. coli are paralogous with each other they can be grouped together by their twenty amino acid acceptor activities.

The group functions in GeneDoc are designed to allow users to work with and analyze groups based on any of the above criteria or any user determined criteria for dividing a set of sequences into groups. The first step in working with groups is to access the group configuration dialog. It can be accessed either by selecting the “edit sequence groups” item on the Groups menu or by clicking the groups button, the button marked with an upper case G on the upper toolbar. The group configuration dialog allows the user to allocate the sequences to groups and to select a color to be associated with each group. For the purposes of most of the GeneDoc group analyses, each sequence that is not explicitly assigned to a group will be treated as the only member of its own group. These implicit groups will not be analyzed, but the sequences will be used in the analyses of the defined groups as “other” groups and thus they will contribute to the analysis.

The group analyses result in different shading for individual groups. These shadings highlight different degrees of different kinds of conservation of residues or properties within and between groups. One analysis, referred to as the Dstat analysis, measures how different the groups are from one another and whether this difference is statistically significant. The Dstat analysis presents its results as a graph and numerical values. The simplest analysis is performed by the “shade group conserved” entry on the Groups menu. This analysis highlights positions within each group that are completely conserved; that is, there is only one residue at that position within the group. This highlighting is done in the color assigned to the group in the group configuration dialog. This measurement of conservation within the groups does not take into account any equivalency groups, even if they are active. The second thing this analysis does is to highlight the positions that are completely conserved across all of the groups; that is, there is only one residue at that position for all of the sequences in the alignment. This part of the analysis does take the equivalency groups into account if they are in effect. The final action is to compute a consensus sequence based on the entire alignment.
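
The two conservation checks made by “shade group conserved” can be illustrated with a small Python sketch. It is not GeneDoc's implementation: the sequences and group names are invented, equivalency groups are ignored in both checks, and a gap character would simply be treated like any other residue, all of which are simplifying assumptions.

```python
def conserved_columns(sequences):
    """Return the column indexes at which every sequence has the same residue."""
    length = len(sequences[0])
    return [i for i in range(length) if len({s[i] for s in sequences}) == 1]

# Invented aligned sequences, all of the same length.
groups = {
    "A": ["MKVLA", "MKILA"],
    "B": ["MRVLG", "MRVLG", "MRVIG"],
}

# Columns completely conserved within each group.
within = {name: conserved_columns(seqs) for name, seqs in groups.items()}

# Columns completely conserved across every sequence in the alignment.
all_sequences = [s for seqs in groups.values() for s in seqs]
across = conserved_columns(all_sequences)
```

Here only the first column is conserved across the whole alignment, while each group has several columns conserved internally.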

The most useful information derived from this analysis is to identify for the user the regions of the alignment where structural or functional requirements may have been relaxed or eliminated (or alternatively added as the group evolved a new function) for some groups relative to others. For this kind of information to be reliable the conserved groups need to be both large and from a diverse range of organisms. Otherwise the observed conservation may simply be the result of a small data set with highly dependent observations.

A more stringent analysis is performed by the “shade group PCR contrast” entry on the Groups menu. Sites highlighted by this analysis meet two criteria. First, a single residue is completely conserved within the group. Second, this conserved residue does not appear, at that position, in any sequence outside of the group in which it is conserved. This analysis marks unique sequence features of the group that can be useful in defining a group motif and possibly in defining a primer sequence to be used in a polymerase chain reaction (PCR) amplification of the gene.
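
The two criteria can be written down compactly. The Python sketch below is an illustration only, using invented sequences and a hypothetical function name; it is not taken from GeneDoc.

```python
def pcr_contrast_columns(group, others):
    """Columns where the group has one residue that never occurs outside it."""
    columns = []
    for i in range(len(group[0])):
        inside = {s[i] for s in group}
        outside = {s[i] for s in others}
        # Criterion 1: a single residue within the group.
        # Criterion 2: that residue is absent from every other sequence.
        if len(inside) == 1 and inside.isdisjoint(outside):
            columns.append(i)
    return columns

group_a = ["MKVLA", "MKILA"]           # invented group of interest
everyone_else = ["MRVLG", "MRVLG", "MRVIG"]
highlighted = pcr_contrast_columns(group_a, everyone_else)
```

Column 0 is conserved in the group but shares its residue with the other sequences, so it is not highlighted; columns 1 and 4 meet both criteria.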

The “shade group contrast” entry on the Groups menu performs an analysis similar to that of the “shade group PCR contrast” entry. This analysis makes use of the scoring table designated for alignment scoring to divide scores for pairs of amino acids into three classes, positive, negative, and neutral. The positive scores are those that are positive numbers in the similarity scores form of the table. Similarly, the negative scores are those that are negative numbers in the similarity scores form of the table, while the neutral scores have a zero score. The scores are stored in GeneDoc as distance or dissimilarity scores and hence must be converted to the similarity form. This is done by subtracting the score for a pair of sequence residues in the table from a constant called the zero cost distance, stored with the table. Thus the largest distances become negative similarities and small distances become positive similarities. The interpretation of the scoring tables is that positive similarities are conservative substitutions and are favored over random substitutions in the evolutionary process relating the sequences.
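
The conversion from stored distances to similarities, and the three-way classification that follows from it, can be sketched as below. The zero cost distance of 10 and the three distance values are hypothetical numbers chosen only to show one score of each class; they are not taken from any GeneDoc table.

```python
def to_similarity(distance, zero_cost_distance):
    """Convert a stored distance score to its similarity form."""
    return zero_cost_distance - distance

def score_class(similarity):
    """Classify a similarity score as positive, negative, or neutral."""
    if similarity > 0:
        return "positive"
    if similarity < 0:
        return "negative"
    return "neutral"

ZERO_COST_DISTANCE = 10  # hypothetical constant stored with the table
stored_distances = {"pair 1": 8, "pair 2": 10, "pair 3": 13}  # hypothetical

classes = {
    pair: score_class(to_similarity(d, ZERO_COST_DISTANCE))
    for pair, d in stored_distances.items()
}
```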

The analysis performed by the “shade group contrast” entry on the Groups menu is less restrictive about the degree of conservation within the group than is the analysis performed by the “shade group PCR contrast” entry. All of the sequence residues found at a position within the group are required to have a positive similarity score with each other, and thus to be conservative substitutions. This analysis is, however, more restrictive than is the analysis performed by the “shade group PCR contrast” entry on the Groups menu when dealing with residues outside the group. The residues outside of the group must have a negative similarity score with every residue from within the group; thus they are not allowed to be either conservative or neutral substitutions. An example of using this kind of analysis to study the recognition of transfer RNAs by aminoacyl tRNA synthetase enzymes can be found in McClain and Nicholas, 1987. Nicholas et al., 1987 describes using the contrasts to plan site directed mutagenesis experiments to confirm the analysis of the tRNAs.
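
The test applied to a single alignment column can be sketched in Python as follows. The handful of similarity scores is copied from the lower-triangular similarity table printed earlier in this manual; the function names and the partial lookup table are assumptions made for the example, and a real implementation would need the full table.

```python
from itertools import combinations

# A few similarity scores copied from the table printed earlier in this manual.
SIMILARITY = {
    frozenset("DE"): 2,
    frozenset("DK"): 0, frozenset("EK"): 1,
    frozenset("DL"): -3, frozenset("EL"): -2,
    frozenset("DV"): -3, frozenset("EV"): -3,
}

def similarity(a, b):
    """Look up the score for two different residues (partial table only)."""
    return SIMILARITY[frozenset((a, b))]

def is_group_contrast(inside_residues, outside_residues):
    """Apply the group contrast criteria to one alignment column."""
    inside, outside = set(inside_residues), set(outside_residues)
    # Every pair of distinct residues within the group must score positive.
    within_ok = all(similarity(a, b) > 0 for a, b in combinations(sorted(inside), 2))
    # Every inside/outside pair must score negative (neutral is not enough).
    outside_ok = all(similarity(a, b) < 0 for a in inside for b in outside)
    return within_ok and outside_ok

acidic_vs_hydrophobic = is_group_contrast("DED", "LVL")
acidic_vs_mixed = is_group_contrast("DE", "KL")
```

The second column fails because Asp and Lys have a neutral score of zero, which the contrast analysis does not accept for residues outside the group.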

The analysis called the Dstat analysis is the Kolmogorov-Smirnov test for the equality of two distributions. The Dstat analysis is accomplished by first selecting a region (or all) of the alignment for use in the test calculations. Then you can either select the analysis under the Dstat menu or click the Dstat tool bar button. The Dstat toolbar button is near the right end of the upper toolbar and is marked by a pair of “S” shaped curves representing the cumulative distributions used in the test. As noted above, the Dstat analysis is a statistical test of whether the groups defined by the user are significantly different from each other.

The first step in the test is to compute an alignment score for each pair of sequences over the region selected by the user. These scores are then partitioned into two distributions. The first distribution is composed entirely of scores where the two sequences used to compute the score are members of different user defined groups. This is called the between groups distribution. The second distribution is composed entirely of scores where both of the sequences used to compute the score are members of the same user defined group. Note that this includes scores from every group with two or more sequences. This distribution is called the within groups distribution.
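
The partitioning step can be sketched in Python as shown below. The sequence names, group labels, and pair scores are invented, every sequence is assumed to be assigned to a group, and the scoring function is passed in as a parameter because the sketch does not reproduce GeneDoc's alignment scoring.

```python
from itertools import combinations

def partition_scores(group_of, score):
    """Split pairwise scores into within groups and between groups lists.

    group_of maps each sequence name to its group label;
    score is any function taking two sequence names and returning a number.
    """
    within, between = [], []
    for a, b in combinations(sorted(group_of), 2):
        target = within if group_of[a] == group_of[b] else between
        target.append(score(a, b))
    return within, between

group_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}  # invented assignment
pair_scores = {  # invented alignment scores
    frozenset(("s1", "s2")): 12, frozenset(("s3", "s4")): 15,
    frozenset(("s1", "s3")): 40, frozenset(("s1", "s4")): 44,
    frozenset(("s2", "s3")): 38, frozenset(("s2", "s4")): 41,
}

within, between = partition_scores(
    group_of, lambda a, b: pair_scores[frozenset((a, b))]
)
```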

These two distributions are plotted as cumulative distributions. That is, the score is plotted versus the fraction of the scores in the distribution that are less than or equal to the score being plotted. The Kolmogorov-Smirnov D statistic (Dstat) is the maximum difference between the two distributions (along the fractional axis). Recent advances in the understanding of the distribution of values taken on by Dstat allow us to compute its one-sided significance probability. The one-sided significance probability is used rather than the two-sided significance probability because we are only interested in the case where the between groups distribution is composed of larger scores than the within groups distribution. The other situation, where the within groups distribution is composed of larger scores than the between groups distribution, corresponds to either convergent evolution or some sort of selection in favor of divergence, situations that are not usually part of the hypothesis.
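
A minimal Python sketch of the D statistic follows. It computes only the one-sided maximum gap for the case of interest (within groups scores smaller than between groups scores) and does not attempt the significance probability; the function name and the sample scores are assumptions for illustration, not GeneDoc's code.

```python
def one_sided_d(within, between):
    """Largest amount by which the within c.d.f. exceeds the between c.d.f."""
    def cdf(sample, x):
        # Fraction of the sample less than or equal to x.
        return sum(1 for v in sample if v <= x) / len(sample)

    observed = set(within) | set(between)
    return max(cdf(within, x) - cdf(between, x) for x in observed)

d_separated = one_sided_d([12, 15], [38, 40, 41, 44])   # no overlap at all
d_overlapping = one_sided_d([1, 2, 3, 4], [3, 4, 5, 6])  # partial overlap
```

Completely separated distributions give the maximum possible value of 1.0, while the overlapping example gives 0.5. Statistical libraries such as SciPy provide a two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) that also reports a significance probability, though its results are not guaranteed to match GeneDoc's calculation.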

The Kolmogorov-Smirnov test was selected instead of the more common Student’s t test or the F test because it is sensitive to both the location of the distributions along the scores axis and to the shape of the distribution. Student’s t test is sensitive to only the location of the distributions and the F test is sensitive only to differences in the variance of the distribution, only one of several aspects affecting the shape of the distributions. Thus the Kolmogorov-Smirnov test can find the distributions to be different when either Student’s t test or the F test might have failed. Because of this it is necessary for the user to examine the plot carefully to determine the exact nature of the differences in the two distributions being tested. The user should exercise care that the biological hypothesis being examined should lead to the type of difference actually observed.

Examples of testing biological hypotheses with sequence data and the Kolmogorov-Smirnov test can be found in Nicholas and Graves, 1983 and in Nicholas and McClain, 1995. The Nicholas and Graves paper contains an extended discussion of formulating Kolmogorov-Smirnov tests that correspond to different kinds of biological hypotheses.

Nicholas, H.B. Jr., and Graves, S.B. 1983. Clustering of transfer RNA by cell type and amino acid specificity. Journal of Molecular Biology, vol. 171, pp. 111 – 118.

Nicholas, H.B., Jr., Chen, Y-M., and McClain, W.H. 1987. Comparisons of transfer RNA sequences. Computer Applications in the Biosciences, vol. 3, p. 53.

McClain, W.H. and Nicholas, H.B., Jr. 1987. Discrimination between transfer RNA molecules. Journal of Molecular Biology, vol. 194, pp. 635 – 642.

Nicholas, H.B. Jr. and McClain, W.H. 1987. An algorithm for discriminating transfer RNA sequences. Computer Applications in the Biosciences, 3, pp. 177 – 181.

Nicholas, H.B. Jr. and McClain, W.H. 1995. Searching tRNA Sequences for Relatedness to Aminoacyl tRNA Synthetase Families. Journal of Molecular Evolution, vol. 40, pp. 482-486.

Secondary Structure Features

The structure groups and shading facility provide an extremely powerful and flexible set of tools for integrating sequence information with structural information. The facility is flexible enough to allow the user to display almost any kind of information as color codes along the sequence. Such states can include the obvious secondary structure state of the residue in the three dimensional structure. Less obvious properties like the solvent accessible surface area of the residue or its side chain can also be displayed. The fraction of the side chain in a polar environment is another characteristic that is sometimes informative.

One of the most powerful kinds of integration of structure and sequence information allows the user to visually examine the variation in structure or some structural property that occurs as the sequence varies in a series of homologous proteins. This same display allows the user to adjust the alignment based on information derived from the varying structures of a series of homologous proteins.

Alternatively, you can have several copies of the same sequence in the alignment by adding additional copies with different names through the sequence import facility. Remember that each copy must have its own distinct name. These copies of the same sequence can all be highlighted using a different property. This allows you to easily visualize possible correlation of properties or of properties with sequence or structure.

Another use for multiple copies of a single sequence in the alignment is to contrast predictions of structure and properties with those observed in an experimentally determined three dimensional structure.

A widespread use of the structure shading is to project the structure or properties from a sequence of known structure onto sequences whose structures have not been experimentally determined. This is accomplished by combining the structure shading facility with the group facility. At the same time you use the structure shading dialog to associate a structure file with a specific sequence in the alignment, you can designate that sequence as the master sequence of a structural group. This allows you to associate several sequences with a single structure file. These associated sequences will be shaded with the same colors in the same column of the alignment as the master sequence regardless of the sequence residue present in that position of the sequence. Thus this is essentially a low resolution homology modeling facility.

All of these features can be combined into a very informative display for studying structure-function relationships in the following procedure. First, associate a structure file with each sequence in your alignment whose structure has been determined by X-ray crystallography or NMR. Make these sequences the master sequence for a group of the most closely related sequences in the alignment. Ideally the sequences within each group should have common biochemical properties such as substrate specificity, while different groups can have different substrate specificities. Use the sequence editing facility to put sequences in the same group on adjacent rows of the alignment. Put the group master sequence at the top of each group. Then set the display mode to differences mode and make sure this is applied to all of the groups.

This yields a display with all of the group master sequences, that is the sequences with known structures, displayed with all of their residues. The other sequences in each group have a dot displayed where they have the same sequence residue as the group master. Sequence residues that are different from the group master sequence are shown. This highlights substitutions within each group that are presumably successful site directed mutagenesis experiments performed by nature. Differences between groups may be associated with the change in substrate specificity or other property.
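
The differences display for one group can be mimicked with a few lines of Python. This is only an illustration with invented sequences and a hypothetical function name; how GeneDoc itself treats gap characters in this mode is not covered here.

```python
def differences_row(master, sequence):
    """Show a dot where the sequence matches the group master, else the residue."""
    return "".join("." if s == m else s for m, s in zip(master, sequence))

master = "MKVLA"                 # invented group master sequence
members = ["MKILA", "MRVLA"]     # invented members of the same group

display = [master] + [differences_row(master, seq) for seq in members]
```

Printed one per line, the result shows the master in full and only the substituted residues in the rows beneath it.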

It can be very helpful to have exactly the same alignment in a second window highlighted in group contrast mode. This combination of displays can be a very powerful tool for examining structure-function relationships by integrating a large amount of information in an easy to comprehend format and presentation.