Fshift Users Manual

Protein-implied protein optimal sequence alignment program with special treatment of sequencing errors and frameshifts

Authors: Alexander J. Ropelewski and Dr. Hugh B. Nicholas

Copyright (C) 1993-1998 Pittsburgh Supercomputing Center.

The FSHIFT program was developed under the National Institutes of Health NLM grant NIH R01 LM05513 and enhanced under the National Institutes of Health NCRR grant 1 P41 RR06009 and 2 P41 RR06009

This document refers to Fshift V1.1



  1. Introduction
  2. General Program Usage
  3. Selecting Program Parameters
  4. Analysis of Results
  5. Appendices

Introduction

The process of transcribing a DNA sequence into a protein sequence is often complicated when a shift in the reading frame occurs. These shifts can be caused by sequencing errors as well as a variety of biological reasons Some of the common causes of frame shifts are:

  • The reading frame shift is an artifact of an accidental insertion or deletion of a base by the researcher (or automated software) performing the sequencing. Experimental errors often complicate the identification of the sequence's function. Estimates of experimental errors range from 1-6%.
  • The reading frame shift is a result of mutation; when the number of bases inserted into or deleted from the coding region of a nucleic acid sequence is not divisible by three. This type of mutation causes the protein reading frame to change for all sequence elements downstream of the insertion or deletion.
  • The reading frame shift occurred at a slippery site on the mRNA sequence. The resulting protein sequence may have been created through the bypassing of a single, a few, or several nucleotides. Although mRNA slippage may lead to a truncated protein, a functional protein is often the result of the slippage.
  • The shift in reading frame is caused by joining of exons that are separated by an intron region whose base length is not divisible by three.

Unfortunately, most standard database searching techniques (Smith and Waterman 1981, Altschul et. al 1990, Pearson et. al 1990) do not explicitly take reading frame shifts into consideration. Many methods present only alignments that score above a certain threshold. Often, small frame shifted segments occur in poor quality regions of one pass, automatically read sequences. Many of these short segments score below the threshold used by the programs.

Fshift is a dynamic programming sequence alignment algorithm that explicitly incorporates frameshifts in a biologically realistic manner. The program takes a protein query sequence and compares it with a translated DNA library sequence. The translated DNA library sequence is examined in the three forward reading frames, looking for a frame-shifts that are primarily due to laboratory sequencing errors. The protein alignment presented is allowed to jump from frame to frame at the cost of a penalty that reflects the infrequency of frameshift mutations.

General Program Usage

 Fshift -q query_file -l library_file -o output_file -s pam_distance 
        -g gap_length_penalty -n newgap_penalty -f shift_penalty
        -d2 d2_shift_penalty -d1 d1_shift_penalty -i1 i1_shift_penalty
        -i2 i2_shift_penalty -e expect -c cutoff 
        -p number_of_traceback_paths [-msp]
  • Use the -q option to enter the query filename. For example, If you wanted to use the query file query.dat, Enter: "Fshift -q query.dat" The program is designed accept sequence files in the NBRF-PIR, GenBank, EMBL, and Swiss-Prot individual sequence formats.
  • Use the -l option to enter the library filename. For example, If you wanted to use the library file seq.dat, Enter: "Fshift -l seq.dat" The program is designed accept sequence files in the NBRF-PIR, GenBank, EMBL, and Swiss-Prot individual sequence formats.
  • Use the -o option to send the programs output to a file. For example, If you wanted to send output to the file out.dat, Enter: "Fshift -o out.dat"
  • Use the -s option to set the PAM scoring matrix. This version can make use of PAM matricies at an evolutionary distance of between 10-500. For example, If you wanted to set the scoring matrix to PAM 120, Enter: "Fshift -s 120"
  • Use the -g option to set the gap length penalty.This number is divided by 100. For example, If you wanted to set the gap length penalty to 5, Enter: "Fshift -g 500"
  • Use the -n option to set the new gap penalty. (Sometimes called an open gap penalty) This number is divided by 100. For example, If you wanted to set the new gap penalty to 8, Enter: "Fshift -g 800"
  • Use the -f option to set a fixed frame shift penalty (ie d1 = d2 = i1 = i2) This number is divided by 100. For example, If you wanted to set this penalty to 2, Enter: "Fshift -f 200" If d1, d2, i1 and/or i2 penalities are used with this option, the -f penalty should be precede the individual penalities on the command line.
  • Use the -d1 option to set the delete one base frame shift penalty. This number is divided by 100. For example, If you wanted to set this penalty to 2, Enter: "Fshift -d1 200" Experimentally this kind of event is common in the sequence databases.
  • Use the -d2 option to set the delete two base frame shift penalty. This number is divided by 100. For example, If you wanted to set this penalty to 2, Enter: "Fshift -d2 200". Experimentally this kind of event is rare in the sequence databases.
  • Use the -i1 option to set the insert one base frame shift penalty. This number is divided by 100. For example, If you wanted to set this penalty to 2, Enter: "Fshift -i1 200". Experimentally this kind of event is common in the sequence databases.
  • Use the -i2 option to set the insert two base frame shift penalty. This number is divided by 100. For example, If you wanted to set this penalty to 2, Enter: "Fshift -i2 200" Experimentally this kind of event is rare in the sequence databases.
  • Use the -e option to enter an expect value.
  • Use the -c option to enter a cutoff value. This value is set to 10% of the maximum score, unless if this option is used. Scores > this value will not be reported. For example To set this value to -100, Enter: "Fshift -c -100"
  • Use the -p option to set the maximum number of tracebackpaths. To speed searches, this value is set to 25 by default. For example, to set this value to 100, Enter: "Fshift -p 100"
  • Use the -msp option to perform an initial scan using a maximal segment pair algorithm. This will increase the performance of the code, but decrease the sensitivity.

Selecting Program Parameters

Scoring parameters for sequence alignments have been discussed for years. Most of the discussion on scoring parameters has centered around replacement matrix choice(Dayhoff et al 1978, Altschul 1991, Henikoff and Henikoff 1992, Altschul, 1993) and the selection of appropriate gap penalties (Fitch and Smith 1983, States et. al. 1991, Vingron and Waterman, 1994). Although some sets of parameters work better than others for general cases, there is unfortunately, no single set of scoring parameters that can be used to find every biologically significant alignment. Our program uses amino acid PAM matrices at a variety of distances(See Altschul 1991). For advice on selecting appropriate gap values see Fitch and Smith (1983) or Vingron & Waterman (1994).

Finding the proper shift penalty to use is a subject that needs further research. Studies show that the rate of sequencing errors is between one and six percent (Kristensen, et. al. 1992, Krawetz, 1989). Although the total error rates may be within this range, we found that the location of most sequencing errors are within a dense window. The error rates within these dense windows often exceeds the 6% rate mentioned above. For example, we have found that sequencing errors in poor data quality regions usually exceed one error per ten bases (10%).

Our experience shows that for database searching, using a shift penalty equal to the weight of a single gap usually produces adequate results for frameshifts found in high quality data regions. However, if one suspects sequencing errors in a poor data quality region of a single EST sequence, a lower shift penalty may produce better results. Our results also indicate that with EST sequences, using different penalities for -i1, -i2, -d1, -d2 produce better results with penalities for -i2 and -d2 set to twice the -i1 and -d1 penalties.

Analysis of Results

To help analyze the resulting alignments, our program breaks down the alignments into segment pairs that have a length of at least 10. These segment are the segment pairs that can be joined together by inserting gaps to yield an optimal alignment. Statistical probabilities can be calculated on these segment pairs according to the methods of Karlin and Altschul (1990). These probabilities are presented to the user, along with the score of the segment pair. Segment pair statistics provide useful information to help the user distinguish between random alignments and biologically significant alignments.

The statistics presented should be used only as a guide for the accompanying alignment. Many biologically significant alignments contain segment pairs that when examined individually do not yield statistical significance. Karlin and Altschul (1993) discuss summing segment pair probabilities to generate an overall statistical significance. Although this is appropriate for all segment pairs in a single frame alignment, caution should be used when joining regions accross different frames. In poor data quality regions, statistical probabilities were usually not a useful way to demonstrate that there was a genuine sequencing error. However, for some cases, statistics provided useful evidence.

At the end of the run, the program will produce a histogram of the alignment scores. In most cases, the scores should follow an extreme value distribution. An example distribution is shown below.

Histogram of scores:

  Plot of scores VS the log of the number of times the score occured.

 6.85-|  *                                                                   
 6.49-|  *                                                                   
 6.13-|  *                                                                   
 5.77-|  *                                                                   
 5.41-|  **                                                                  
 5.04-| ***                                                                  
 4.68-| ***                                                                  
 4.32-| ***                                                                  
 3.96-| ***                                                                  
 3.60-| ***                                                                  
 3.24-| ***                                                                  
 2.88-| ***                                                                  
 2.52-| ****                                                                 
 2.16-| ****                                                                 
 1.80-| ****                                                                 
 1.44-| ****                                                                 
 1.08-| ****                                                                 
 0.72-| *****                                                                
 0.36-| *****     *                                                          
 0.00-| *****     *                                                          
      |----------------------------------------------------------------------
       ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       0     C                                                           -810
                           Scores (xunits=11.74)

The scores to the right of the main distribution are significant. However, this significance does not necessarily imply that the underlining alignments are correct with respect to frameshifts. The alignments should be examined carefully along with the original trace data to resolve discrepencies.

Appendicies

Nucleotides

              NAME                    CODE      MEANING

              Adenine                 A         A
              Cytosine                C         C
              Guanine                 G         G
              Thymine/Uracil          T/U       T/U
                                      M         A or C
                                      R         A or G
                                      W         A or T/U
                                      S         C or G
                                      Y         C or T/U
                                      K         G or T/U
                                      V         A or C or G
                                      H         A or C or T/U
                                      D         A or G or T/U
                                      B         C or G or T/U
                                      X/N       A or C or G or T/U


Amino Acids

              NAME                    3 LETTER  CODE

              Alanine                 Ala       A
              Cysteine                Cys       C
              Aspartic Acid           Asp       D
              Glutamic Acid           Glu       E
              Phenylalanine           Phe       F
              Glycine                 Gly       G
              Histidine               His       H
              Isoleucine              Ile       I
              Lysine                  Lys       K
              Leucine                 Leu       L
              Methionine              Met       M
              Asparagine              Asn       N
              Proline                 Pro       P
              Glutamine               Gln       Q
              Arginine                Arg       R
              Serine                  Ser       S
              Threonine               Thr       T
              Valine                  Val       V
              Tryptophan              Trp       W
              Tyrosine                Tyr       Y
              Aspartic/Asparagine     Asp,Asn   B
              Glutamic/Glutamine      Glu,Gln   Z
              Unknown                 Xxx       X




Pittsburgh Supercomputing Center National Resource for Biomedical Supercomputing
An NIH Supported Resource Center
300 S. Craig St. Pittsburgh, PA 15213. Phone: 412-268-4960, Email: biomed@psc.edu

Please send suggestions for improving this code and error reports to biomed@psc.edu.

biomed-www@psc.edu (last updated: 2/17/98)
© 1998 Pittsburgh Supercomputing Center