FASTA/TFASTA/FASTX/TFASTX Users Manual

Disclaimer

Copyright 1988, 1991, 1992, 1993, 1994, 1995 by William R. Pearson and the University of Virginia. All rights reserved. The FASTA program and documentation may not be sold or incorporated into a commercial product, in whole or in part, without written consent of William R. Pearson and the University of Virginia. For further information regarding permission for use or reproduction, please contact:

David Hudson
Assistant Provost for Research
University of Virginia
P.O. Box 9025
Charlottesville, VA 22906-9025
(804) 924-6853



NAME

fasta3(_t) - scan a protein or DNA sequence library for similar sequences

tfasta3(_t) - compare a protein sequence to a DNA sequence library, translating the DNA sequence library `on-the-fly'.

fastx3(_t) - scan a protein database using a translated DNA query

tfastx3(_t) - scan a DNA database, translating the DNA sequence `on-the-fly' and allowing frameshifts

SYNOPSIS

fasta3 [-a -A -b # -c # -d # -E # -f # -g # -H -i -k # -l file -L FASTLIBS -r STATFILE -m # -o -O file -p # -Q -s SMATRIX -w # -x "# #" -y # -z -1 ] query-sequence-file library-file [ ktup ]

fasta3 [-QaAbcdEfgHiklmnoOprswxyz] query-file @library-name-file

fasta3 [-QaAbcdEfgHiklmnoOprswxyz] query-file "%PRMVI"

fasta3 [-aAbcdEgHlmnoOprswyx] - interactive mode

fastx3 [-aAbcdEfghHilmnoOprswyx] DNA-query-file protein-library [ ktup ]

tfasta3 [-aAbcdEfgkmoOprswy3] protein-query-file DNA-library [ ktup ]

tfastx3 [-abcdEfghHikmoOprswy3] protein-query-file DNA-library [ ktup ]

DESCRIPTION

fasta3 is used to compare a protein or DNA sequence to all of the entries in a sequence library. For example, fasta3 can compare a protein sequence to all of the sequences in the NBRF PIR protein sequence database. fasta3 will automatically decide whether the query sequence is DNA or protein by reading the query sequence as protein and determining whether the `amino-acid composition' is more than 85% A+C+G+T. fasta3 uses an improved version of the rapid sequence comparison algorithm described by Lipman and Pearson (Science, (1985) 227:1427); the improved version is described in Pearson and Lipman, Proc. Natl. Acad. Sci. USA, (1988) 85:2444. The program can be invoked either with command line arguments or in interactive mode. The optional third argument, ktup, sets the sensitivity and speed of the search. If ktup=2, similar regions in the two sequences being compared are found by looking at pairs of aligned residues; if ktup=1, single aligned amino acids are examined. ktup can be set to 2 or 1 for protein sequences, or from 1 to 6 for DNA sequences. If ktup is not specified, the default is 2 for proteins and 6 for DNA.
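
The DNA/protein decision described above can be illustrated with a minimal Python sketch. This is not the fasta3 source code: the function name looks_like_dna and the choice to count only alphabetic characters are assumptions made for illustration; only the 85% A+C+G+T threshold comes from the text.

```python
def looks_like_dna(sequence: str, threshold: float = 0.85) -> bool:
    """Guess whether a sequence is DNA from its A+C+G+T fraction.

    Hypothetical helper: only the 85% threshold is taken from the manual.
    """
    # Assumption: non-alphabetic characters (digits, blanks) are not residues.
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for c in residues if c in "ACGT")
    # "More than 85%" is read here as a strict inequality.
    return acgt / len(residues) > threshold


print(looks_like_dna("ACGTACGTAC"))   # all residues are A, C, G or T
print(looks_like_dna("MKTAYIAKQR"))   # mostly other amino-acid codes
```

A short protein that happens to be rich in A, C, G and T would be misclassified by a rule of this kind, which is presumably why the -n option exists to force a DNA query.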

fasta3 compares a query sequence to a sequence library which consists of sequence data interspersed with comments (see below). Normally fasta3, fastx3, tfasta3, and tfastx3 search the libraries listed in the file pointed to by the environment variable FASTLIBS. The format of this file is described in the file FASTA.DOC. tfasta3 and tfastx3 compare a protein sequence to a DNA sequence database, translating the DNA sequence library in 6 frames `on-the-fly' (3 frames with the -3 option). The search uses the standard BLOSUM50 scoring matrix, and uses ktup=2 by default. tfasta3 searches a DNA sequence database in the standard text format described below. tfastx3, like tfasta3, compares a protein sequence to a DNA sequence library. However, tfastx3 compares the protein sequence to the forward and reverse three-frame translation of the DNA library sequence, allowing for frameshifts. fastx3 compares a DNA sequence to a protein sequence database, translating the DNA sequence in three frames and allowing frameshifts in the alignment. fasta3, fastx3, and tfasta3 report only the best alignment between the query sequence and the library sequence.
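
To make the idea of translating a DNA sequence in six frames concrete, here is a minimal Python sketch. It is not how tfasta3 or tfastx3 are implemented, and it ignores frameshifts entirely; the function names (translate, frames) and the use of "X" for codons that cannot be translated are assumptions. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse-complement strand of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def frames(dna: str, forward_only: bool = False) -> list:
    """Three forward frames, followed by three reverse frames unless disabled."""
    dna = dna.upper()
    strands = [dna] if forward_only else [dna, reverse_complement(dna)]
    return [translate(s[offset:]) for s in strands for offset in range(3)]


for frame in frames("ATGGCCTAA"):
    print(frame)
```

Calling frames(dna, forward_only=True) corresponds loosely to the three-frame search selected with the -3 option.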

The fasta3 programs use a standard text format sequence file. Lines beginning with '>' or ';' are considered comments and ignored; sequences can be upper or lower case; blanks, tabs and unrecognizable characters are ignored. fasta3 expects sequences to use the single letter amino acid codes. With version 3, the programs can also read query sequences in common GCG sequence file formats. In addition, one can specify a search with a subset of a query sequence:

Library files for fasta3 should have one of the forms shown below.

The _t versions of the programs are threaded and will perform searches in parallel on multiprocessor machines supporting pthreads.

OPTIONS

fasta3 and the other programs can be directed to change the scoring matrix, search parameters, output format, and default search directories by entering options on the command line (preceded by a `-', or by a `/' for MS-DOS). All of the options should precede the file name and ktup arguments. Alternatively, these options can be changed by setting environment variables. The options and environment variables are:

-
when used as the query sequence file name, accepts the query sequence from STDIN. When this option is used, the program cannot automatically distinguish DNA from protein, so -n is required to specify a DNA query sequence with -.

-1
Normally, the top scoring sequences are ranked by the z-score based on the opt score. To rank sequences by raw scores, use the -z 0 option. With the -1 option, sequences are ranked by the z-score based on the init1 score.

-a (SHOWALL)
Modifies the display of the two sequences in alignments. Normally, both sequences are shown only where they overlap (SHOWALL=0); if -a is given or the environment variable SHOWALL = 1, both sequences are shown in their entirety.

-A
Force use of unlimited Smith-Waterman alignment for DNA fasta3 and tfasta3. By default, the program uses the older (and faster) band-limited Smith-Waterman alignment for DNA fasta3 and tfasta3 alignments.

-b #
The number of similarity scores to be shown when the -Q option is used. This value is usually calculated based on the actual scores.

-c # (OPTCUT)
The initn score threshold above which library sequences are optimized (see the -o option). The OPTCUT value is normally calculated based on sequence length.

-d #
The number of alignments to be shown. Normally, fasta3 shows the same number of alignments as similarity scores. By using fasta3 -Q -b 200 -d 50, one would see the top scoring 200 sequences and alignments for the 50 best scores.

-E #
The expectation value threshold for displaying similarity scores and sequence alignments. fasta3 -Q -E 2.0 would show all library sequences with scores expected to occur no more than 2 times by chance in a search of the library.

-f #
Penalty for the first residue in a gap (-12 by default for fasta3 with proteins, -16 for DNA).

-g #
Penalty for additional residues in a gap (-2 by default for fasta3 with proteins, -4 for DNA).
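
As a worked illustration of how the -f and -g values combine, the following Python sketch computes the total penalty for a gap of a given length, assuming the first gapped residue costs the -f value and each further residue costs the -g value. The function name gap_score is hypothetical; the default values are the protein defaults quoted above.

```python
def gap_score(length: int, first: int = -12, extend: int = -2) -> int:
    """Total penalty for a gap of `length` residues.

    first  -- penalty for the first residue in the gap (-f)
    extend -- penalty for each additional residue (-g)
    """
    if length <= 0:
        return 0
    return first + (length - 1) * extend


print(gap_score(1))            # one-residue gap, protein defaults
print(gap_score(3))            # three-residue gap, protein defaults
print(gap_score(3, -16, -4))   # three-residue gap, DNA defaults
```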

-h #
(fastx3, tfastx3 only) penalty for a +1 or -1 frameshift.

-H
Do not display histogram of similarity scores.

-i
(fasta3, fastx3) search with the reverse-complement of the query DNA sequence.

-k # (GAPCUT)
Sets the threshold for joining the initial regions for calculating the initn score.

-l file
(FASTLIBS) The name of the library menu file. Normally this will be determined by the environment variable FASTLIBS. However, a library menu file can also be specified with -l.

-L
Display more information about the library sequence in the alignment.

-m # (MARKX) =0,1,2,3,4,5,6,10.
Alternate display of matches and mismatches in alignments. MARKX=0 uses ":", ".", and " " for identities, conservative replacements, and non-conservative replacements, respectively. MARKX=1 uses " ", "x", and "X". MARKX=2 does not show the second sequence, but uses the second alignment line to display matches with a "." for identity, or with the mismatched residue for mismatches. MARKX=2 is useful for aligning large numbers of similar sequences. MARKX=3 writes out a file of library sequences in FASTA3 format. MARKX=3 should always be used with the "SHOWALL" (-a) option, but this does not completely ensure that all of the sequences output will be aligned. MARKX=4 displays a graph of the alignment of the library sequence with respect to the query sequence, so that one can identify the regions of the query sequence that are conserved. MARKX=5 combines MARKX=0 and MARKX=4, so that graphs and the alignments themselves are displayed. MARKX=6 is similar to MARKX=5, except that html (hypertext markup language) tags are added to aid in navigating the output with a WWW browser. MARKX=10 is used to produce a parseable output format.
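
The MARKX=2 display rule can be sketched in a few lines of Python. This only illustrates the rule as described above, not the program's output routine: the function name markx2_line is hypothetical, and the treatment of gap characters (a '-' in the library sequence is simply printed as a mismatch) is an assumption.

```python
def markx2_line(query_aln: str, library_aln: str) -> str:
    """Build the second alignment line in the style described for MARKX=2.

    Both arguments are rows of one alignment and must have the same length.
    Identical positions become '.', anything else shows the library residue.
    """
    if len(query_aln) != len(library_aln):
        raise ValueError("aligned sequences must have the same length")
    return "".join("." if q == s else s
                   for q, s in zip(query_aln, library_aln))


query = "MKT-AY"
library = "MRTQAY"
print(query)
print(markx2_line(query, library))
```

Because identical residues collapse to ".", stacking many such lines under one query makes the variable positions easy to see, which matches the stated use for large numbers of similar sequences.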

-n
Forces the query sequence to be treated as a DNA sequence.

-O filename
Send a copy of the results to "filename".

-o
Turns off default fasta limited optimization on all of the sequences in the library with initn scores greater than OPTCUT. This option is now the reverse of previous versions of fasta3.

-Q
Quiet option. This allows fasta3 and tfasta3 to search a database and report the results without asking any questions. fasta3 -Q file library > output can be put in the background or run at a later time with the unix 'at' command. The number of similarity scores and alignments displayed with the -Q option can be modified with the -b (scores) and -d (alignments) options.
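
For unattended runs driven from a script, the -Q, -b and -d options can be assembled programmatically. The Python sketch below is one possible wrapper, not part of the FASTA distribution: the helper names build_fasta3_command and run_quiet are hypothetical, and it assumes a fasta3 executable is on the PATH. Only flags documented in this manual are used.

```python
import shlex
import subprocess


def build_fasta3_command(query, library, ktup=None, scores=None,
                         alignments=None, program="fasta3"):
    """Return an argument list for a non-interactive (-Q) search."""
    args = [program, "-Q"]
    if scores is not None:
        args += ["-b", str(scores)]        # number of similarity scores
    if alignments is not None:
        args += ["-d", str(alignments)]    # number of alignments
    # Options precede the file names; ktup, if given, comes last.
    args += [query, library]
    if ktup is not None:
        args.append(str(ktup))
    return args


def run_quiet(cmd, output_path):
    """Run the search and write its standard output to a file."""
    with open(output_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


cmd = build_fasta3_command("musplfm.aa", "lcbo.aa", ktup=1,
                           scores=200, alignments=50)
print(shlex.join(cmd))
```

run_quiet(cmd, "output") would then play the role of the shell redirection shown above; it is defined but not called here so that the sketch runs without fasta3 installed.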

-r STATFILE
Causes fasta3 to write out the sequence identifier, superfamily number (if available), and similarity scores to STATFILE for every sequence in the library. These results are not sorted.

-s str
(SMATRIX) The filename of an alternative scoring matrix file. For protein sequences, BLOSUM50 is used by default; several alternate matrices are available: -s P250 (PAM250); -s P120 (PAM120); -s BS62 (BLOSUM62).

-t Translation table
tfasta3, fastx3, and tfastx3 now support the BLAST translation tables. See http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c/.

-w # (LINELEN)
Output line length for sequence alignments (normally 60, can be set up to 200).

-x "offset1 offset2"
Causes fasta3 to start numbering the aligned sequences starting with offset1 and offset2, rather than 1 and 1. This is particularly useful for showing alignments of promoter regions.

-y
Set the band-width used for optimization. -y 16 is the default for protein when ktup=2 and for all DNA alignments. -y 32 is used for protein and ktup=1. For proteins, optimization slows comparison 2-fold and is highly recommended.

-z
Select the statistical normalization strategy. -z 0 causes results to be ranked by the unnormalized opt, initn, or init1 score. -z 1 is the default, which uses scores weighted by regression of the mean (unrelated sequence) similarity score vs length of the library sequence. -z 2 uses the older log-length scaled scores (provided for historical consistency only - not recommended). -z 3 scales using the Altschul-Gish K parameters (Altschul and Gish, Meth. Enz. (1996) 266:460-480). -z 4 estimates significance from the mean and standard deviation of unscaled scores.
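
As a rough illustration of what estimating significance from the mean and standard deviation of unscaled scores can look like, the Python sketch below converts raw scores to standard scores. It does not reproduce the program's actual procedure: the function name z_scores is hypothetical, and the use of the population standard deviation over all scores is an assumption.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Express each raw score as a number of standard deviations from the mean."""
    mu = mean(scores)
    sigma = pstdev(scores)   # assumption: population standard deviation
    if sigma == 0:
        return [0.0] * len(scores)
    return [(s - mu) / sigma for s in scores]


raw = [10, 20, 30]
for score, z in zip(raw, z_scores(raw)):
    print(score, round(z, 3))
```

A library sequence whose score lies many standard deviations above the mean is unlikely to be an unrelated sequence, which is the intuition behind ranking by normalized scores instead of raw ones.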

-3
(tfasta3, tfastx3 only) Normally tfasta3 and tfastx3 translate sequences in the DNA sequence library in all six frames. With the -3 option, only the three forward frames are searched.

EXAMPLES

(1) fasta3 musplfm.aa $AABANK
Compare the amino acid sequence in the file musplfm.aa with the complete PIR protein sequence library using ktup = 2. Each "library" sequence (there need only be one) should start with a comment line which starts with a '>', e.g.
  >LCBO bovine preprolactin
  WILLLSQ ...
  >LCHU human ...
  ...

(2) fasta3 -a -w 80 musplfm.aa lcbo.aa 1
Compare the amino acid sequence in the file musplfm.aa with the sequences in the file lcbo.aa using ktup = 1. Show both sequences in their entirety, with 80 residues on each output line.

(3) fasta3
Run the fasta3 program in interactive mode. The program will prompt for the file name for the query sequence, list alternative libraries to be searched (if FASTLIBS is set), and prompt for the ktup.

FILES

This version of fasta3 prompts for the library file to be searched from a list of file names that are saved in the file pointed to by the environment variable FASTLIBS. If FASTLIBS = fastgb.list, then the file fastgb.list might have the entries:
  NBRF Protein$0P/u/lib/aabank.lib 0
  GB Primate$1P@/u/lib/gpri.nam
  GB Rodent$1R@/u/lib/grod.nam
  GB Mammal$1M@/u/lib/gmammal.nam
Each line in this file has 4 fields: (1) the library name, separated from the remaining fields by a '$'; (2) a 0 or a 1, indicating a protein or DNA library respectively; (3) a single letter that will be used to choose the library; (4) the location of the library file itself. The library file name can be followed by an optional library format specifier. fasta3 recognizes the following library formats: 0 - Pearson/FASTA; 1 - Genbank flat file; 2 - NBRF/PIR Codata; 3 - EMBL/SWISS-PROT; 4 - Intelligenetics; 5 - NBRF/PIR VMS. Note that this fourth field can contain an '@' character, which indicates that the library file is an indirect library file containing a list of library files, one per line. An indirect library file might have the lines:
  </usr/slib/genbank      (the directory for the library files)
  gbpri.seq 1
  gbrod.seq 1
  gbmam.seq 1
  ...
  gbvrl.seq 1
  ...
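
The four-field layout of a FASTLIBS entry can be made explicit with a small parser. The Python sketch below is a reading of the field description and sample entries above, not code taken from fasta3: the names LibraryEntry and parse_fastlibs_line are hypothetical, and it assumes that the library path contains no spaces and that a trailing integer is the format specifier.

```python
from typing import NamedTuple, Optional


class LibraryEntry(NamedTuple):
    name: str            # field 1: text before the '$'
    is_dna: bool         # field 2: '0' = protein, '1' = DNA
    key: str             # field 3: single letter used to choose the library
    path: str            # field 4: library file location
    indirect: bool       # True if the location was prefixed with '@'
    fmt: Optional[int]   # optional library format specifier


def parse_fastlibs_line(line: str) -> LibraryEntry:
    """Split one FASTLIBS entry into its fields."""
    name, sep, rest = line.rstrip("\n").partition("$")
    if not sep or len(rest) < 3 or rest[0] not in "01":
        raise ValueError("not a FASTLIBS entry: %r" % line)
    is_dna = rest[0] == "1"
    key = rest[1]
    location = rest[2:].strip()
    indirect = location.startswith("@")
    if indirect:
        location = location[1:]
    # Assumption: the path has no spaces, so a space separates the specifier.
    path, _, fmt_text = location.partition(" ")
    fmt = int(fmt_text) if fmt_text.strip().isdigit() else None
    return LibraryEntry(name, is_dna, key, path, indirect, fmt)


print(parse_fastlibs_line("NBRF Protein$0P/u/lib/aabank.lib 0"))
print(parse_fastlibs_line("GB Primate$1P@/u/lib/gpri.nam"))
```
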
You can use your own sequence files for fasta3; just be certain to put a '>' and comment as the first line before the sequence.
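
To check that a sequence file of your own follows the expected text format, a small reader along the following lines may help. It is a Python sketch of the rules stated in the DESCRIPTION section ('>' starts an entry, ';' lines are comments, blanks and tabs are ignored), not the parser used by fasta3: the function name read_fasta is hypothetical, and keeping only letters and '*' as sequence characters is an assumption.

```python
def read_fasta(lines):
    """Return a list of (comment, sequence) pairs from FASTA-format lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new entry begins; store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif line.startswith(";"):
            continue  # comment line, ignored
        elif header is not None:
            # Assumption: only letters and '*' count as sequence characters.
            chunks.append("".join(c for c in line if c.isalpha() or c == "*"))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = [
    ">LCBO bovine preprolactin\n",
    "WILL LSQ\n",
    "; a comment line\n",
    ">X test entry\n",
    "ac\tgt 12\n",
]
for comment, sequence in read_fasta(example):
    print(comment, sequence)
```

Text that appears before the first '>' line is dropped by this sketch, which is one reason to make sure the '>' comment line really is the first line of the file.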

AUTHOR

Bill Pearson wrp@virginia.EDU