
Smith/Waterman searches: hmmsw

hmmsw (``HMM Smith/Waterman'') finds subsequences that match parts of the HMM, i.e. alignments that are local with respect to both the model and the sequence. This is the most useful program for searching protein databases: it finds significant matches to sequence fragments even if they match only part of the model. hmmsw reports only the single best match per target sequence; for multiple matches per sequence, see hmmfs.

The following command searches the seven globins in bashford.slx using the model globin2.hmm:

> hmmsw globin2.hmm bashford.slx

For each match, hmmsw prints the score, the start and end points of the matched subsequence, the start and end points on the model, and the name and description of the target sequence.

A good score is anything over zero. A significant score is anything over about 20. For more details, see ``What do HMM scores mean?'' in the Frequently Asked Questions chapter.

Note that the alignment file bashford.slx was used here in a context where unaligned sequences were expected. This always works: if you give an alignment file where an unaligned sequence file is expected, the sequences are read as unaligned. (One place where this might come up is when you specify the same file as both the seed alignment and the training sequences for hmmt.)

You can use -t <cutoff> to change the score threshold for reporting hits. If you set it to something very negative (say, -9999), you force hmmsw to print the best score for every sequence in the database. You might do this to generate a histogram of all the scores, for instance.
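
As an illustration of the histogram idea, here is a minimal Python sketch that bins a list of scores. It is not part of the HMMER package. It assumes you have already extracted the score of each hit from the hmmsw output into a list of numbers; the function names, the bin width of 5 and the example scores are all made up for this sketch.

```python
import math
from collections import Counter


def score_histogram(scores, bin_width=5.0):
    """Count scores per bin; each bin is keyed by its lower edge."""
    counts = Counter()
    for s in scores:
        # Floor division also places negative scores in the correct bin.
        lower = math.floor(s / bin_width) * bin_width
        counts[lower] += 1
    return dict(sorted(counts.items()))


def print_histogram(hist, bin_width=5.0):
    """Print one row of asterisks per occupied bin."""
    for lower, n in hist.items():
        print(f"{lower:8.1f} to {lower + bin_width:8.1f} | {'*' * n}")


# Hypothetical scores, NOT real hmmsw output.
example_scores = [-12.3, -8.1, -7.7, -3.0, 1.2, 4.9, 25.4, 31.0]
hist = score_histogram(example_scores)
print_histogram(hist)
```

With real data, a gap between the bulk of low scores and a few high-scoring bins is what you would hope to see.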

hmmsw corrects for the length of the model and the length of the target sequence when it calculates the scores. (One expects more chance hits to longer models and longer target sequences.) Since the scores are corrected, a significant score against a single sequence is still 0 bits or more, and a significant score against a database of sequences is still the log base 2 of the number of sequences in the database (about 16 bits, for the current size of the nonredundant SwissProt/PIR composite databases). However, be warned that hmmsw makes some assumptions in the process of doing this, and you would be well advised to do some empirical tests if your scores aren't clear-cut.
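
The database threshold described above is simply the base-2 logarithm of the number of sequences. The following Python sketch works through that arithmetic. The function name is hypothetical, and the database size of 65,536 is chosen only because it gives exactly 16 bits; it is not a count taken from any real database.

```python
import math


def database_threshold_bits(num_sequences):
    """Score (in bits) treated as significant against a database of this size,
    using the log-base-2 rule of thumb from the text."""
    if num_sequences < 1:
        raise ValueError("num_sequences must be at least 1")
    return math.log2(num_sequences)


# A single sequence: threshold of 0 bits.
single = database_threshold_bits(1)

# An illustrative database of 2**16 sequences: threshold of 16 bits.
database = database_threshold_bits(65536)

print(single, database)
```

Doubling the size of the database raises the threshold by one bit, so the threshold grows only slowly as databases get larger.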

A useful option to hmmsw is -F, which requests ``fancy'' output. hmmsw then prints an alignment of the match as well as its score, in a format similar to the format used by the BLAST programs. The HMM consensus is shown with capital letters for strongly conserved consensus positions and small letters for weakly conserved positions. A line between the HMM consensus and the target sequence shows letters for matches to the best possible symbol in the HMM and '+' characters for ``mismatches'' that nonetheless contribute a positive score.






Sean Eddy
Mon Apr 17 09:54:19 CDT 1995