NAME
hmmls - HMM local search - multiple-hit domain or repeat detection
SYNOPSIS
hmmls [options] hmmfile seqfile
DESCRIPTION
hmmls searches the sequence database seqfile for matches to the HMM in hmmfile. Each database sequence may contain multiple hits: the program reports an optimal set of nonoverlapping matching subsequences, so long as each one is a match to the full model.
hmmls is designed to be particularly useful for detecting complete repeated motifs in long sequences. It has been used successfully for Alu detection in human cosmid sequences, and for recognizing multiple tandem domains in mosaic proteins. However, it does not have the ability to recognize fragments of matches, such as partial Alus; use hmmfs for detecting partial as well as complete matches.
The memory requirement is linear in the length of the model and independent of the database sequence size, permitting effectively infinite length genomic DNA sequences (>> 1 Mb) to be searched. The program does this by establishing a smoothly scanning matrix "window" of some fixed length, instead of keeping the full alignment matrix in memory. The drawback of this approach is that if the length of a matched subsequence exceeds this window size, the program will fail because it is unable to reconstruct a full alignment. The default window size is 1000 residues, and it is settable with the -w option.
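The scanning window described above can be pictured with a short Python sketch. It is an illustration only, not the hmmls implementation: the class name ScanWindow, its methods, and the row layout are hypothetical. It shows why memory stays bounded by the window size times the model length, and why a match longer than the window can no longer be reconstructed.

```python
from collections import deque


class ScanWindow:
    """Hypothetical rolling buffer that keeps only the last `size` matrix rows."""

    def __init__(self, size, model_len):
        self.size = size
        self.model_len = model_len
        self.rows = deque(maxlen=size)  # the oldest row is dropped automatically
        self.first_pos = 0              # sequence position of the oldest kept row

    def push(self, row):
        """Add the matrix row for the next sequence position."""
        if len(row) != self.model_len:
            raise ValueError("row length must equal the model length")
        if len(self.rows) == self.size:
            self.first_pos += 1         # the deque is about to discard its oldest row
        self.rows.append(row)

    def can_trace(self, start, end):
        """True if the rows for positions start..end (inclusive) are all still in memory."""
        last_pos = self.first_pos + len(self.rows) - 1
        return start >= self.first_pos and end <= last_pos


window = ScanWindow(size=1000, model_len=3)
for pos in range(5000):
    window.push([0.0, 0.0, 0.0])        # placeholder scores, not real HMM values

print(len(window.rows))                 # rows held in memory, however long the sequence
print(window.can_trace(4200, 4999))     # an 800-residue match fits inside the window
print(window.can_trace(3500, 4999))     # a 1500-residue match has lost its early rows
```

In this sketch, raising the size argument plays the role of the -w option: a larger window tolerates longer matched subsequences at the cost of more memory.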
The scores reported by hmmls are in bits of information. Specifically, they are log-odds scores: the base-2 logarithm of the ratio of the probability of the sequence given the model to the probability of the sequence given a simple random sequence model. This score is related to the statistical significance of the alignment. A score of zero is marginal; according to the model's statistics, it is 50% likely that the alignment is a real match to the model, and 50% likely that it is not. The higher the score, the better; a score of 100 means that it is 2^100-fold more likely that the sequence is a match to the model than not. In practice, a database contains many more unrelated sequences than related ones, so the actual score required for statistical significance is somewhat higher than zero. As a rule of thumb for protein database searches, don't trust scores lower than the log2 of the number of sequences in the database. This is about 16 bits for our current SWIR5 composite protein database of 57,000 sequences. See the User Guide for more details.
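As a worked example of the arithmetic above, the following Python sketch computes a log-odds score in bits and the rule-of-thumb threshold for a database of a given size. The function names are hypothetical and are not part of hmmls; the probabilities passed in are made-up values chosen only to show the calculation.

```python
import math


def log_odds_bits(p_model, p_random):
    """Log-odds score in bits: log2 of P(seq | model) / P(seq | random model)."""
    return math.log2(p_model / p_random)


def rule_of_thumb_threshold(n_sequences):
    """Smallest score to trust, per the rule of thumb: log2 of the database size."""
    return math.log2(n_sequences)


# A sequence four times more probable under the model than under the random model.
print(log_odds_bits(0.5, 0.125))                  # 2.0 bits

# Threshold for a database of 57,000 sequences.
print(round(rule_of_thumb_threshold(57000), 1))   # 15.8, i.e. about 16 bits
```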
Note the differences between the different HMM searching programs. hmms looks for a global alignment of HMM to sequences (Needleman/Wunsch style); overhangs of unmatched sequence are not permitted, and the full model must be matched. hmmls looks for one or more local alignments of the full model to a subsequence of each database sequence; unmatched sequence overhangs are allowed. hmmsw looks for the best fragmentary match of a subsequence to part of the model (Smith/Waterman style). hmmfs looks for multiple non-overlapping matches of subsequences to parts of the model (modified multiple-hit Smith/Waterman).
hmms is useful for scoring or detecting whole proteins with complete models; hmmls is useful for scoring or detecting intact domains in protein sequences or complete repeats in nucleic acid sequences; hmmsw is useful for general protein database searching, allowing for possible incomplete matches to a model; hmmfs is useful for general nucleic acid database searching or for scoring/detecting domains in multidomain proteins, allowing for multiple non-overlapping matches per sequence.
Individual man pages: hmma(l), hmmb(l), hmme(l), hmmfs(l), hmms(l), hmmsw(l), hmmt(l), hmm-convert(l)
User guide and tutorial: Userguide.ps
BUGS
No major bugs known.
Not very tolerant of errors on the command line.
If two hits overlap even by only a few positions, one will be filtered out. This is sometimes undesirable, especially when one is studying tandemly repeated domains or sequences. Use hmmfs instead.
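To make the overlap behaviour concrete, here is a minimal Python sketch of one way such a filter could work. It is an assumption made for illustration: this page does not say which of two overlapping hits is kept, and the function filter_overlaps, the (start, end, score) tuple layout, and the keep-the-higher-score policy are all hypothetical.

```python
def filter_overlaps(hits):
    """Keep hits greedily by descending score, dropping any that overlap a kept hit.

    `hits` is a list of (start, end, score) tuples with inclusive coordinates.
    The keep-the-higher-score policy is an assumption, not documented behaviour.
    """
    kept = []
    for start, end, score in sorted(hits, key=lambda h: h[2], reverse=True):
        # Two inclusive ranges are disjoint only if one ends before the other starts.
        if all(end < k_start or start > k_end for k_start, k_end, _ in kept):
            kept.append((start, end, score))
    return sorted(kept)


# Two tandem repeats that share three positions, plus a clearly separate third hit.
hits = [(1, 100, 50.0), (98, 200, 40.0), (201, 300, 30.0)]
print(filter_overlaps(hits))
```

Under this policy the second repeat is lost even though it overlaps the first by only three positions, which is the situation the note above warns about.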
NOTES
This software and documentation is Copyright (C) 1992-1995, Sean R. Eddy. It is freely distributable under terms of the GNU General Public License. See COPYING, in the source code distribution, for more details, or contact me.
Sean Eddy
Dept. of Genetics, Washington Univ. School of Medicine 660 S. Euclid Box 8232
St Louis, MO 63110 USA
Phone: 1-314-362-7666
FAX : 1-314-362-2985
Email: eddy@genetics.wustl.edu