NAME
hmmt - hidden Markov model training for biological sequences
SYNOPSIS
hmmt [options] hmmfile seqfile
DESCRIPTION
hmmt attempts to learn the pattern shared by the multiple sequences in seqfile, and saves a description of the pattern in hmmfile.
seqfile may contain RNA, DNA, or amino acid sequences (don't mix them, though). It may be in any one of several common sequence file formats, including EMBL, GenBank, and FASTA. The easiest to type in yourself is FASTA format: each sequence is a line starting with > that contains the name (one word) and an optional description of the sequence, followed by one or more lines of sequence.
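As an illustration of the FASTA layout described above, the following Python sketch writes a small FASTA file and reads it back. It is not part of hmmt: the helper names write_fasta and read_fasta are hypothetical, and the sequence names, descriptions, and residues are made up for the example.

```python
import io

def write_fasta(records, handle, width=60):
    """Write (name, description, sequence) tuples in FASTA format."""
    for name, description, sequence in records:
        # Header line: '>' + one-word name + optional description.
        header = f">{name} {description}".rstrip()
        handle.write(header + "\n")
        # One or more lines of sequence, wrapped at `width` characters.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")

def read_fasta(handle):
    """Parse FASTA text into a list of (name, description, sequence) tuples."""
    records = []
    name, description, chunks = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            parts = line[1:].split(None, 1)
            name = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records

# Made-up example sequences, for illustration only.
example = [
    ("seq1", "first made-up protein", "MKVLAAGIVGLLLAQ"),
    ("seq2", "", "MKILSAGVVGLL"),
]
buffer = io.StringIO()
write_fasta(example, buffer, width=10)
fasta_text = buffer.getvalue()
parsed = read_fasta(io.StringIO(fasta_text))
```

Here fasta_text holds the file contents that could be saved as a seqfile, and parsed recovers the original records.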
The output file is not meant to be directly examined by the user; it is used as input to other hidden Markov modeling programs that do multiple sequence alignment [hmma] and database searching [hmmfs, hmmls, hmms, hmmsw].
hmmt works iteratively: it calculates a new multiple sequence alignment using the current model, then estimates a new model given that alignment, and repeats. A simulated annealing protocol is used to avoid bad local minima in the iterative (expectation maximization) training procedure.
"Simulated annealing" is a well-known method for avoiding obvious local minima in an optimization problem. hmmt uses a theoretically rigorous method to sample suboptimal alignments according to a "temperature" factor measured in units of the Boltzmann factor k; the higher the temperature, the more random the alignment. A temperature factor of 1.0 is equivalent to sampling alignments exactly according to their probability. The default parameters of simulated annealing work well. If you are unhappy with them, though, you can set a starting temperature with the -k option; 5 to 10 is a good choice (default is 5). You may also set a ramp factor with -r; by default, this is set to 0.95, which means the temperature will be decreased to 95% of its current value at each iteration.
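The effect of the starting temperature and ramp factor can be sketched in Python. This is only an illustration under stated assumptions, not hmmt's implementation: the function names are hypothetical, the stop_temp cutoff is an assumption made so the loop terminates, and raising probabilities to the power 1/T is one common way to realize a temperature such that T = 1.0 samples exactly according to probability and higher T gives more random choices.

```python
def annealing_schedule(start_temp=5.0, ramp=0.95, stop_temp=1.0):
    """List temperatures start_temp, start_temp*ramp, ... while >= stop_temp.

    start_temp and ramp mirror the documented defaults of -k and -r.
    stop_temp is an assumption of this sketch, not a documented hmmt rule.
    """
    if not 0.0 < ramp < 1.0:
        raise ValueError("ramp must be between 0 and 1")
    temps = []
    temp = start_temp
    while temp >= stop_temp:
        temps.append(temp)
        temp *= ramp  # decrease to `ramp` times the current value
    return temps

def tempered_probabilities(probs, temp):
    """Rescale a discrete distribution as p**(1/temp) and renormalize.

    Assumed form of the temperature effect: temp == 1.0 leaves the
    distribution unchanged; larger temp flattens it toward uniform.
    """
    weights = [p ** (1.0 / temp) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

schedule = annealing_schedule()            # defaults: 5.0, ramp 0.95
hot = tempered_probabilities([0.7, 0.2, 0.1], 5.0)
cold = tempered_probabilities([0.7, 0.2, 0.1], 1.0)
```

In this sketch, hot is noticeably flatter than cold, which corresponds to the more random alignments sampled early in training.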
Besides simulated annealing, two other training algorithms are available. -v toggles standard Viterbi approximation to Baum-Welch expectation maximization. As a training algorithm, it is fast, but prone to serious local minimum problems (it makes bad models unless you've provided a good starting hint at the alignment). -B toggles full Baum-Welch expectation maximization. Full Baum-Welch is slow and usually not quite as good as simulated annealing.
By default, the starting model is a model with uniform state transition and symbol emission probabilities, with length equal to the average length of the sequences in seqfile. A different starting model may be provided as a hint, using the -i option. A common procedure would be to build a hint HMM from an alignment of a small number of sequences, then give that model with -i to hmmt for training on a larger number of sequences. Important: simulated annealing works by initially "melting" the starting alignment, so providing a hint to simulated annealing has no effect. -i is only useful for -v or -B training, or perhaps if the initial temperature of simulated annealing is reduced (see the -k option).
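The default starting model described above can be pictured with a small Python sketch. It is a simplified illustration, not hmmt's data structure: the function name and dictionary layout are hypothetical, rounding the average length to the nearest integer is an assumption, and the three-way match/insert/delete transition set is an assumed profile HMM architecture.

```python
def flat_start_model(sequences, alphabet):
    """Build a uniform ("flat") model sized to the average sequence length."""
    if not sequences:
        raise ValueError("need at least one training sequence")
    # Model length: average sequence length (rounding is an assumption).
    average = sum(len(s) for s in sequences) / len(sequences)
    length = max(1, round(average))
    # Uniform symbol emission probabilities over the alphabet.
    emission = {symbol: 1.0 / len(alphabet) for symbol in alphabet}
    # Uniform state transitions (assumed three outgoing transitions per state).
    transition = {move: 1.0 / 3.0 for move in ("match", "insert", "delete")}
    return {
        "length": length,
        "emissions": [dict(emission) for _ in range(length)],
        "transitions": [dict(transition) for _ in range(length)],
    }

# Made-up DNA training sequences of lengths 4, 5, and 6.
model = flat_start_model(["ACGT", "ACGTA", "ACGTAC"], "ACGT")
```

Because such a model carries no information about the sequences beyond their average length, it gives Viterbi or Baum-Welch training nothing to start from, which is why a hint model supplied with -i matters for those algorithms.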
Another training option is "constrained simulated annealing". If you know the structure of some of your training sequences, you can construct a structural alignment of them and keep the rest of the homologues in a separate file. Using the -a option, hmmt can combine both a known multiple alignment and a set of unaligned homologues into a single training set. The alignment will remain fixed throughout the training process, while the homologues are aligned to it. The -o option should be used to save the final alignment if you desire.
OPTIONS
-a <alignfile>
Include the multiple alignment in alignfile, which must be in SELEX or GCG MSF alignment format, in the training set. This alignment is not allowed to change during training. The sequences in the alignment are in addition to the unaligned training set; you must remove duplicate sequences between the two files beforehand.
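Removing duplicates before using -a could be done along the following lines. This Python sketch assumes that two entries are duplicates when they share the same sequence name, which may not match your data; the function name and the example names are hypothetical, and reading the actual alignment and sequence files is left out.

```python
def drop_aligned_duplicates(unaligned, aligned_names):
    """Keep unaligned (name, sequence) records whose name is not in the alignment.

    Duplicates are detected by name only (an assumption of this sketch).
    """
    aligned = set(aligned_names)
    return [(name, seq) for name, seq in unaligned if name not in aligned]

# Made-up names and sequences, for illustration only.
unaligned = [("seq1", "MKVLAAG"), ("seq2", "MKILSAG"), ("seq3", "MRVLSAG")]
aligned_names = ["seq2", "seq9"]
kept = drop_aligned_duplicates(unaligned, aligned_names)
```

The records in kept would then be written out as the seqfile, while the alignment file is passed unchanged with -a.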
The PAM file may be either a BLOSUM matrix (Henikoff and Henikoff, PNAS 89:10915-10919, 1992) or a PAM matrix produced by the NCBI "pam" program that comes with BLAST (Altschul et al., JMB 215:403-410, 1990). Other matrices may be used if they conform to one of these formats.
Individual man pages: hmma(l), hmmb(l), hmme(l), hmmfs(l), hmmls(l), hmms(l), hmmsw(l), hmm-convert(l)
User guide and tutorial: Userguide.ps
BUGS
No major bugs known.
Not very tolerant of errors on the command line.
NOTES
This software and documentation is Copyright (C) 1992-1995, Sean R. Eddy. It is freely distributable under the terms of the GNU General Public License. See COPYING, in the source code distribution, for more details, or contact me.
Sean Eddy
Dept. of Genetics, Washington Univ. School of Medicine
660 S. Euclid Box 8232
St Louis, MO 63110 USA
Phone: 1-314-362-7666
FAX : 1-314-362-2985
Email: eddy@genetics.wustl.edu