
What HMMs are

Hidden Markov models (HMMs) are statistical models of the primary structure consensus of a sequence family. Anders Krogh, David Haussler, and co-workers at UC Santa Cruz introduced a form of HMM that is well suited to protein and DNA sequence analysis [10], adopting HMM techniques that had been used for years in speech recognition. HMMs had been used in biology before -- notably, for modeling protein structure [16] -- but the Krogh paper had a particularly dramatic impact. Since then, several computational biology groups (including ours) have rapidly adopted HMMs as the underlying formalism for problems involving primary sequence consensus, such as multiple sequence alignment and sensitive database searching.

Krogh et al. developed HMMs that are similar to ``profiles'' [7,6], ``flexible patterns'' [3], and ``templates'' [4,17]. All of these are statistical descriptions of the consensus of a multiple sequence alignment. They use position-specific scores for amino acids (or nucleotides) and position-specific scores for opening and extending an insertion or deletion. In contrast, traditional pairwise alignment (for example, using BLAST [2], FASTA [12], or the Smith/Waterman algorithm [15]) uses position-independent scoring parameters. Position-specific scoring lets HMMs, profiles, and their kin capture important information about the degree of conservation at each position in the multiple alignment, and the varying degree to which gaps and insertions are permitted. HMM- or profile-based methods typically outperform pairwise methods in both alignment accuracy and database search sensitivity and specificity.
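To make the idea of position-specific scoring concrete, here is a minimal sketch. It is not the scoring scheme of any particular profile or HMM package. The toy alignment, the uniform background distribution, and the add-one pseudocount are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy alignment; each string is one aligned sequence.
ALIGNMENT = ["ACGT", "ACGA", "ACCT", "ATGT"]
ALPHABET = "ACGT"
# Assumed uniform background frequencies.
BACKGROUND = {b: 0.25 for b in ALPHABET}


def column_scores(alignment, pseudocount=1.0):
    """Return one log-odds score table (in bits) per alignment column."""
    tables = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        tables.append({
            b: math.log2(((counts[b] + pseudocount) / total) / BACKGROUND[b])
            for b in ALPHABET
        })
    return tables


def score(seq, tables):
    """Score an ungapped sequence with the same length as the model."""
    return sum(table[b] for b, table in zip(seq, tables))


TABLES = column_scores(ALIGNMENT)
```

In this sketch the same residue earns different scores at different positions: `TABLES[0]["A"]` is positive because column 1 is perfectly conserved as A, while `TABLES[3]["A"]` is zero because A is no more common than background in column 4. A position-independent scheme would assign A one score everywhere.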

The advantage of HMMs over these other methods is that HMMs have a formal probabilistic basis. We can use Bayesian probability theory to guide how all the probability (scoring) parameters should be set. Though this sounds like a purely academic issue of mathematical beauty, this probabilistic basis lets us do things that the more heuristic methods cannot do. One example is that an HMM can be trained from unaligned sequences, if a trusted alignment isn't yet known. Another is that HMMs have a consistent theory behind gap and insertion scores. In most details, HMMs are a slight improvement over a carefully constructed profile -- but far less skill and manual intervention are necessary to train a good HMM and use it.
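One way to see what a consistent theory of gap scores can mean: in a probabilistic model, the cost of an insertion follows from state transition probabilities that must sum to one, rather than from independently chosen penalties. The sketch below assumes a single model position with match (M), insert (I), and delete (D) states. The probability values are invented for illustration, and emission probabilities of inserted residues are ignored to keep the example short.

```python
import math

# Hypothetical transition probabilities at one model position (assumed values).
# The outgoing probabilities of each state must sum to 1.
TRANSITIONS = {
    "M": {"M": 0.90, "I": 0.05, "D": 0.05},
    "I": {"M": 0.60, "I": 0.40},
}


def check_normalized(transitions, tol=1e-9):
    """True if every state's outgoing probabilities sum to 1."""
    return all(abs(sum(out.values()) - 1.0) < tol
               for out in transitions.values())


def insertion_log_prob(k, transitions=TRANSITIONS):
    """Log2 probability of the state path for an insertion of k >= 1 residues.

    The path is M->I once, I->I (k - 1) times, then I->M once.
    """
    if k < 1:
        raise ValueError("insertion length must be at least 1")
    t = transitions
    return (math.log2(t["M"]["I"])
            + (k - 1) * math.log2(t["I"]["I"])
            + math.log2(t["I"]["M"]))
```

Each additional inserted residue changes the log probability by the same amount, log2 of the I->I probability, so an affine open/extend gap cost falls out of the transition probabilities. Because these probabilities can differ at every model position, so can the gap costs.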

HMMs do have important limitations. The biggest is that HMMs do not capture any higher-order correlations. An HMM assumes that the identity of a particular position is independent of the identity of all other positions. HMMs make poor models of RNAs, for instance, because an HMM cannot describe base pairs. By contrast, protein ``threading'' methods do include statistical terms for nearby amino acids in a protein structure.
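The effect of the independence assumption can be shown with a small sketch. The four two-residue ``sequences'' below are hypothetical training data in which the two positions always form a Watson-Crick base pair. The model is a bare per-column frequency table, which is a simplification of an HMM's match-state emissions, not a full HMM.

```python
from collections import Counter

# Hypothetical training data: the two positions always form a base pair.
PAIRS = ["AU", "UA", "GC", "CG"]


def column_freqs(seqs):
    """Per-column residue frequencies, estimated independently per column."""
    return [
        {b: n / len(column) for b, n in Counter(column).items()}
        for column in zip(*seqs)
    ]


def independent_prob(seq, freqs):
    """Probability of seq when every position is treated as independent."""
    p = 1.0
    for b, f in zip(seq, freqs):
        p *= f.get(b, 0.0)
    return p


def empirical_prob(seq, seqs):
    """Fraction of training sequences identical to seq."""
    return seqs.count(seq) / len(seqs)


FREQS = column_freqs(PAIRS)
```

Under these assumptions the independent model assigns the paired sequence `"AU"` and the unpaired sequence `"AA"` the same probability, 1/16, even though `"AA"` never occurs in the training data and `"AU"` makes up a quarter of it. The pairing information is lost because each column, taken alone, looks uniformly distributed.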

Rabiner [13] has written a general definition of HMMs and an excellent tutorial introduction to their use. Throughout, I will use ``HMM'' to refer to the specific case of sequence-profile-type HMMs as described by Krogh et al. [10]. This shorthand usage is for convenience only.






Sean Eddy
Mon Apr 17 09:54:19 CDT 1995