
What do HMM scores mean?

Most of the programs report a score for the alignment of the model to a sequence. A very basic and important question is, what does this score mean?

The short answer is that the score is related to the statistical significance of the alignment. A score of zero is marginal; according to the model's statistics, it's 50% likely that the alignment is a real match to the model, and 50% likely that it's not. The higher the score, the better.
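As a rough illustration of the "50% likely" reading, the sketch below converts a bit score into a probability that the alignment is a real match. It assumes the score is a log-odds ratio in bits and that matches and non-matches are equally likely beforehand (prior = 0.5); the function name and the prior parameter are hypothetical and are not part of any HMM program.

```python
import math

def posterior_match_probability(score_bits, prior=0.5):
    """Probability of a real match given a log-odds score in bits.

    Assumption: `prior` is the a priori probability of a match;
    0.5 reproduces the "score of zero is marginal" reading.
    """
    # Combine the evidence (score) with the prior odds, all in log2 space.
    log2_odds = score_bits + math.log2(prior / (1.0 - prior))
    # Branch so that very large or very small scores do not overflow.
    if log2_odds >= 0:
        return 1.0 / (1.0 + 2.0 ** -log2_odds)
    odds = 2.0 ** log2_odds
    return odds / (1.0 + odds)

marginal = posterior_match_probability(0.0)   # 0.5
strong = posterior_match_probability(10.0)    # 1024 / 1025
```

Lowering the prior shifts the break-even point upward, which is why a score above zero is needed in a real database search.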

More precisely, the score is a log odds ratio: the logarithm, base two (bits, in information theory parlance), of the probability of the alignment to the model divided by the probability of the sequence given the "random" overall composition of sequences (i.e. the product of the frequencies of each amino acid in the sequence). A score of 100 bits means it's 2^100-fold more likely that the sequence is a match to the model than that it is random sequence.
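The definition can be sketched in a few lines. This is a toy example, not the programs' actual code: the four-letter alphabet, the background frequencies, and the value of model_log2_prob are all made up for illustration, and the probability of the alignment to the model is simply supplied as a number rather than computed from an HMM.

```python
import math

# Hypothetical background frequencies for a toy four-letter alphabet.
BACKGROUND = {"A": 0.25, "C": 0.25, "D": 0.25, "E": 0.25}

def null_log2_prob(seq, background):
    """log2 of the product of the background frequencies of each residue."""
    return sum(math.log2(background[residue]) for residue in seq)

def log_odds_score(model_log2_prob, seq, background):
    """Score in bits: log2 P(alignment | model) - log2 P(seq | random)."""
    return model_log2_prob - null_log2_prob(seq, background)

seq = "ACDE"
model_log2_prob = -5.0  # hypothetical: the model assigns probability 2**-5
score = log_odds_score(model_log2_prob, seq, BACKGROUND)  # 3.0 bits
fold = 2.0 ** score     # 8-fold more likely under the model
```

Working with log probabilities throughout avoids underflow, since the raw probability of a long sequence is far too small to represent directly.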

Usually one expects to see many more non-hits than hits a priori in a database search. Therefore it takes a higher score than zero to be significant. Exactly how much higher is directly related to the a priori expectation; for instance, if one expects to find fewer than 1 hit in a database of 50,000 unrelated proteins, significant scores will be greater than log2 (50,000 / 1), which is about 16 bits, rather than 0 bits. None of the HMM programs correct for the size of the database; that's up to you. hmmsw and hmmfs do correct for the size of the model and the size of the target sequence, unlike most implementations of traditional Smith/Waterman database searching.
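Since the database-size correction is left to the user, it may help to see the arithmetic spelled out. The helper below is a hypothetical convenience function, not something the HMM programs provide; it assumes the threshold is simply the base-two log of the ratio of unrelated sequences to expected true hits.

```python
import math

def significance_threshold_bits(n_unrelated, expected_hits=1.0):
    """Bit score a hit must exceed to overcome the a priori odds against it.

    Assumption: roughly `expected_hits` true matches among
    `n_unrelated` unrelated sequences in the database.
    """
    return math.log2(n_unrelated / expected_hits)

threshold = significance_threshold_bits(50_000)       # about 15.6 bits
bigger_db = significance_threshold_bits(5_000_000)    # about 22.3 bits
```

Each doubling of the database adds one bit to the threshold, so even a hundredfold larger database raises it by fewer than 7 bits.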

It's possible to build very small or very indiscriminate models which never find significant matches, because there's not enough information in the model to begin with.

This log-odds scoring system is more convenient than using raw log likelihoods, because it corrects for a strong sequence length dependence of the log likelihoods. It is analogous to the scoring system used by the BLAST suite of programs, which also report scores in bits [8]. Altschul has discussed the information theoretic interpretation of such scores [1]; although he specifically deals with ungapped BLAST alignments, much of his discussion also applies to gapped HMM alignments.

There is one important caveat. The score assumes that unrelated sequences look like random sequence with composition similar to the average over all the database. Many sequences are actually strongly composition-biased. The HMM search programs may report a spurious match to some composition-biased sequences, depending on the bias in the model. A model of a very hydrophobic protein family may match poly-leucine sequences, for instance. Watch out for this.
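A toy calculation shows how such a spurious match can arise. Everything here is an assumption made for illustration: the frequencies are invented, the "model" is reduced to a single position-independent emission table, and transitions are ignored, so this is far simpler than a real HMM.

```python
import math

# Hypothetical frequencies, chosen only for illustration.
BACKGROUND = {"L": 0.09, "K": 0.06}         # database-average composition
HYDROPHOBIC_MODEL = {"L": 0.30, "K": 0.01}  # toy leucine-rich emission table

def composition_score_bits(seq, model, background):
    """Sum of per-residue log-odds, in bits, for a position-independent model."""
    return sum(math.log2(model[residue] / background[residue]) for residue in seq)

poly_leu = "L" * 20
poly_lys = "K" * 20
leu_score = composition_score_bits(poly_leu, HYDROPHOBIC_MODEL, BACKGROUND)
lys_score = composition_score_bits(poly_lys, HYDROPHOBIC_MODEL, BACKGROUND)
```

With these made-up numbers the poly-leucine stretch scores well above the roughly 16 bits needed in a 50,000-sequence search, purely because its composition resembles the model's, while the poly-lysine stretch scores strongly negative.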






Sean Eddy
Mon Apr 17 09:54:19 CDT 1995