The first thing you will want to do is build a model of your sequence family of interest. If you have a multiple sequence alignment, you can quickly produce an HMM from the alignment using hmmb (which stands for "HMM build"). The following command produces an HMM from the aligned sequences in globins50.msf:
> hmmb globin.hmm globins50.msf
The HMM file globin.hmm is in binary format. You are not meant to be able to read it. Later, I will discuss the program hmm-convert, which can convert these files into a readable ASCII format.
By default, hmmb produces maximum likelihood (ML) models. ML models work well if your sequence data are well-distributed, that is, if the alignment does not contain many duplicate or highly similar sequences. If some sequences in your dataset are overrepresented relative to others, an ML model will be biased by that overrepresentation, and it might recognize the underrepresented sequences poorly or not at all. A simple solution is to weight the sequences so that underrepresented ones count more towards the model's statistics. Various weighting rules have been proposed in the literature. hmmb -w uses a simple and effective weighting rule proposed by Erik Sonnhammer and Richard Durbin:
> hmmb -w globin-wgt_ml.hmm globins50.msf
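To see why weighting changes the model's statistics, here is a minimal Python sketch of weighted versus unweighted frequency estimation for a single alignment column. It is only an illustration of the general idea: it is not the Sonnhammer/Durbin rule and not hmmb's actual code, and the function name column_frequencies, the toy column, and the weight values are all hypothetical.

```python
from collections import Counter


def column_frequencies(column, weights=None):
    """Return {residue: relative frequency} for one alignment column.

    With weights=None every sequence counts equally (plain ML estimate).
    Otherwise each sequence contributes its weight instead of a count of 1.
    """
    if weights is None:
        weights = [1.0] * len(column)
    totals = Counter()
    for residue, weight in zip(column, weights):
        totals[residue] += weight
    total = sum(totals.values())
    return {residue: count / total for residue, count in totals.items()}


# Toy column: four near-identical sequences contribute 'A',
# one divergent sequence contributes 'G'.
column = ["A", "A", "A", "A", "G"]

# Hypothetical weights (an assumption for illustration only):
# the four redundant sequences share the weight of a single sequence.
weights = [0.25, 0.25, 0.25, 0.25, 1.0]

unweighted = column_frequencies(column)
weighted = column_frequencies(column, weights)

print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

In the unweighted estimate the redundant sequences dominate the column. With the hypothetical weights, the divergent sequence counts as much as the whole redundant group, which is the effect a weighting rule is meant to achieve.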
hmmb also provides a more sophisticated alternative to maximum likelihood and weighted maximum likelihood, which we call "maximum discrimination" training [5]. The -d option calls for maximum discrimination (MD) training. This rule optimizes the ability of the model to recognize all the example sequences and discriminate them from unrelated sequences. In my experiments, MD models seem to be slightly better at recognizing distant homologues than ML models. I recommend that you explore the use of MD models as well as ML models for database searching. The following command produces an MD model of the 50 globins:
> hmmb -d globin-md.hmm globins50.msf