Pittsburgh Supercomputing Center 

Advancing the state-of-the-art in high-performance computing,
communications and data analytics.

SNAP

 

SNAP is a general-purpose gene-finding program suitable for both eukaryotic and prokaryotic genomes. SNAP is an acronym for Semi-HMM-based Nucleic Acid Parser.

Installed on blacklight and biou.


Running SNAP

On blacklight

The SNAP programs are made available through the module command. To load the SNAP module, enter:

module load snap-hmm

On biou

The SNAP programs are available through the Galaxy instance on biou.

To make the SNAP programs available on the command line, csh users should enter the following command:

source /packages/bin/SETUP_BIO_SOFTWARE

To make the SNAP programs available on the command line, bash users should enter the following command:

source /packages/bin/SETUP_BIO_SOFTWARE

SNAP Command line usage

snap [options] <HMM file> <FASTA file> [options]

options:

-lcmask treat lowercase as N
-plus predict on plus strand only
-minus predict on minus strand only
-gff output annotation as GFF
-ace output annotation as ACEDB
-aa <file> create FASTA file of proteins
-tx <file> create FASTA file of transcripts
-xdef <file> external definitions
-name <string> name for the gene [default snap]
-quiet do not send progress to STDERR
-help report useful information

HMM model files for SNAP

To use SNAP, you must either build your own HMM model file or use a precompiled HMM model file.

Using a pre-compiled HMM model file

 

A number of precompiled HMM model files are included in the SNAP release. These files include:

  • Acanium.hmm
  • A.gambiae.hmm
  • A.mellifera.hmm
  • A.thaliana.hmm
  • At.hmm
  • B.malayi.hmm
  • B.mori.hmm
  • Ce.hmm
  • C.elegans.hmm
  • C.intestinalis.hmm
  • D.melanogaster.hmm
  • Dm.hmm
  • ixodesA.hmm
  • ixodesB.hmm
  • mam39.hmm
  • mam39-ro.hmm
  • mam46.hmm
  • mam46-ro.hmm
  • mam54.hmm
  • mam54-ro.hmm
  • mamiso.hmm
  • minimal.hmm
  • Nasonia.hmm
  • nGASP.hmm
  • nGASPr.hmm
  • O.sativa.hmm
  • Os.hmm
  • worm1.hmm

     

  • brugia - same as B.malayi.hmm
  • ciona - same as C.intestinalis.hmm
  • fly - same as D.melanogaster.hmm
  • mosquito - same as A.gambiae.hmm
  • rice - same as O.sativa.hmm
  • thale - same as At.hmm
  • worm - same as Ce.hmm

 

If your genome is represented above (or is a close relative of a genome represented above), you may use the precompiled HMM model file by giving its path as the HMM file argument to snap. The precompiled HMM model files can be found in the directory:

$SNAPHMM_HOME/HMM

For example, to use the precompiled model for the D.melanogaster genome on a set of sequences contained in the file fasta.file, you would enter the following on the command line:

% snap $SNAPHMM_HOME/HMM/D.melanogaster.hmm fasta.file
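The command above can be wrapped in a small shell function that checks its inputs before calling snap. This is a minimal sketch, not part of the SNAP distribution: the function name snap_predict is hypothetical, and it assumes that SNAPHMM_HOME is set and that snap is on your PATH. Any arguments after the FASTA file are passed to snap unchanged, so the options listed under "SNAP Command line usage" can be appended.

```shell
#!/bin/sh
# snap_predict MODEL FASTA [snap options...]
# Hypothetical helper: run snap with a precompiled model from $SNAPHMM_HOME/HMM.
snap_predict() {
    if [ "$#" -lt 2 ]; then
        echo "usage: snap_predict MODEL FASTA [snap options...]" >&2
        return 2
    fi
    model=$1
    fasta=$2
    shift 2

    # The location of the precompiled models depends on SNAPHMM_HOME.
    if [ -z "${SNAPHMM_HOME:-}" ]; then
        echo "SNAPHMM_HOME is not set" >&2
        return 1
    fi
    hmm="$SNAPHMM_HOME/HMM/$model"

    # Fail early with a clear message instead of letting snap report it.
    [ -r "$hmm" ] || { echo "cannot read model file: $hmm" >&2; return 1; }
    [ -r "$fasta" ] || { echo "cannot read FASTA file: $fasta" >&2; return 1; }

    # Remaining arguments (for example -gff or -quiet) go to snap as given.
    snap "$hmm" "$fasta" "$@"
}
```

For example, `snap_predict D.melanogaster.hmm fasta.file -gff -quiet > fasta.gff` would write GFF annotation to fasta.gff (a file name chosen here only for illustration) without progress messages on STDERR.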

Compiling your own HMM model files

Note: The author of SNAP would like to be contacted should you wish to train SNAP for a new genome. A minimal parameter estimation procedure is outlined below. There are a number of options for forge and hmm-assembler.pl that are not described below.

  1. Prepare the sequences and gene structures
    • Sequences must be in FASTA format. It is best to avoid genes that are closely related to each other.
    • Gene structures must be in ZFF short format. In the short format, the sequence records are separated by a definition line, just like FASTA. There are 4 fields: Label, Begin, End, Group. The 4th field is optional. Label is a controlled vocabulary including:
      • Esngl: single exon gene
      • Einit: initial exon
      • Eterm: terminal exon
      • Exon: generic or internal exon

       

      See $SNAPHMM_HOME/Zoe/zoeFeature.h for a complete list.

      All exons of a gene must share the same unique Group name. The strand of the feature is implied in the coordinates, so if Begin > End, the feature is on the minus strand. Here's an example with two sequences, each containing a single gene on the plus strand:

      >sequence-1
      Einit 201 325 Y73E7A.6
      Eterm 2175 2319 Y73E7A.6
      >sequence-2
      Einit 201 462 Y73E7A.7
      Exon 1803 2031 Y73E7A.7
      Exon 2929 3031 Y73E7A.7
      Exon 3467 3624 Y73E7A.7
      Exon 4185 4406 Y73E7A.7
      Eterm 5103 5280 Y73E7A.7
      

      The most important part of parameter estimation is preparing a training set. There are many ways to go about this. Whatever approach you use, the gene structures must end up in the ZFF short format. Save the ZFF as genome.ann and the FASTA as genome.dna.

  2. Look at some features of the genes:
    fathom genome.ann genome.dna -gene-stats
  3. Verify that the genes have no obvious errors:
    fathom genome.ann genome.dna -validate

    You may find some errors and warnings. Inspect these in a genome browser and remove the genes that are real errors.

  4. Break up the sequences into fragments with one gene per sequence with the following command:
    fathom genome.ann genome.dna -categorize 1000

    There will be up to 1000 bp on either side of the genes. You will find several new files.

    • alt.ann, alt.dna (genes with alternative splicing)
    • err.ann, err.dna (genes that have errors)
    • olp.ann, olp.dna (genes that overlap other genes)
    • wrn.ann, wrn.dna (genes with warnings)
    • uni.ann, uni.dna (single gene per sequence)

       

  5. Convert the uni genes to plus stranded with the command:
    fathom uni.ann uni.dna -export 1000 -plus

    You will find 4 new files:

    • export.aa proteins corresponding to each gene
    • export.ann gene structure on the plus strand
    • export.dna DNA of the plus strand
    • export.tx transcripts for each gene

       

  6. The parameter estimation program, forge, creates a lot of files. You probably want to create a directory to keep things tidy before you execute the program.
    mkdir params
    cd params
    forge ../export.ann ../export.dna
    cd ..
  7. Finally, build an HMM.
    hmm-assembler.pl my-genome params > my-genome.hmm
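The ZFF conventions from step 1 and the commands from steps 2 to 7 can be sketched as two shell functions. This is an illustrative sketch under stated assumptions, not a script shipped with SNAP: the function names zff_strands and train_snap and the report file names gene-stats.txt and validate.txt are hypothetical, fathom, forge and hmm-assembler.pl are assumed to be on your PATH, and the reports are assumed to be written to standard output or standard error. The manual review in step 3 is not automated; the validation report is only saved so that you can inspect it.

```shell
#!/bin/sh
# Hypothetical helpers; fathom, forge and hmm-assembler.pl are assumed to be on PATH.

# zff_strands FILE
# Print "sequence group label strand" for every feature in a ZFF short-format file.
zff_strands() {
    awk '
        /^>/ { seq = substr($1, 2); next }          # definition line starts a new record
        NF >= 3 {
            strand = ($2 + 0 > $3 + 0) ? "-" : "+"  # Begin > End implies the minus strand
            group = (NF >= 4) ? $4 : "."            # the Group field is optional
            print seq, group, $1, strand
        }
    ' "$1"
}

# train_snap NAME ANN DNA [FLANK]
# Run steps 2 to 7 in the current directory and write NAME.hmm.
train_snap() {
    if [ "$#" -lt 3 ]; then
        echo "usage: train_snap NAME ANN DNA [FLANK]" >&2
        return 2
    fi
    name=$1
    ann=$2
    dna=$3
    flank=${4:-1000}    # bases kept on either side of each gene

    # Steps 2 and 3: save the reports for manual review (file names are arbitrary).
    fathom "$ann" "$dna" -gene-stats > gene-stats.txt 2>&1
    fathom "$ann" "$dna" -validate > validate.txt 2>&1

    # Step 4: split into one gene per sequence (creates the uni.* files, among others).
    fathom "$ann" "$dna" -categorize "$flank" || return 1

    # Step 5: export the single-gene set on the plus strand (creates export.*).
    fathom uni.ann uni.dna -export "$flank" -plus || return 1

    # Step 6: estimate parameters inside params/ to keep the many output files together.
    mkdir -p params || return 1
    (cd params && forge ../export.ann ../export.dna) || return 1

    # Step 7: assemble the HMM from the estimated parameters.
    hmm-assembler.pl "$name" params > "$name.hmm"
}
```

With these definitions loaded, `zff_strands genome.ann` lists the implied strand of each feature, and `train_snap my-genome genome.ann genome.dna` runs the steps with the 1000 bp flank used above.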