Pittsburgh Supercomputing Center 

Advancing the state-of-the-art in high-performance computing,
communications and data analytics.

BLAST

   The Basic Local Alignment Search Tool (BLAST) finds regions of
   local similarity between sequences. The program compares nucleotide
   or protein sequences to sequence databases and calculates the
   statistical significance of matches. BLAST can be used to infer
   functional and evolutionary relationships between sequences as well
   as help identify members of gene families.

   There are many search programs in the blast suite, depending on the
   type of analysis to be done:
   
   blastn - Search a nucleotide database using a nucleotide query
            Methods: blastn, megablast, discontiguous megablast
    
   blastp - Search protein database using a protein query
            Methods: blastp, psi-blast, phi-blast, delta-blast

   blastx - Search protein database using a translated nucleotide query

   tblastn - Search translated nucleotide database using a protein query

   tblastx - Search translated nucleotide database using a translated
             nucleotide query

   psiblast - Position-Specific Iterated BLAST

   rpsblast - Reverse Position Specific BLAST

   rpstblastn - Translated Reverse Position Specific BLAST

   deltablast - Domain Enhanced Lookup Time Accelerated BLAST

Installed on blacklight, biou

Other resources that may be helpful include:

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990)
    "Basic local alignment search tool."
    J. Mol. Biol. 215:403-410.

    Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z.,
    Miller, W. & Lipman, D.J. (1997)
    "Gapped BLAST and PSI-BLAST: a new generation of protein database
    search programs." Nucleic Acids Res. 25:3389-3402

    Website: http://www.ncbi.nlm.nih.gov/books/NBK1762/


Running BLAST

1) Make BLAST programs available for use
   a) blacklight:
   The BLAST programs are made available through
   the module command. To load the BLAST module enter:

   module load ncbi-blast
 
   b) biou:

   The BLAST programs are available through the Galaxy instance on biou.

   To make the BLAST programs available through the command line,
   csh users should enter the following command:

   % source /packages/bin/SETUP_BIO_SOFTWARE

   To make the BLAST programs available through the command line, bash
   users should enter the following command:
 
   % source /packages/bin/SETUP_BIO_SOFTWARE

2) General Usage:

   To find the general usage of an individual program in the BLAST suite,
   use the -help flag. For example:

   blastn -help
   blastp -help
   blastx -help
   deltablast -help
   makeblastdb -help
   makeprofiledb -help
   psiblast -help
   rpsblast -help
   rpstblastn -help
   tblastn -help
   tblastx -help
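
   Before reading the help text, it can be useful to confirm which of
   these programs are actually on your PATH after the setup step in 1).
   The following is a minimal sketch, not part of the PSC setup: the
   function name check_blast_programs is hypothetical, and it relies only
   on the shell built-in "command -v".

```shell
# Report, for each program name given, whether it is found on the PATH.
# Returns non-zero if at least one program is missing.
check_blast_programs() {
    local prog missing=0
    for prog in "$@"; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "found:   $prog"
        else
            echo "missing: $prog"
            missing=1
        fi
    done
    return $missing
}

# Example call; "|| true" keeps a calling script going if something is missing.
check_blast_programs blastn blastp blastx tblastn tblastx makeblastdb || true
```

   If a program is reported as missing, repeat the module or setup
   command for your machine before continuing.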

   To run BLAST using your own FASTA-formatted sequence collection as a
   database, convert the collection to BLAST database format before
   running the search. The program in the BLAST suite that builds a
   BLAST database from a FASTA sequence collection is called
   "makeblastdb". For example:
 
   makeblastdb -in uniprot_sprot.fasta -dbtype prot

   After the database is formatted, run the desired blast program. For
   example:

   blastp -query myquery.fasta -num_threads 16 -db uniprot_sprot.fasta -out blastp.out
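
   The two steps above can be combined so that the database is formatted
   only when needed. This is a sketch under stated assumptions: the
   function names are hypothetical, and the check assumes a single-volume
   protein database, for which makeblastdb writes .phr, .pin and .psq
   files next to the input (very large databases are split into volumes
   with different file names, and newer BLAST+ releases write additional
   files).

```shell
# Return 0 if the three index files that makeblastdb writes for a
# single-volume protein database (.phr, .pin, .psq) are all present.
protein_db_is_formatted() {
    local db=$1
    [ -f "$db.phr" ] && [ -f "$db.pin" ] && [ -f "$db.psq" ]
}

# Format the FASTA file only when the index files are missing.
format_if_needed() {
    local db=$1
    if protein_db_is_formatted "$db"; then
        echo "already formatted: $db"
    else
        echo "formatting: $db"
        makeblastdb -in "$db" -dbtype prot
    fi
}

# Usage (commented out so the sketch can be sourced without side effects):
# format_if_needed uniprot_sprot.fasta
# blastp -query myquery.fasta -num_threads 16 -db uniprot_sprot.fasta -out blastp.out
```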


3) PBS Examples (blacklight)

   a) blastp (16 cores): This example illustrates the simplest way to run
      the blastp program on blacklight using 16 threads:

      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Blastp
      #
      source /usr/share/modules/init/bash
      module load ncbi-blast
      module load trinotate
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Blastp
      PROT=/brashear/$USER/protein.fasta
      BLASTTHREADS=16
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Take protein file and run BLAST against SWISSPROT:
      #
      date
      blastp -query $PROT -num_threads $BLASTTHREADS -db $TRDB_SPROT \
             -out blastp.out > blastp.log 2>&1
      ja -set $SCRATCH/$$.ja


   b) Scalable BLAST (32 core): This example illustrates how to use
      fasta_splitter to scale the blastp program on blacklight using
      2 parallel runs at 16 threads each:

      #!/bin/bash
      #PBS -l ncpus=32
      #PBS -l walltime=96:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_blastp
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinity_Output/Trinotate_Blastp
      TRINITY_PROT=/brashear/$USER/Trinity_Output/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      BLASTRUNS=2    # Number of independent BLAST runs
      BLASTTHREADS=16 # Number of BLAST Threads ( BLASTTHREADS * BLASTRUNS = NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run blast independently
      #
      cp $TRINITY_PROT Trinity_protein.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence $BLASTRUNS Trinity_protein.fasta
      ls -l
      #
      # TRINOTATE: Take protein file and run BLAST against SWISSPROT:
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_protein.part*
      do
        ((PLACETHROUGH= PLACEON + BLASTTHREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH blastp \
          -query $F -num_threads $BLASTTHREADS -db $TRDB_SPROT -outfmt 6 \
          -max_target_seqs 1 \
          -out blastp_$PART.out \
          > blastp_$PART.log 2>&1 &
        ((PLACEON= PLACEON + BLASTTHREADS))
        ((PART=PART + 1))
      done
      wait
      cat blastp_*.out > Trinity_protein_blastp_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja
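
      The loop above pins each BLAST run to its own block of cores through
      the PLACEON and PLACETHROUGH variables. The sketch below repeats only
      that arithmetic so the resulting core ranges can be inspected before
      a job is submitted. The function name print_core_ranges is
      hypothetical and nothing here calls dplace or blastp.

```shell
# Print the core range assigned to each run, using the same arithmetic
# as the job script: each run gets cores start..start+threads-1.
print_core_ranges() {
    local runs=$1 threads=$2
    local part=1 start=0 end
    while [ "$part" -le "$runs" ]; do
        end=$((start + threads - 1))
        echo "part $part: cores $start-$end"
        start=$((start + threads))
        part=$((part + 1))
    done
}

print_core_ranges 2 16   # the layout used in example b)
print_core_ranges 4 8    # an alternative split of the same 32 cores
```

      For 2 runs of 16 threads this prints "part 1: cores 0-15" and
      "part 2: cores 16-31". The product of the two arguments should equal
      the ncpus value requested from PBS.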

 

Trinotate

   Trinotate is an annotation method suitable for computationally assembled
   transcripts from RNAseq sequencing data.

Installed on blacklight

Other resources that may be helpful include:

    Website: http://trinotate.sourceforge.net/


Running Trinotate

1) Make Trinotate programs available for use
  
   The Trinotate process relies on a number of underlying programs and
   sequence databases. To make all of these programs available for use,
   use the following module command:

   module load trinotate

   This module will load a number of modules including: trinotate_db,
   ncbi-blast, signalp, tmhmm, hmmer, RNAmmer.   
 

2) General Usage:

   The general Trinotate process is as follows. With the transcripts:

   a) Run blastx with the transcripts against uniprot-swissprot
   b) Run RNAmmer with the transcripts
   c) Generate conceptual protein translations of the transcripts
      1) Run blastp with the conceptual protein translations against
         uniprot-swissprot
      2) Run hmmscan with the conceptual protein translations against
         PFAM
      3) Run signalp with the conceptual protein translations
      4) Run tmhmm with the conceptual protein translations
   d) When the above runs are done, load them into a pre-populated
      SQLite database that contains annotation information (including go
      terms) linked to the uniprot-swissprot and pfam identifiers.     
   e) Query the database and generate a report that can be viewed in
      spreadsheet software.

      In general, the blastx and blastp steps take the most time.
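
   Step d) can only start once steps a) through c) have finished. When
   the steps are run as separate batch jobs, one way to express that
   ordering is a PBS job dependency. The sketch below assumes that the
   local PBS installation accepts "qsub -W depend=afterok:..." and that
   qsub prints the job ID on standard output; the helper name
   depend_afterok and the *.pbs file names are hypothetical.

```shell
# Join job IDs into a PBS dependency string, e.g. "afterok:101:102".
depend_afterok() {
    local list="afterok" id
    for id in "$@"; do
        list="$list:$id"
    done
    echo "$list"
}

# Hypothetical usage (commented out; the script names are placeholders):
# BX=$(qsub trinotate_blastx.pbs)
# BP=$(qsub trinotate_blastp.pbs)
# qsub -W depend="$(depend_afterok "$BX" "$BP")" trinotate_database.pbs
```

   With this pattern the database job stays queued until every listed
   job has exited successfully.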

3) PBS Examples

   a) Below is a simple example that runs most of the Trinotate process in
      one PBS job. If you have many transcripts (> 10,000), we do not
      recommend running the entire process in one batch job in the way
      illustrated below.   

      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate
      #
      source /usr/share/modules/init/bash
      module load java
      module load trinity
      module load trinotate
      module load sqlite
      set -x
      ja $SCRATCH/$$.ja
      #
      # Sample Data
      #
      OUTDIR=/brashear/$USER/Trinotate
      PROT=/brashear/$USER/Protein/best_candidates.eclipsed_orfs_removed.pep
      NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Here we start with JUST transcripts and protein coding regions in
      # those transcripts.
      #
      # STEP 1: Take protein file and run BLAST against SWISSPROT:
      #
      blastp -query $PROT -db $TRDB_SPROT -num_threads 16 -max_target_seqs 1 \
       -outfmt 6 > TrinotateBlast.out
      #
      # STEP 2: Take protein file and run hmmscan against PFAM-A:
      #
      hmmscan --cpu 16 --domtblout TrinotatePFAM.out $TRDB_PFAM $PROT > pfam.log
      #
      # STEP 3: Run signalp to predict signal peptides:
      #
      signalp -f short -n signalp.out $PROT
      #
      # STEP 4: Run tmhmm to predict transmembrane regions:
      #
      tmhmm --short < $PROT > tmhmm.out
      #
      # STEP 5: Take nucleic acid file and run BLAST against SWISSPROT:
      #
      blastx \
         -query $NA -num_threads 16 -db $TRDB_SPROT -outfmt 6 \
         -max_target_seqs 1 \
         -out TrinotateBlastx.out
      #
      # STEP 6: Run RNAmmer to identify rRNA transcripts:
      #
      $TRINOTATE_HOME/util/rnammer_support/RnammerTranscriptome.pl \
       --transcriptome $NA --path_to_rnammer $RNAMMER_HOME/rnammer \
       >& RnammerTranscriptome.log
      #
      # STEP 7: Generate Gene/Transcript relationships
      #
      $TRINITY_HOME/util/get_Trinity_gene_to_trans_map.pl $NA \
          >  Trinity.fasta.gene_trans_map
      #
      # STEP 8: Initialize SQLITE Database
      #
      cp $TRDB_SQLITE Trinotate.sqlite
      Trinotate.pl init --gene_trans_map Trinity.fasta.gene_trans_map \
         --transcript_fasta $NA --transdecoder_pep $PROT
      #
      # STEP 9: Load Blast Results
      #
      Trinotate.pl LOAD_blast TrinotateBlast.out
      #
      # STEP 10: Load PFAM Results
      #
      Trinotate.pl LOAD_pfam TrinotatePFAM.out
      #
      # STEP 11: Load tmhmm Results
      #
      Trinotate.pl LOAD_tmhmm tmhmm.out
      #
      # STEP 12: Load signalp Results
      #
      Trinotate.pl LOAD_signalp signalp.out
      #
      # STEP 13: Load rnammer Results
      #
      Trinotate.pl LOAD_rnammer Trinity.fasta.rnammer.gff
      #
      # STEP 14: Load blastx Results
      #
      Trinotate.pl LOAD_blastx TrinotateBlastx.out
      #
      # STEP 15: Generate annotation report
      #
      Trinotate.pl report > trinotate_annotation_report.xls
      #
      ja -set $SCRATCH/$$.ja


   b) Below are example runs of the Trinotate processes as individual PBS
      jobs. This is the recommended method if you have a large number of
      transcripts.

      **********************
      ******* BLASTX *******
      **********************

      #!/bin/bash
      #PBS -l ncpus=32
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_blastx
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_blastx
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      BLASTRUNS=2    # Number of independent BLAST runs
      BLASTTHREADS=16 # Number of BLAST Threads ( BLASTTHREADS*BLASTRUNS = NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run blast independently
      #
      cp $TRINITY_NA Trinity_na.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence \
           $BLASTRUNS Trinity_na.fasta
      ls -l
      #
      # TRINOTATE: Take nucleotide file and run BLASTX against SWISSPROT:
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_na.part*
      do
        ((PLACETHROUGH= PLACEON + BLASTTHREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH blastx \
          -query $F -num_threads $BLASTTHREADS -db $TRDB_SPROT -outfmt 6 \
          -max_target_seqs 1 \
          -out blastx_$PART.out \
          > blastx_$PART.log 2>&1 &
        ((PLACEON= PLACEON + BLASTTHREADS))
        ((PART=PART + 1))
      done
      wait
      ls -l
      cat blastx_*.out > Trinity_na_blastx_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja
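
      Because each part is searched independently, it is worth checking
      that the split kept every sequence before the BLAST runs start. The
      following sketch counts FASTA header lines in the original file and
      in the parts and compares the totals. The function names are
      hypothetical; the file names in the usage comment are the ones used
      in the job above.

```shell
# Count FASTA records (lines starting with ">") across the given files.
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so "|| true" keeps going.
    cat "$@" | grep -c '^>' || true
}

# Compare the record count of the original file with the sum over its parts.
check_split() {
    local original=$1
    shift
    local total parts
    total=$(count_fasta_records "$original")
    parts=$(count_fasta_records "$@")
    if [ "$total" -eq "$parts" ]; then
        echo "OK: $total records"
    else
        echo "MISMATCH: $total records in original, $parts in parts" >&2
        return 1
    fi
}

# Usage inside the job, after fasta_splitter.pl has run (commented out):
# check_split Trinity_na.fasta Trinity_na.part* || exit 1
```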

      ***********************
      ******* RNAMMER *******
      ***********************

      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_rnammer
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load perl/5.12.5-threads
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_RNAmmer
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      ls -l
      #
      # TRINOTATE: Take na file and run rnammer
      #
      date
      $TRINOTATE_HOME/util/rnammer_support/RnammerTranscriptome.pl \
        --transcriptome $TRINITY_NA --path_to_rnammer $RNAMMER_HOME/rnammer \
        >& RnammerTranscriptome.log
      date
      ja -set $SCRATCH/$$.ja


      **********************
      ******* BLASTP *******
      **********************


      #!/bin/bash
      #PBS -l ncpus=32
      #PBS -l walltime=96:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_blastp
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_blastp
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      BLASTRUNS=2    # Number of independent BLAST runs
      BLASTTHREADS=16 # Number of BLAST Threads (BLASTTHREADS*BLASTRUNS=NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run blast independently
      #
      cp $TRINITY_PROT Trinity_protein.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence \
           $BLASTRUNS Trinity_protein.fasta
      ls -l
      #
      # TRINOTATE: Take protein file and run BLAST against SWISSPROT:
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_protein.part*
      do
        ((PLACETHROUGH= PLACEON + BLASTTHREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH blastp \
          -query $F -num_threads $BLASTTHREADS -db $TRDB_SPROT -outfmt 6 \
          -max_target_seqs 1 \
          -out blastp_$PART.out \
          > blastp_$PART.log 2>&1 &
        ((PLACEON= PLACEON + BLASTTHREADS))
        ((PART=PART + 1))
      done
      wait
      cat blastp_*.out > Trinity_protein_blastp_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja


      ********************
      ******* PFAM *******
      ********************


      #!/bin/bash
      #PBS -l ncpus=32
      #PBS -l walltime=96:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_pfam
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_PFAM
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      HMMERRUNS=2    # Number of independent hmmer runs
      HMMERTHREADS=16 # Number of hmmer Threads (HMMERTHREADS*HMMERRUNS=NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run hmmscan independently
      #
      cp $TRINITY_PROT Trinity_protein.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence \
           $HMMERRUNS Trinity_protein.fasta
      ls -l
      #
      # TRINOTATE: Take protein file and run hmmscan against PFAM:
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_protein.part*
      do
        ((PLACETHROUGH= PLACEON + HMMERTHREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH hmmscan \
          --cpu $HMMERTHREADS --domtblout PFAM_$PART.out \
          $TRDB_PFAM $F > pfam_$PART.log 2>&1 &
        ((PLACEON= PLACEON + HMMERTHREADS))
        ((PART=PART + 1))
      done
      wait
      ls -l
      cat PFAM_*.out > Trinity_protein_pfam_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja


      ***********************
      ******* SIGNALP *******
      ***********************


      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_signalp
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_signalp
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      RUNS=16    # Number of independent runs
      THREADS=1  # Number of Threads ( THREADS * RUNS = NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run independently
      #
      cp $TRINITY_PROT Trinity_prot.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence \
           $RUNS Trinity_prot.fasta
      ls -l
      #
      # TRINOTATE: Take protein file and run signalp
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_prot.part*
      do
        ((PLACETHROUGH= PLACEON + THREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH signalp \
          -f short -n signalp_$PART.out $F \
          > signalp_$PART.log 2>&1 &
        ((PLACEON= PLACEON + THREADS))
        ((PART=PART + 1))
      done
      wait
      ls -l
      cat signalp_*.out > Trinity_prot_signalp_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja


      *********************
      ******* TMHMM *******
      *********************


      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_tmhmm
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      module unload perl
      module load perl/5.12.3
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_tmhmm
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      RUNS=16    # Number of independent runs
      THREADS=1 # Number of Threads ( THREADS * RUNS = NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run independently
      #
      cp $TRINITY_PROT Trinity_prot.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence \
           $RUNS Trinity_prot.fasta
      ls -l
      #
      # TRINOTATE: Take protein file and run tmhmm
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_prot.part*
      do
        ((PLACETHROUGH= PLACEON + THREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH tmhmm --d \
         --short $F  > tmhmm_$PART.out 2>tmhmm_$PART.err &
        ((PLACEON= PLACEON + THREADS))
        ((PART=PART + 1))
      done
      wait
      ls -l
      cat tmhmm_*.out > Trinity_prot_tmhmm_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja


      **********************************************************************
      ******* PLACE RESULTS INTO SQLITE DATABASE AND GENERATE REPORT *******
      **********************************************************************


      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=00:30:00
      #PBS -j oe
      #PBS -q debug
      #PBS -N Trinotate_database
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module load trinity/r2013-08-14
      module load sqlite
      module unload perl
      module load perl/5.16.0
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_AnnotationDatabase
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      TRINOTATE_blastp=/brashear/$USER/Trinotate_blastp/Trinity_protein_blastp_all.out
      TRINOTATE_blastx=/brashear/$USER/Trinotate_blastx/Trinity_na_blastx_all.out
      TRINOTATE_pfam=/brashear/$USER/Trinotate_PFAM/Trinity_protein_pfam_all.out
      TRINOTATE_signalp=/brashear/$USER/Trinotate_signalp/Trinity_prot_signalp_all.out
      TRINOTATE_tmhmm=/brashear/$USER/Trinotate_tmhmm/Trinity_prot_tmhmm_all.out
      TRINOTATE_rnammer=/brashear/$USER/Trinotate_RNAmmer/Trinity.fasta.rnammer.gff
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # STEP A: Generate Gene/Transcript relationships
      #
      $TRINITY_HOME/util/get_Trinity_gene_to_trans_map.pl $TRINITY_NA \
          >  Trinity.fasta.gene_trans_map
      #
      # STEP B: Initialize SQLITE Database
      #
      cp $TRDB_SQLITE Trinotate.sqlite
      Trinotate.pl init --gene_trans_map Trinity.fasta.gene_trans_map \
          --transcript_fasta $TRINITY_NA --transdecoder_pep $TRINITY_PROT
      #
      # STEP C: Load Blast Results
      #
      # Old version (trinity/r2013-02-25)
      #Trinotate.pl LOAD_blast $TRINOTATE_blastp
      # New Version (trinity/r2013-08-14)
      Trinotate.pl LOAD_blastp $TRINOTATE_blastp
      Trinotate.pl LOAD_blastx $TRINOTATE_blastx
      #
      # STEP D: Load PFAM Results
      #
      Trinotate.pl LOAD_pfam $TRINOTATE_pfam
      #
      # STEP E: Load tmhmm Results
      #
      Trinotate.pl LOAD_tmhmm $TRINOTATE_tmhmm
      #
      # STEP F: Load signalp Results
      #
      Trinotate.pl LOAD_signalp $TRINOTATE_signalp
      #
      # STEP G: Load RNAmmer Results
      #
      Trinotate.pl LOAD_rnammer $TRINOTATE_rnammer
      #
      # STEP H: Generate annotation report
      #
      Trinotate.pl report > trinotate_annotation_report.xls
      #
      ja -set $SCRATCH/$$.ja
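
      The database job above reads result files written by six earlier
      jobs, so a missing or empty file usually means an upstream job
      failed. A small pre-flight check, sketched below, can stop the job
      before anything is loaded. The function name require_nonempty is
      hypothetical. Note the assumption that an empty result file is an
      error; a transcriptome with no hits for a given tool would also
      trip this check.

```shell
# Return non-zero if any of the named files is missing or empty,
# naming each problem file on standard error.
require_nonempty() {
    local f status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}

# Usage inside the job, before STEP A (commented out):
# require_nonempty "$TRINOTATE_blastp" "$TRINOTATE_blastx" "$TRINOTATE_pfam" \
#     "$TRINOTATE_signalp" "$TRINOTATE_tmhmm" "$TRINOTATE_rnammer" || exit 1
```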

 

Tabix

 

Tabix is a generic indexer for TAB-delimited genome position files.

 

Installed on blacklight

 

Other resources

Website: http://samtools.sourceforge.net/tabix.html

 

Running Tabix

1. Create a batch job which

     1. Sets up the use of the module command in a batch job

     2. Loads the tabix module

           module load tabix

     3. Includes other commands to run Tabix

2. Submit the batch job with the qsub command

VCFtools

 

VCFtools is a tool which provides easily accessible methods for working with complex genetic variation data in the form of VCF files.

 

Installed on blacklight

 

Other resources

Website: http://vcftools.sourceforge.net/

 

Running VCFtools

1. Create a batch job which

     1. Sets up the use of the module command in a batch job

     2. Loads the vcftools module

              module load vcftools

     3. Includes the other commands to run VCFtools

2. Submit the batch job with the qsub command

 

Fastqc

 

Fastqc is a quality control tool for high throughput sequence data.

 

Installed on blacklight

 

Other resources

Website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

 

Running fastqc

1.  Create a batch job which

      1. Sets up the use of the module command in a batch job

      2. Loads the fastqc module

                module load fastqc

      3. Includes other commands to run fastqc

2. Submit the batch job with the qsub command
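
The Tabix, VCFtools and Fastqc instructions above all follow the same pattern: set up the module command, load one module, run the tool, submit with qsub. The sketch below writes such a batch script to a file. It is an illustration under stated assumptions, not a PSC-provided tool: the function name write_module_job is hypothetical, the PBS resource lines are copied from the BLAST examples earlier in this document and may need adjusting, PBS_O_WORKDIR is assumed to be set by PBS to the directory from which qsub was run, and the input file name is a placeholder.

```shell
# Write a minimal PBS batch script that loads one module and runs one command.
# Arguments: output script file, module name, then the command line to run.
write_module_job() {
    local script=$1 module=$2
    shift 2
    cat > "$script" <<EOF
#!/bin/bash
#PBS -l ncpus=16
#PBS -l walltime=00:30:00
#PBS -j oe
#PBS -q batch
#PBS -N ${module}_job
#
source /usr/share/modules/init/bash
module load $module
cd \$PBS_O_WORKDIR
$*
EOF
}

# Example (commented out): index a bgzip-compressed VCF file with tabix.
# write_module_job tabix_job.pbs tabix tabix -p vcf variants.vcf.gz
# qsub tabix_job.pbs
```

The same function can be called with vcftools or fastqc as the module name and the corresponding command line as the remaining arguments.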