Pittsburgh Supercomputing Center 

Advancing the state-of-the-art in high-performance computing,
communications and data analytics.

Trinotate

   Trinotate is an annotation method suitable for computationally assembled
   transcripts from RNA-Seq data.

Installed on blacklight

Other resources that may be helpful include:

    Website: http://trinotate.sourceforge.net/


Running Trinotate

1) Make Trinotate programs available for use

   The Trinotate process relies on a number of underlying programs and
   sequence databases. To make all of them available, use the
   following module command:

   module load trinotate

   This module will load a number of modules including: trinotate_db,
   ncbi-blast, signalp, tmhmm, hmmer, RNAmmer.   
 

2) General Usage:

   The general Trinotate process is as follows. With the transcripts:

   a) Run blastx with the transcripts against uniprot-swissprot
   b) Run RNAmmer with the transcripts
   c) Generate conceptual protein translations of the transcripts
      1) Run blastp with the conceptual protein translations against
         uniprot-swissprot
      2) Run hmmscan with the conceptual protein translations against
         PFAM
      3) Run signalp with the conceptual protein translations
      4) Run tmhmm with the conceptual protein translations
   d) When the above runs are done, load the results into a pre-populated
      SQLite database that contains annotation information (including GO
      terms) linked to the uniprot-swissprot and PFAM identifiers.
   e) Query the database and generate a report that can be viewed in
      spreadsheet software.

      In general, the blastx and blastp steps take the most time.

3) PBS Examples

   a) Below is a simple example that runs most of the Trinotate process in
      one PBS job. If you have many transcripts (> 10,000), we do not
      recommend running the entire process in one batch job in the way
      illustrated below.   

      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate
      #
      source /usr/share/modules/init/bash
      module load java
      module load trinity
      module load trinotate
      module load sqlite
      set -x
      ja $SCRATCH/$$.ja
      #
      # Sample Data
      #
      OUTDIR=/brashear/$USER/Trinotate
      PROT=/brashear/$USER/Protein/best_candidates.eclipsed_orfs_removed.pep
      NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Here we start with JUST transcripts and protein coding regions in
      # those transcripts.
      #
      # STEP 1: Take protein file and run BLAST against SWISSPROT:
      #
      blastp -query $PROT -db $TRDB_SPROT -num_threads 16 -max_target_seqs 1 \
       -outfmt 6 > TrinotateBlast.out
      #
      # STEP 2: Take protein file and run hmmscan against PFAM-A:
      #
      hmmscan --cpu 16 --domtblout TrinotatePFAM.out $TRDB_PFAM $PROT > pfam.log
      #
      # STEP 3: Run signalp to predict signal peptides:
      #
      signalp -f short -n signalp.out $PROT
      #
      # STEP 4: Run tmhmm to predict transmembrane regions:
      #
      tmhmm --short < $PROT > tmhmm.out
      #
      # STEP 5: Take nucleic acid file and run BLAST against SWISSPROT:
      #
      blastx \
         -query $NA -num_threads 16 -db $TRDB_SPROT -outfmt 6 \
         -max_target_seqs 1 \
         -out TrinotateBlastx.out
      #
      # STEP 6: Run RNAmmer to identify rRNA transcripts:
      #
      $TRINOTATE_HOME/util/rnammer_support/RnammerTranscriptome.pl \
       --transcriptome $NA --path_to_rnammer $RNAMMER_HOME/rnammer \
       >& RnammerTranscriptome.log
      #
      # STEP 7: Generate Gene/Transcript relationships
      #
      $TRINITY_HOME/util/get_Trinity_gene_to_trans_map.pl $NA \
          >  Trinity.fasta.gene_trans_map
      #
      # STEP 8: Initialize SQLITE Database
      #
      cp $TRDB_SQLITE Trinotate.sqlite
      Trinotate.pl init --gene_trans_map Trinity.fasta.gene_trans_map \
         --transcript_fasta $NA --transdecoder_pep $PROT
      #
      # STEP 9: Load Blast Results
      #
      # (older Trinotate versions name this command LOAD_blast)
      Trinotate.pl LOAD_blastp TrinotateBlast.out
      #
      # STEP 10: Load PFAM Results
      #
      Trinotate.pl LOAD_pfam TrinotatePFAM.out
      #
      # STEP 11: Load tmhmm Results
      #
      Trinotate.pl LOAD_tmhmm tmhmm.out
      #
      # STEP 12: Load signalp Results
      #
      Trinotate.pl LOAD_signalp signalp.out
      #
      # STEP 13: Load rnammer Results
      #
      Trinotate.pl LOAD_rnammer Trinity.fasta.rnammer.gff
      #
      # STEP 14: Load blastx Results
      #
      Trinotate.pl LOAD_blastx TrinotateBlastx.out
      #
      # STEP 15: Generate annotation report
      #
      Trinotate.pl report > trinotate_annotation_report.xls
      #
      ja -set $SCRATCH/$$.ja
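
      To choose between the single-job script above and separate jobs per
      step, it can help to count the transcripts first. This is a minimal
      sketch, assuming one FASTA header line (starting with ">") per
      transcript. The function names are hypothetical, and treating the
      10,000-transcript guideline as a hard cut-off is an assumption made
      only for illustration.

```shell
# Hypothetical helpers, not part of Trinotate or Trinity.

# Count the sequences in a FASTA file by counting its header lines.
count_transcripts() {
    grep -c '^>' "$1" || true   # grep exits non-zero when the count is 0
}

# Suggest a workflow based on the 10,000-transcript guideline.
suggest_workflow() {
    n=$(count_transcripts "$1")
    if [ "$n" -gt 10000 ]; then
        echo "separate jobs"
    else
        echo "single job"
    fi
}
```

      For example: suggest_workflow /brashear/$USER/trinity_out_dir/Trinity.fasta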


   b) Below are examples that run the Trinotate steps as individual PBS jobs.
      This is the recommended method if you have a large number of transcripts.

      **********************
      ******* BLASTX *******
      **********************

      #!/bin/bash
      #PBS -l ncpus=32
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_blastx
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_blastx
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      BLASTRUNS=2    # Number of independent BLAST runs
      BLASTTHREADS=16 # Number of BLAST threads (BLASTTHREADS*BLASTRUNS = NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run blast independently
      #
      cp $TRINITY_NA Trinity_na.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence \
           $BLASTRUNS Trinity_na.fasta
      ls -l
      #
      # TRINOTATE: Take transcript file and run BLASTX against SWISSPROT:
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_na.part*
      do
        ((PLACETHROUGH= PLACEON + BLASTTHREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH blastx \
          -query $F -num_threads $BLASTTHREADS -db $TRDB_SPROT -outfmt 6 \
          -max_target_seqs 1 \
          -out blastx_$PART.out \
          > blastx_$PART.log 2>&1 &
        ((PLACEON= PLACEON + BLASTTHREADS))
        ((PART=PART + 1))
      done
      wait
      ls -l
      cat blastx_*.out > Trinity_na_blastx_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja
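
      The loop above pins each background run to its own contiguous block
      of cores: the first run gets cores 0 through BLASTTHREADS-1, the
      second run the next BLASTTHREADS cores, and so on. The sketch below
      only prints the ranges that would be passed to "dplace -c", so the
      arithmetic can be checked before a job is submitted. The helper name
      cpu_ranges is hypothetical; it is not part of dplace or Trinotate.

```shell
# Hypothetical helper: print the core range for each of RUNS independent
# runs of THREADS threads, using the same arithmetic as the
# PLACEON/PLACETHROUGH loop in the job script.
cpu_ranges() {
    runs=$1
    threads=$2
    placeon=0
    part=1
    while [ "$part" -le "$runs" ]; do
        placethrough=$((placeon + threads - 1))
        echo "part $part: cores $placeon-$placethrough"
        placeon=$((placeon + threads))
        part=$((part + 1))
    done
}

cpu_ranges 2 16
```

      With BLASTRUNS=2 and BLASTTHREADS=16 this prints "part 1: cores 0-15"
      and "part 2: cores 16-31", which together cover the 32 cores requested
      with ncpus=32.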

      ***********************
      ******* RNAMMER *******
      ***********************

      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_rnammer
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load perl/5.12.5-threads
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_RNAmmer
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      ls -l
      #
      # TRINOTATE: Take na file and run rnammer
      #
      date
      $TRINOTATE_HOME/util/rnammer_support/RnammerTranscriptome.pl \
        --transcriptome $TRINITY_NA --path_to_rnammer $RNAMMER_HOME/rnammer \
        >& RnammerTranscriptome.log
      date
      ja -set $SCRATCH/$$.ja


      **********************
      ******* BLASTP *******
      **********************


      #!/bin/bash
      #PBS -l ncpus=32
      #PBS -l walltime=96:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_blastp
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_blastp
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      BLASTRUNS=2    # Number of independent BLAST runs
      BLASTTHREADS=16 # Number of BLAST threads (BLASTTHREADS*BLASTRUNS=NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run blast independently
      #
      cp $TRINITY_PROT Trinity_protein.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence \
           $BLASTRUNS Trinity_protein.fasta
      ls -l
      #
      # TRINOTATE: Take protein file and run BLAST against SWISSPROT:
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_protein.part*
      do
        ((PLACETHROUGH= PLACEON + BLASTTHREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH blastp \
          -query $F -num_threads $BLASTTHREADS -db $TRDB_SPROT -outfmt 6 \
          -max_target_seqs 1 \
          -out blastp_$PART.out \
          > blastp_$PART.log 2>&1 &
        ((PLACEON= PLACEON + BLASTTHREADS))
        ((PART=PART + 1))
      done
      wait
      cat blastp_*.out > Trinity_protein_blastp_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja


      ********************
      ******* PFAM *******
      ********************


      #!/bin/bash
      #PBS -l ncpus=32
      #PBS -l walltime=96:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_pfam
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_PFAM
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      HMMERRUNS=2    # Number of independent hmmer runs
      HMMERTHREADS=16 # Number of hmmer threads (HMMERTHREADS*HMMERRUNS=NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run hmmscan independently
      #
      cp $TRINITY_PROT Trinity_protein.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence \
           $HMMERRUNS Trinity_protein.fasta
      ls -l
      #
      # TRINOTATE: Take protein file and run hmmscan against PFAM:
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_protein.part*
      do
        ((PLACETHROUGH= PLACEON + HMMERTHREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH hmmscan \
          --cpu $HMMERTHREADS --domtblout PFAM_$PART.out \
          $TRDB_PFAM $F > pfam_$PART.log 2>&1 &
        ((PLACEON= PLACEON + HMMERTHREADS))
        ((PART=PART + 1))
      done
      wait
      ls -l
      cat PFAM_*.out > Trinity_protein_pfam_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja


      ***********************
      ******* SIGNALP *******
      ***********************


      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_signalp
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_signalp
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      RUNS=16    # Number of independent runs
      THREADS=1  # Number of Threads ( THREADS * RUNS = NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run independently
      #
      cp $TRINITY_PROT Trinity_prot.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence \
           $RUNS Trinity_prot.fasta
      ls -l
      #
      # TRINOTATE: Take protein file and run signalp
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_prot.part*
      do
        ((PLACETHROUGH= PLACEON + THREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH signalp \
          -f short -n signalp_$PART.out $F \
          > signalp_$PART.log 2>&1 &
        ((PLACEON= PLACEON + THREADS))
        ((PART=PART + 1))
      done
      wait
      ls -l
      cat signalp_*.out > Trinity_prot_signalp_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja


      *********************
      ******* TMHMM *******
      *********************


      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=24:00:00
      #PBS -j oe
      #PBS -q batch
      #PBS -N Trinotate_tmhmm
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module unload perl
      module load fasta_splitter
      module unload perl
      module load perl/5.12.3
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_tmhmm
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      #
      # parallel parameters
      #
      RUNS=16    # Number of independent runs
      THREADS=1 # Number of Threads ( THREADS * RUNS = NCPUS)
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # Split file into sections so we can run independently
      #
      cp $TRINITY_PROT Trinity_prot.fasta
      perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence \
           $RUNS Trinity_prot.fasta
      ls -l
      #
      # TRINOTATE: Take protein file and run tmhmm
      #
      date
      PLACEON=0
      PART=1
      for F in Trinity_prot.part*
      do
        ((PLACETHROUGH= PLACEON + THREADS - 1))
        dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH tmhmm --d \
         --short $F  > tmhmm_$PART.out 2>tmhmm_$PART.err &
        ((PLACEON= PLACEON + THREADS))
        ((PART=PART + 1))
      done
      wait
      ls -l
      cat tmhmm_*.out > Trinity_prot_tmhmm_all.out
      head -100 dplacelog*
      date
      ja -set $SCRATCH/$$.ja
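
      A bare "wait", as used in the loops above, returns success even if
      some of the background runs failed, so the "cat" step can end up
      combining incomplete results. One possible safeguard is to record
      each process ID and wait on them one at a time. The sketch below uses
      placeholder commands rather than the real dplace invocations, and
      run_all is a hypothetical helper.

```shell
# Hypothetical helper: start every command string given as an argument in
# the background, then wait on each process ID in turn and print how many
# of them exited with a non-zero status.
run_all() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}
```

      In a job script, the same pattern would append $! to the list after
      each backgrounded dplace line, and skip the concatenation step if the
      failure count is not zero.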


      **********************************************************************
      ******* PLACE RESULTS INTO SQLITE DATABASE AND GENERATE REPORT *******
      **********************************************************************


      #!/bin/bash
      #PBS -l ncpus=16
      #PBS -l walltime=00:30:00
      #PBS -j oe
      #PBS -q debug
      #PBS -N Trinotate_database
      #
      source /usr/share/modules/init/bash
      module load trinotate
      module load trinity/r2013-08-14
      module load sqlite
      module unload perl
      module load perl/5.16.0
      set -x
      ja $SCRATCH/$$.ja
      #
      # Data
      #
      OUTDIR=/brashear/$USER/Trinotate_AnnotationDatabase
      TRINITY_NA=/brashear/$USER/trinity_out_dir/Trinity.fasta
      TRINITY_PROT=/brashear/$USER/Proteins/best_candidates.eclipsed_orfs_removed.pep
      TRINOTATE_blastp=/brashear/$USER/Trinotate_blastp/Trinity_protein_blastp_all.out
      TRINOTATE_blastx=/brashear/$USER/Trinotate_blastx/Trinity_na_blastx_all.out
      TRINOTATE_pfam=/brashear/$USER/Trinotate_PFAM/Trinity_protein_pfam_all.out
      TRINOTATE_signalp=/brashear/$USER/Trinotate_signalp/Trinity_prot_signalp_all.out
      TRINOTATE_tmhmm=/brashear/$USER/Trinotate_tmhmm/Trinity_prot_tmhmm_all.out
      TRINOTATE_rnammer=/brashear/$USER/Trinotate_RNAmmer/Trinity.fasta.rnammer.gff
      #
      mkdir -p $OUTDIR
      cd $OUTDIR
      #
      # STEP A: Generate Gene/Transcript relationships
      #
      $TRINITY_HOME/util/get_Trinity_gene_to_trans_map.pl $TRINITY_NA \
          >  Trinity.fasta.gene_trans_map
      #
      # STEP B: Initialize SQLITE Database
      #
      cp $TRDB_SQLITE Trinotate.sqlite
      Trinotate.pl init --gene_trans_map Trinity.fasta.gene_trans_map \
          --transcript_fasta $TRINITY_NA --transdecoder_pep $TRINITY_PROT
      #
      # STEP C: Load Blast Results
      #
      # Old version (trinity/r2013-02-25)
      #Trinotate.pl LOAD_blast $TRINOTATE_blastp
      # New Version (trinity/r2013-08-14)
      Trinotate.pl LOAD_blastp $TRINOTATE_blastp
      Trinotate.pl LOAD_blastx $TRINOTATE_blastx
      #
      # STEP D: Load PFAM Results
      #
      Trinotate.pl LOAD_pfam $TRINOTATE_pfam
      #
      # STEP E: Load tmhmm Results
      #
      Trinotate.pl LOAD_tmhmm $TRINOTATE_tmhmm
      #
      # STEP F: Load signalp Results
      #
      Trinotate.pl LOAD_signalp $TRINOTATE_signalp
      #
      # STEP G: Load RNAmmer Results
      #
      Trinotate.pl LOAD_rnammer $TRINOTATE_rnammer
      #
      # STEP H: Generate annotation report
      #
      Trinotate.pl report > trinotate_annotation_report.xls
      #
      ja -set $SCRATCH/$$.ja
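
      Because the database job depends on the output of six earlier jobs,
      it may be worth confirming that every result file exists and is not
      empty before loading anything. This is a minimal sketch with a
      hypothetical helper, check_results. An empty file can be legitimate
      for some tools, so treat a warning as a prompt to inspect the earlier
      job's log rather than as proof of failure.

```shell
# Hypothetical helper: report every listed file that is missing or empty.
# Prints one line per problem file and returns non-zero if any were found.
check_results() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            status=1
        fi
    done
    return $status
}
```

      In the job above, this could be called after the Data section, for
      example: check_results $TRINOTATE_blastp $TRINOTATE_blastx
      $TRINOTATE_pfam $TRINOTATE_signalp $TRINOTATE_tmhmm $TRINOTATE_rnammer
      || exit 1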

 

Tabix

 

Tabix is a generic indexer for TAB-delimited genome position files.

 

Installed on blacklight

 

Other resources

Website: http://samtools.sourceforge.net/tabix.html

 

Running Tabix

1. Create a batch job which

     1. Sets up the use of the module command in a batch job

     2. Loads the tabix module

           module load tabix

     3. Includes other commands to run Tabix

2. Submit the batch job with the qsub command
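
The steps above can be sketched as a job script. The snippet below writes a sample script to tabix_job.pbs. The PBS settings mirror the other examples on this page. The input file sample.vcf is hypothetical, and the sketch assumes that the tabix module also provides bgzip.

```shell
# Write a sample PBS job script for Tabix to tabix_job.pbs.
# sample.vcf is a hypothetical input file in the submission directory.
cat > tabix_job.pbs <<'EOF'
#!/bin/bash
#PBS -l ncpus=16
#PBS -l walltime=30:00
#PBS -j oe
#PBS -q batch

set -x

# move to the directory from which the job was submitted
cd $PBS_O_WORKDIR

# set up the module command and load tabix
source /usr/share/modules/init/bash
module load tabix

# compress with bgzip (assumed to come with the tabix module),
# then build the index sample.vcf.gz.tbi
bgzip sample.vcf
tabix -p vcf sample.vcf.gz
EOF
```

Submit the resulting script with: qsub tabix_job.pbs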

VCFtools

 

VCFtools is a tool that provides easily accessible methods for working with complex genetic variation data in the form of VCF files.

 

Installed on blacklight

 

Other resources

Website: http://vcftools.sourceforge.net/

 

Running VCFtools

1. Create a batch job which

     1. Sets up the use of the module command in a batch job

     2. Loads the vcftools module

              module load vcftools

     3. Includes the other commands to run VCFtools

2. Submit the batch job with the qsub command
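
As an illustration of these steps, the snippet below writes a sample job script to vcftools_job.pbs. The PBS settings follow the other examples on this page. The input file input.vcf and the output prefix allele_freq are hypothetical, and the allele-frequency calculation is just one example of a VCFtools analysis.

```shell
# Write a sample PBS job script for VCFtools to vcftools_job.pbs.
# input.vcf and the allele_freq output prefix are hypothetical names.
cat > vcftools_job.pbs <<'EOF'
#!/bin/bash
#PBS -l ncpus=16
#PBS -l walltime=30:00
#PBS -j oe
#PBS -q batch

set -x

# move to the directory from which the job was submitted
cd $PBS_O_WORKDIR

# set up the module command and load vcftools
source /usr/share/modules/init/bash
module load vcftools

# per-site allele frequencies, written to allele_freq.frq
vcftools --vcf input.vcf --freq --out allele_freq
EOF
```

Submit the resulting script with: qsub vcftools_job.pbs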

 

Fastqc

 

Fastqc is a quality control tool for high throughput sequence data.

 

Installed on blacklight

 

Other resources

Website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

 

Running fastqc

1.  Create a batch job which

      1. Sets up the use of the module command in a batch job

      2. Loads the fastqc module

                module load fastqc

      3. Includes other commands to run fastqc

2. Submit the batch job with the qsub command
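
A job script along these lines might look like the following sketch, which writes a sample script to fastqc_job.pbs. The PBS settings follow the other examples on this page. The input file reads.fastq and the output directory fastqc_out are hypothetical.

```shell
# Write a sample PBS job script for fastqc to fastqc_job.pbs.
# reads.fastq and fastqc_out are hypothetical names.
cat > fastqc_job.pbs <<'EOF'
#!/bin/bash
#PBS -l ncpus=16
#PBS -l walltime=30:00
#PBS -j oe
#PBS -q batch

set -x

# move to the directory from which the job was submitted
cd $PBS_O_WORKDIR

# set up the module command and load fastqc
source /usr/share/modules/init/bash
module load fastqc

# fastqc expects the output directory to exist already
mkdir -p fastqc_out
fastqc -o fastqc_out reads.fastq
EOF
```

Submit the resulting script with: qsub fastqc_job.pbs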

 

PSI4

 

PSI4 is an open-source suite of ab initio quantum chemistry programs for efficient, high-accuracy simulation of a variety of molecular properties.

 

Installed on blacklight

 

Other resources

Website: http://www.psicode.org/

 

Running PSI4

 

1. Create a batch job which

     1. Sets up the use of the module command in a batch job

     2. Loads the psi4 module

            module load psi4

     3. Includes other commands to run psi4

2. Submit the batch job with the qsub command

 

Sample Batch Job

 

#!/bin/csh
#PBS -l ncpus=16
#PBS -l walltime=30:00
#PBS -j oe
#PBS -q batch

set echo

# move to the directory from which the job was submitted;
# the input files are there
cd $PBS_O_WORKDIR
echo $PBS_O_WORKDIR

# set up the module command
source /usr/share/modules/init/csh

# load the psi4 module
module load psi4

setenv PSI_SCRATCH $SCRATCH_RAMDISK

#Adjust OMP_NUM_THREADS and MKL_NUM_THREADS according to your problem size
setenv OMP_NUM_THREADS 4
setenv MKL_NUM_THREADS 4
ja
omplace -nt $OMP_NUM_THREADS psi4 -i input.dat -o out.dat
ja -cshrlt

 

Input file for batch job

Output file for batch job