Pittsburgh Supercomputing Center 

Advancing the state-of-the-art in high-performance computing,
communications and data analytics.

FASTA Splitter

 

 

A Perl script that divides a large FASTA file into a set of smaller files of roughly equal size.

It can be used to parallelize tools that are normally serial but repeat the same operation on each sequence in a FASTA file. It can also be used for threaded tools that repeat the same operation on each sequence in a FASTA file but do not scale beyond 16 threads.

This tool was written by Kirill Kryukov in Saitou lab, NIG.

Installed on blacklight.

Other resources that may be helpful include:

Website: http://kirill-kryukov.com/study/tools/fasta-splitter/

Running fasta_splitter

1) Make fasta_splitter available for use
a) blacklight:
The fasta_splitter program is made available through the module command. To load the fasta_splitter module, enter:

module load fasta_splitter

2) General Usage:

perl fasta_splitter.pl method [options] fastafile

Where "method" is one of:

-n-parts-total N - split into N parts of similar total size
-n-parts-sequence N - split into N parts of similar sequence size
-part-total-size N - split into parts of at most N bytes each
-part-sequence-size N - split into parts containing at most N bp each

Options:

-line-length N - output line length, 60 by default
-eol [dos|mac|unix] - end-of-line of the output, unix by default

Example A) to split a fasta file (prot.fa) into four parts (based on sequence lengths):

perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence 4 prot.fa

Example B) to split a fasta file (prot.fa) into three parts (based on total size):

perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-total 3 prot.fa
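After splitting, it can be worth confirming that no sequences were lost. The following is a minimal sketch, not part of fasta_splitter itself: it counts FASTA header lines (lines starting with ">") in the original file and in the parts. The function name count_seqs is hypothetical, and the part-file pattern prot.part* is an assumption based on the naming used in the PBS examples below.

```shell
#!/bin/bash
# count_seqs: print the total number of FASTA records (header lines
# beginning with ">") across all files given as arguments.
# Hypothetical helper, not part of fasta_splitter.
count_seqs() {
    cat -- "$@" | grep -c '^>'
}

# Usage (assumes prot.fa was already split and the parts match prot.part*):
#   [ "$(count_seqs prot.fa)" -eq "$(count_seqs prot.part*)" ] && echo "counts match"
```

If the two counts differ, the split should be rerun before any downstream jobs are submitted.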

3) PBS Examples (blacklight)

a) This example illustrates how to use fasta_splitter to run a serial program (tmhmm) across 16 cores:

#!/bin/bash
#PBS -l ncpus=16
#PBS -l walltime=00:30:00
#PBS -j oe
#PBS -q batch
#PBS -N Trinotate_tmhmm
#
ja $SCRATCH/$$.ja
source /usr/share/modules/init/bash
module load trinotate
module unload perl
module load fasta_splitter
module unload perl
module load perl/5.12.3
set -x
#
# Data
#
OUTDIR=/brashear/$USER/Trinity_Output/Trinotate_tmhmm
TRINITY_PROT=/brashear/$USER/Trinity_Output/Proteins/best_candidates.eclipsed_orfs_removed.pep
#
# parallel parameters
#
RUNS=16 # Number of independent runs
THREADS=1 # Number of Threads ( THREADS * RUNS = NCPUS)
#
mkdir -p $OUTDIR
cd $OUTDIR
#
# Split file into sections so we can run independently
#
cp $TRINITY_PROT Trinity_prot.fasta
perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence $RUNS Trinity_prot.fasta
#
# TRINOTATE: Take protein file and run tmhmm
#
date
PLACEON=0
PART=1
for F in Trinity_prot.part*
do
((PLACETHROUGH= PLACEON + THREADS - 1))
dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH tmhmm --d \
--short $F > tmhmm_$PART.out 2>tmhmm_$PART.err &
((PLACEON= PLACEON + THREADS))
((PART=PART + 1))
done
wait
cat tmhmm_*.out > Trinity_prot_tmhmm_all.out
head -100 dplacelog*
date
ja -set $SCRATCH/$$.ja
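
Note that the shell expands tmhmm_*.out in lexicographic order, so with 16 parts tmhmm_10.out sorts before tmhmm_2.out. If the combined file should follow the part order, the outputs can be concatenated in a numbered loop instead. This is a minimal sketch, assuming the outputs are named tmhmm_1.out through tmhmm_$RUNS.out as in the script above; the function name cat_in_order is hypothetical.

```shell
#!/bin/bash
# cat_in_order PREFIX SUFFIX N: write PREFIX1SUFFIX ... PREFIXNSUFFIX to
# stdout in numeric order, stopping with an error if a file is missing.
# Hypothetical helper for illustration.
cat_in_order() {
    local prefix=$1 suffix=$2 n=$3 i
    for ((i = 1; i <= n; i++)); do
        cat -- "${prefix}${i}${suffix}" || return 1
    done
}

# Usage in the script above (RUNS=16):
#   cat_in_order tmhmm_ .out "$RUNS" > Trinity_prot_tmhmm_all.out
```

A missing output file makes the function return a non-zero status, which is an easy signal that one of the background runs did not produce results.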

b) This example illustrates how to use fasta_splitter to scale blast, using 2 parallel runs at 16 threads each:

#!/bin/bash
#PBS -l ncpus=32
#PBS -l walltime=96:00:00
#PBS -j oe
#PBS -q batch
#PBS -N Trinotate_blastp
#
source /usr/share/modules/init/bash
module load trinotate
module unload perl
module load fasta_splitter
set -x
ja $SCRATCH/$$.ja
#
# Data
#
OUTDIR=/brashear/$USER/Trinity_Output/Trinotate_Blastp
TRINITY_PROT=/brashear/$USER/Trinity_Output/Proteins/best_candidates.eclipsed_orfs_removed.pep
#
# parallel parameters
#
BLASTRUNS=2 # Number of independent BLAST runs
BLASTTHREADS=16 # Number of BLAST Threads ( BLASTTHREADS * BLASTRUNS = NCPUS)
#
mkdir -p $OUTDIR
cd $OUTDIR
#
# Split file into sections so we can run blast independently
#
cp $TRINITY_PROT Trinity_protein.fasta
perl $FASTA_SPLIT_HOME/fasta_splitter.pl -n-parts-sequence $BLASTRUNS Trinity_protein.fasta
ls -l
#
# TRINOTATE: Take protein file and run BLAST against SWISSPROT:
#
date
PLACEON=0
PART=1
for F in Trinity_protein.part*
do
((PLACETHROUGH= PLACEON + BLASTTHREADS - 1))
dplace -o dplacelog$PLACEON -c $PLACEON-$PLACETHROUGH blastp \
-query $F -num_threads $BLASTTHREADS -db $TRDB_SPROT -outfmt 6 \
-max_target_seqs 1 \
-out blastp_$PART.out \
> blastp_$PART.log 2>&1 &
((PLACEON= PLACEON + BLASTTHREADS))
((PART=PART + 1))
done
wait
cat blastp_*.out > Trinity_protein_blastp_all.out
head -100 dplacelog*
date
ja -set $SCRATCH/$$.ja
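
The core-placement arithmetic in both scripts assigns each run a contiguous block of cores: run k (counting from 0) gets cores k*THREADS through (k+1)*THREADS - 1. The sketch below prints those ranges without launching anything, which can help check that the number of runs times the threads per run matches the ncpus request before submitting. The function name cpu_ranges is hypothetical.

```shell
#!/bin/bash
# cpu_ranges RUNS THREADS: print one "first-last" core range per run,
# following the PLACEON/PLACETHROUGH arithmetic used with dplace -c.
# Hypothetical helper for illustration; it does not call dplace.
cpu_ranges() {
    local runs=$1 threads=$2 placeon=0 placethrough r
    for ((r = 0; r < runs; r++)); do
        placethrough=$((placeon + threads - 1))
        echo "${placeon}-${placethrough}"
        placeon=$((placeon + threads))
    done
}

# Example: the blastp job above (2 runs of 16 threads on ncpus=32)
cpu_ranges 2 16
# prints:
# 0-15
# 16-31
```

The last range printed should end at ncpus - 1; if it ends higher, the job requests fewer cores than the runs will try to use.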

 
