RUNNING THE PROGRAM
This README file is provided by the TransTerm suppliers.
You need two input files: one with the genome sequence and one with the gene coordinates. Please note: the program relies on statistical analysis of the whole genome, so you need the whole genome (or a significant part of it) to get correct results.
The first input file has the following format:
>header1
sequence1
>header2
sequence2
...
>last header
last sequence
Each header and sequence pair corresponds to one bacterial chromosome or plasmid. The header has the following format:
something_something_id_something
where id is the identifier of the chromosome or plasmid. For example:
>Borrelia_burgdorferi_3615_main 7/12/99
TAAATATAATTTAATAGTATAAAAAAAATTAAATCAAATTAATAATAGTTTAAAAAACTG
TTTGTATAATATAATATTATTATATATAATATTAAGCAACTACTATGATACTAATGAAGT
ATAGTGCTATTTTATTAATATGTAGCGTTAATTTATTTTGTTTTCAAAATAAATTAACTA
.........sequence of the main chromosome.........
>Borrelia_burgdorferi_4091_cp26(plasmid_B) 7/12/99
TTTAAAACTTTTCTATTGGATAGATTTTATACAAAGAAGGTAATAATGTATAAACAACAA
TATTTTATTTCTGGCAAGGTGCAAGGTGTTGGTTTTAGATTTTTCACAGAGCAAATAGCA
AATAATATGAAACTAAAAGGATTTGTAAAAAATCTCAACGATGGAAGGGTAGAAATTGTA
.........sequence of the cp26 plasmid............
See file GBB.1con (from demo) for a more complete example.
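To check a sequence file before running the program, the format above can be read with a short script. The following is only a sketch in Python, not part of TransTerm. It assumes the id is always the third underscore-separated field of the header, as in the Borrelia example; the function name read_sequences is made up for illustration.

```python
def read_sequences(lines):
    """Return {id: sequence} from lines in the format described above.

    Assumption (not confirmed by TransTerm itself): the id is the third
    "_"-separated field of the first word of each header line.
    """
    records = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            fields = words[0].split("_") if words else []
            if len(fields) < 3:
                raise ValueError("header has no id field: " + line)
            current_id = fields[2]
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines are collected and joined once at the end.
            records[current_id].append(line)
    return {key: "".join(parts) for key, parts in records.items()}
```

For example, passing an open file handle for GBB.1con to read_sequences gives one entry per chromosome or plasmid, which makes it easy to confirm that every id used in the coordinates file is present.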
The second input file has four fields separated by tabs:
gene name <TAB> start <TAB> stop <TAB> id
id is the id of the chromosome or plasmid where the gene is located; start and stop are the gene coordinates. If the gene is located on the "+" strand, then start is less than stop. Otherwise stop is less than start.
Example:
ORF00003	910175	909845	3615
ORF00004	908407	909588	3615
ORFB0002	15107	13896	4091
See file GBB.coords (from demo) for a more complete example.
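The strand rule above (start less than stop means "+", otherwise "-") can be sketched in Python as follows. This is an illustration only, not part of TransTerm; the names Gene and parse_coords are made up, and the sketch assumes every non-empty line has exactly four tab-separated fields.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    stop: int
    replicon_id: str  # id of the chromosome or plasmid
    strand: str       # "+" or "-", derived from start and stop


def parse_coords(lines):
    """Parse tab-separated coordinate lines into Gene records."""
    genes = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        name, start, stop, replicon_id = line.split("\t")
        start, stop = int(start), int(stop)
        # "+" strand genes have start < stop; "-" strand genes the reverse.
        strand = "+" if start < stop else "-"
        genes.append(Gene(name, start, stop, replicon_id.strip(), strand))
    return genes
```

A line with a missing or extra field raises ValueError, which is usually what you want when checking an input file.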
Run the program:
srun perl ../src/TransTerm -s first_input_file \
     -c second_input_file -o output_file -g
Option -g enables searching for terminators that have gaps in the stem. If you don't need them, omit -g; the program will run faster.
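If you start the program from a script, the command line can be assembled as a list of arguments. The sketch below is in Python and is not part of TransTerm. It uses only the options described above (-s, -c, -o, -g) and leaves out the srun prefix, which you can prepend if your system needs it. The function name build_command and the default script path are assumptions.

```python
def build_command(sequence_file, coords_file, output_file, gaps=False,
                  script="../src/TransTerm"):
    """Return the TransTerm command as an argument list."""
    command = ["perl", script,
               "-s", sequence_file,
               "-c", coords_file,
               "-o", output_file]
    if gaps:
        # -g also searches for terminators with a gap in the stem (slower).
        command.append("-g")
    return command
```

The resulting list can be passed to subprocess.run(command, check=True), for example with the demo files GBB.1con and GBB.coords.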
UNDERSTANDING OUTPUT
For an example output file, see GBB.out from the demo.
Look at the output file. Is it empty? If so, it may be because some bacteria don't have rho-independent terminators, or because something has gone wrong. Try decreasing the confidence cutoff:
In the file src/smooth_confidence.perl, change the lines:
$tail_to_tail_prob_cutoff = 90;
$head_to_tail_prob_cutoff = 90;
into:
$tail_to_tail_prob_cutoff = 0;
$head_to_tail_prob_cutoff = 0;
and rerun the program. Is the output still empty?
Check the input files. Did you try running the demo? Not all bacteria have rho-independent transcription terminators, but some sequences look similar to them and will be reported if the confidence cutoff is low.
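If you change the cutoffs often, the edit to src/smooth_confidence.perl can be scripted. The following Python sketch is not part of TransTerm. It assumes the two assignments appear one per line in the form shown above, and the function name set_cutoffs is made up. Keep a backup of the original file before overwriting it.

```python
import re

# Matches only the two cutoff assignments named in this README.
CUTOFF_PATTERN = re.compile(
    r"^(\s*\$(?:tail_to_tail|head_to_tail)_prob_cutoff\s*=\s*)\d+(\s*;)",
    re.MULTILINE,
)


def set_cutoffs(perl_source, value):
    """Return perl_source with both confidence cutoffs set to value."""
    return CUTOFF_PATTERN.sub(
        lambda match: f"{match.group(1)}{value}{match.group(2)}",
        perl_source,
    )
```

Read the file into a string, pass it through set_cutoffs(text, 0), write the result back, and rerun the program.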
Each line of the output file (except for lines that start with *****) corresponds to one terminator found.
The first field is the name of the intergenic region where the terminator is located.
The next two fields are the terminator's coordinates in the genome.
The fourth field is the DNA strand where the terminator is located: -1 means "-", 1 means "+" and 0 means both strands.
The fifth field is the terminator confidence (in %): the probability that the structure found is in fact a transcription terminator. To avoid false positives, we recommend considering only terminators with a confidence of 98% or higher.
The next field is the stem length.
The next field is the position of a gap in the stem. If it is zero, then the hairpin doesn't have a gap.
The last field indicates the chromosome or plasmid in which the terminator is located.
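The field list above can be turned into a small parser. The sketch below is in Python and is not part of TransTerm. It assumes the eight fields are separated by whitespace and that the region name contains no spaces; check this against GBB.out from the demo. The names Terminator and parse_output_line are made up.

```python
from typing import NamedTuple, Optional


class Terminator(NamedTuple):
    region: str          # name of the intergenic region
    start: int           # first coordinate in the genome
    end: int             # second coordinate in the genome
    strand: int          # -1 is "-", 1 is "+", 0 is both strands
    confidence: float    # in %
    stem_length: int
    gap_position: int    # 0 means no gap in the stem
    replicon_id: str     # chromosome or plasmid id


def parse_output_line(line) -> Optional[Terminator]:
    """Parse one output line; return None for ***** lines and blank lines."""
    if line.startswith("*****") or not line.strip():
        return None
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 fields, got %d" % len(fields))
    region, start, end, strand, confidence, stem, gap, replicon_id = fields
    return Terminator(region, int(start), int(end), int(strand),
                      float(confidence.rstrip("%")), int(stem), int(gap),
                      replicon_id)
```

With the parsed records it is simple to keep only terminators at or above the recommended 98% confidence, for example with a list comprehension on the confidence field.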
Look at the fifth column of the output file. Are there many (at least 20-30) lines where the confidence is higher than 90? If not, the given genome does not have a substantial number of rho-independent terminators. In this case the terminators found can only be considered terminator candidates with unknown confidence, because our approach uses statistical analysis to estimate confidence. If the number of terminators is small, the estimate may be incorrect.
Some additional files will be created during the calculation. Ignore them. You can delete them after the calculation is done.
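The check described above can be written as a one-line count. The Python sketch below is not part of TransTerm. It takes the fifth-column values as numbers and uses 20, the lower end of the "20-30" guideline, as the default threshold; the function name and that default are assumptions.

```python
def has_enough_terminators(confidences, cutoff=90, minimum=20):
    """Return True if at least `minimum` confidences are above `cutoff`."""
    high = sum(1 for value in confidences if value > cutoff)
    return high >= minimum
```

If this returns False, treat the reported confidences as unreliable, as explained above.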