READSEQ Documentation Provided by the Author

READSEQ - Reformat nucleic/ protein sequence files into another format

Copyright 1990 by D.G.Gilbert
Biology dept., Indiana University, Bloomington, In 47405
e-mail: gilbertd@bio.indiana.edu

READSEQ will read in formatted sequence files and convert the sequence information in the files into another file that has a different format. Readseq is particularly useful as it automatically detects many sequence formats, and interconverts among them.

Formats which readseq currently understands:

* IG/Stanford, used by Intelliigenetics and others
* GenBank/GB, genbank flatfile format
* NBRF format
* EMBL, EMBL flatfile format
* GCG, single sequence format of GCG software
* DNAStrider, for common Mac program
* Fitch format, limited use
* Pearson/Fasta, a common format used by Fasta programs and others
* Zuker format, limited use. Input only.
* Olsen, format printed by Olsen VMS sequence editor. Input only.
* Phylip3.2, sequential format for Phylip programs
* Phylip, interleaved format for Phylip programs (v3.3, v3.4)
* Plain/Raw, sequence data only (no name, document, numbering)
+ MSF multi sequence format used by GCG software
+ PAUP's multiple sequence (NEXUS) format
+ PIR/CODATA format used by PIR
+ ASN.1 format used by NCBI
+ Pretty print with various options for nice looking output. Output only.

See the included "Formats" file for detail on file formats.

USAGE:

The brief usage of readseq is as follows. The " [ ] " denote optional parts of the syntax. Usage:

readseq [-options] in.seq > out.seq

Options:

a[ll]         			select All sequences
-c[aselower]   			change to lower case
-C[ASEUPPER]   	        	change to UPPER CASE
-degap[=-]     			remove gap symbols
-i[tem=2,3,4]  			select Item number(s) from several
-l[ist]        			List sequences only
-o[utput=]out.seq  		redirect Output
-p[ipe]        			Pipe (command line, stdout)
-r[everse]     			change to Reverse-complement
-v[erbose]     			Verbose progress
-f[ormat=]#    			Format number for output,  or
-f[ormat=]Name 		        Format name for output:

       	1. IG/Stanford         	10. Olsen (in-only)
       	2. GenBank/GB     	11. Phylip3.2
       	3. NBRF                  	12. Phylip
       	4. EMBL                  	13. Plain/Raw
       	5. GCG                   	14. PIR/CODATA
       	6. DNAStrider            	15. MSF
       	7. Fitch                 	16. ASN.1
       	8. Pearson/Fasta    	17. PAUP
       	9. Zuker                 	18. Pretty (out-only)

Pretty format options:

-wid[th]=#            		sequence line width
-tab=#                		left indent
-col[space]=#         		column space within sequence line
                                  on output
-gap[count]           		count gap chars in sequence numbers
-nameleft, -nameright[=#]   	name on left/right 
                                  side [=max width]
-nametop              		name at top/bottom
-numleft, -numright   		seq index on left/right side
-numtop, -numbot      		index on top/bottom
-match[=.]            		use match base for 2..n species
-inter[line=#]       		blank line(s) between sequence 
                                   blocks

Examples:

To use interactive prompting:

readseq

Convert all of two input files to one GenBank format output file:

readseq my1st.seq my 2nd.seq -all -format=genbank -output=my.gb

Output to standard output a file in a pretty format:

readseq my.seq -all -form=pretty -nameleft=3 -numleft=3 -numleft \ 
   -numright  -numtop -match

Select 4 items from input, degap, reverse, and uppercase:

readseq my.seq -item=9, 8, 3, 2 -degap -CASE -rev -f=msf -out=my.rev

In use, readseq will respond to command line arguments, or to interactive use. Command line arguments cannot be combined but must each follow a switch character (-). In this release, the command line options are now words, with an equals (=) to separate parameter(s) from he command. You cannot put a space between a command and its parameter, as is usual for Unix programs (this is to preserve compatibility with VMS). The command line syntax of the earlier versions is still supported.

See the file Formats for details of the sequence formats which are supported by readseq. The auto-detection feature of readseq which distinguishes these formats looks for some of the unique keywords and symbols that are found in each format. It is not infallible at this, though it attempts to exclude unknown formats. In general, if you feed to readseq a sequence file that you know is one of these common formats, you are okay. If you feed it data that might be oddball formats, or non-sequence data, you might well get garbage results. Also, different developers are always thinking up minor twists on these common formats (like PAUP requiring a blank line between blocks of Phylip format, or IG adding form feeds between sequences), which may cause hassles.

In general, output supports only minimal subsets of each format needed for sequence data exchanges. Features, descriptions and other format-unique information is discarded.

The pretty format requires additional options to generate a nice output. Try the various pretty options to see what you like. Pretty format is OUTPUT only, readseq cannot read a Pretty format file.

Readseq is NOT optimized for LARGE files. It generally makes several reads thru each input file (one per sequence output at present, future version may optimize this). It should handle input and output files and sequences of any size, but will slow down quite a bit for very large (multi megabyte) sized files. It is NOT recommended for converting data banks or large subsets there-of. It is primarily directed at the small files that researchers use to maintain their personal data, which they frequently need to interconvert for the various analysis programs which so frequently require a special format.

Users of Olsen multi sequence editor (VMS). The Olsen format here is produced with the print command:

print/out=some.file

Use Genbank output from readseq to produce a format that this editor can read, and use the command

load/genbank some.file

Dan Davison has a VMS program that will convert to/from the Olsen native binary data format. E-mail davison@uh.edu

Warning: Phylip format input is now supported (30Dec92), however the auto-detection of Phylip format is very probabilistic and messy, especially distinguishing sequential from interleaved versions. It is not recommended that one use readseq to convert files from Phylip format to others unless essential.

This program is available thru Internet gopher, as

    gopher ftp.bio.indiana.edu
    browse into the IUBio-Software+Data/molbio/readseq/ folder
    select the readseq.shar document

Or thru anonymous FTP in this manner:

my_computer> ftp  ftp.bio.indiana.edu  (or IP address 129.79.224.25)
 username:  anonymous
 password:  my_username@my_computer
ftp> cd molbio/readseq
ftp> get readseq.shar
ftp> bye

Readseq.shar is a Unix shell archive of the readseq files. This file can be edited by any text editor to reconstitute the original files, for those who do not have a Unix system or an Unshar program. Read the top of this .shar file for further instructions.

There are also pre-compiled executables for the following computers: Silicon Graphics Iris, Sparc (Sun Sparcstation & clones), VMS-Vax, Macintosh. Use binary ftp to transfer these, except Macintosh. The Mac version is just the command-line program in a window, not very handy.

C source files:

readseq.c, ureadseq.c, ureadasn.c, ureadseq.h

Document files:

  • Readme (this doc)
  • Formats (description of sequence file formats)
  • add.gdemenu (GDE program users can add this to the .GDEmenu file)
  • Stdfiles -- test sequence files
  • Makefile -- Unix make file
  • Make.com -- VMS make file
  • *.std -- files for testing validity of readseq

Recent changes

(see also readseq.c for all history of changes):

4 May 92

+ added 32 bit CRC checksum as alternative to GCG 6.5bit checksum
    Aug. 92
= fixed Olsen format input to handle files w/ more sequences,
    not to mess up when more than one seq has same identifier,
    and to convert number masks to symbols.
= IG format fix to understand ^L
30 Dec. 92
* revised command-line & interactive interface.  
  Suggested form is now
    readseq infile -format=genbank -output=outfile -item=1,3,4 ...
  but remains compatible with prior command lines:
    readseq infile -f2 -outfile -i3 ...
+ added GCG MSF multi sequence file format
+ added PIR/CODATA format
+ added NCBI ASN.1 sequence file format
+ added Pretty, multi sequence pretty output (only)
+ added PAUP multi seq format
+ added degap option
+ added Gary Williams (GWW, G.Williams@CRC.AC.UK) 
    reverse-complement option.
+ added support for reading Phylip formats 
    (interleave & sequential)
* string fixes, dropped need for compiler flags NOSTR, FIXTOUPPER, 
    NEEDSTRCASECMP
* changed 32bit checksum to default, -DSMALLCHECKSUM for 
    GCG version