Secondary structure

I use one-letter codes to indicate secondary structures. Secondary structure strings are aligned to sequence blocks just like additional sequences.

For RNA secondary structure, the symbols > and < are used for base pairs: > marks the 5' partner and < the 3' partner, so that pairs point at each other. + indicates definitely single-stranded positions, and any gap symbol indicates unassigned bases or single-stranded positions. This description roughly follows [9]. For protein secondary structure, I use E to indicate residues in β-sheet, H for those in α-helix, L for those in loops, and any gap symbol for unassigned or unstructured residues.
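As an illustration of how these codes might be read by a program, here is a minimal Python sketch. It is not part of any SELEX tool; the function names rna_pairs and protein_segments are hypothetical. It assumes that > and < nest like parentheses, and it treats every symbol other than > and < (including + and gap symbols) as carrying no pairing information.

```python
from itertools import groupby


def rna_pairs(ss):
    """Return sorted (i, j) pairs of 0-based columns paired by '>' and '<'.

    Assumption: '>' opens a pair and '<' closes the most recently opened
    one, so pairs nest like parentheses. All other symbols are skipped.
    """
    stack = []
    pairs = []
    for col, sym in enumerate(ss):
        if sym == ">":
            stack.append(col)
        elif sym == "<":
            if not stack:
                raise ValueError(f"unmatched '<' at column {col}")
            pairs.append((stack.pop(), col))
    if stack:
        raise ValueError(f"unmatched '>' at column {stack[-1]}")
    return sorted(pairs)


def protein_segments(ss):
    """Return (symbol, start, end) runs of E, H and L; end is exclusive.

    Gap symbols and any other characters are skipped, but still count
    toward the column positions.
    """
    segments = []
    col = 0
    for sym, run in groupby(ss):
        length = len(list(run))
        if sym in ("E", "H", "L"):
            segments.append((sym, col, col + length))
        col += length
    return segments
```

For example, rna_pairs(">>>+++<<<") pairs column 0 with column 8, column 1 with column 7, and column 2 with column 6.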

RNA pseudoknots are represented by alphabetic characters, with upper case letters representing the 5' side of the helix and lower case letters representing the 3' side. Note that this restricts the annotation to a maximum of 26 pseudoknots per sequence.
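The letter convention can be sketched in Python as follows. This is only a sketch under stated assumptions: the function name pseudoknot_pairs is hypothetical, and I assume that within one letter the positions nest, so the first upper case A pairs with the last lower case a. The text above does not spell this out. The function is meant for RNA structure strings only, since protein strings use E, H, and L with a different meaning.

```python
def pseudoknot_pairs(ss):
    """Return sorted (i, j) pairs of 0-based columns joined by letter codes.

    Upper case letters mark the 5' side of a helix and the matching lower
    case letters mark the 3' side. Assumption: within one letter the pairs
    nest, so each lower case letter closes the most recent unclosed upper
    case letter of the same kind.
    """
    open_cols = {}  # upper case letter -> stack of unclosed 5' columns
    pairs = []
    for col, sym in enumerate(ss):
        if not (sym.isascii() and sym.isalpha()):
            continue  # '>', '<', '+' and gap symbols are handled elsewhere
        if sym.isupper():
            open_cols.setdefault(sym, []).append(col)
        else:
            stack = open_cols.get(sym.upper())
            if not stack:
                raise ValueError(f"unmatched '{sym}' at column {col}")
            pairs.append((stack.pop(), col))
    for letter, stack in open_cols.items():
        if stack:
            raise ValueError(f"unmatched '{letter}' at column {stack[-1]}")
    return sorted(pairs)
```

With the string ">>+AA<<+aa", the A at column 3 pairs with the a at column 9, and the A at column 4 pairs with the a at column 8. Because each helix takes one letter of the alphabet, the 26-letter limit follows directly.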

Lines beginning with #=SS or #=CS are individual or consensus secondary structure data, respectively. #=SS individual secondary structure lines must immediately follow the sequence they are associated with. There can only be one #=SS per sequence. All, some, or none of the sequences may have #=SS annotation.

#=CS consensus secondary structure predictions precede all the sequences in each block. There can only be one #=CS per file.
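The placement rules for #=SS and #=CS lines can be checked with a short Python sketch. It is an illustration and not a full SELEX parser: the function name parse_block is hypothetical, and it handles a single block. It also assumes that whitespace separates the name or tag from the aligned text, that the aligned text itself contains no spaces, and that any other line starting with # between a sequence and its #=SS line breaks the "immediately follow" rule.

```python
def parse_block(lines):
    """Parse one SELEX block and return (consensus, records).

    consensus is the #=CS string or None. records is a list of dicts with
    keys 'name', 'seq' and 'ss'; 'ss' is None for unannotated sequences.
    """
    consensus = None
    records = []
    ss_allowed = False  # True only directly after a sequence line
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        fields = raw.split(None, 1)
        tag = fields[0]
        text = fields[1].rstrip() if len(fields) > 1 else ""
        if tag == "#=CS":
            if consensus is not None:
                raise ValueError("more than one #=CS line")
            if records:
                raise ValueError("#=CS must precede all sequences")
            consensus = text
        elif tag == "#=SS":
            if not ss_allowed:
                raise ValueError("#=SS must immediately follow a sequence")
            records[-1]["ss"] = text
            ss_allowed = False  # enforces one #=SS per sequence
        elif tag.startswith("#"):
            ss_allowed = False  # other annotation or comment (assumed strict)
        else:
            records.append({"name": tag, "seq": text, "ss": None})
            ss_allowed = True
    return consensus, records
```

A block with a #=CS line, then seq1 followed by its #=SS line, then an unannotated seq2, parses into one consensus string and two records, the second with 'ss' set to None. A second #=SS line after the same sequence raises an error.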



Sean Eddy
Mon Apr 17 09:54:19 CDT 1995