MEME Users Manual

Copyright 1994, The Regents of the University of California

If you use MEME or MAST in your research, please cite:

Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, 1994.





MEME -- Multiple EM for Motif Elicitation

MEME is a tool for discovering motifs in a group of related DNA or protein sequences.

A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.

MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width and description for each motif. For each motif MEME discovers, there are several outputs:

  • A summary line showing the width and estimated number of occurrences of the motif in the training set.
  • A simplified letter-probability matrix.
  • A diagram showing the degree of conservation at each motif position.
  • A multilevel consensus sequence showing the most conserved letter(s) at each motif position.
  • A position-dependent scoring matrix for use by the MAST database search program.
  • The motif letter-probability matrix.

The information that MEME prints for each motif it discovers in the training set is described in detail below.

Summary Line

This line gives the width (`width') and expected number of occurrences in the training set (`sites') of the motif. MEME numbers the motifs consecutively from one as it finds them. MEME usually finds the most statistically significant motifs first. Each motif describes a pattern of a fixed width--no gaps are allowed in MEME motifs. MEME estimates the number of places the motif occurs in the training set. This need not be an integer value.

Simplified Motif Letter-probability Matrix

MEME motifs are represented by letter-probability matrices that specify the probability of each possible letter appearing at each possible position in an occurrence of the motif. In order to make it easier to see which letters are most likely in each of the columns of the motif, the simplified motif shows the letter probabilities multiplied by 10 rounded to the nearest integer. Zeros are replaced by ":" (the colon) for readability.
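The simplification rule can be illustrated with a minimal Python sketch. The function name `simplify_row` and the example probabilities are hypothetical, and the way exact ties are rounded is an assumption; this is not MEME's own code.

```python
import math

def simplify_row(probs):
    """Scale probabilities by 10, round, and show zeros as ':'."""
    cells = []
    for p in probs:
        # Round half up; how MEME breaks exact ties is an assumption here.
        digit = math.floor(p * 10 + 0.5)
        cells.append(":" if digit == 0 else str(digit))
    return cells

# Hypothetical probabilities for A, C, G, T at one motif position.
print(" ".join(simplify_row([0.70, 0.04, 0.20, 0.06])))  # prints "7 : 2 1"
```

The colon marks letters whose probability rounds to zero, which makes the dominant letters in each position easier to spot.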

Information Content Diagram

The information content diagram provides an idea of which positions in the motif are most highly conserved. Each column (position) in a motif can be characterized by the amount of information it contains (measured in bits). Highly conserved positions in the motif have high information; positions where all letters are equally likely have low information. The diagram is printed so that each column lines up with the same column in the simplified motif letter-probability matrix above it. Summing the information content for each position in the motif gives the total information content of the motif (shown in parentheses to the left of the diagram). This gives a measure of the usefulness of the motif for database searches. For a motif to be useful for database searches, it must as a rule contain at least log_2(N) bits of information where N is the number of sequences in the database being searched. For example, to effectively search a database containing 100,000 sequences for occurrences of a single motif, the motif should have an IC of at least 16.6 bits. Motifs with lower information content are still useful when a family of sequences shares more than one motif since they can be combined in multiple motif searches (using MAST).
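As a rough illustration of these quantities, the sketch below computes per-position information in bits and the log_2(N) rule of thumb. It assumes information is measured relative to a uniform letter distribution, which matches the statement that equally likely letters carry low information; MEME's exact formula may differ. All function names are hypothetical.

```python
import math

def column_information(probs):
    """Bits of information in one motif position (uniform background assumed)."""
    max_bits = math.log2(len(probs))
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return max_bits - entropy

def motif_information(matrix):
    """Total information content: the sum over all motif positions."""
    return sum(column_information(row) for row in matrix)

def min_bits_for_database(n_sequences):
    """Rule of thumb from the text: a motif needs at least log2(N) bits."""
    return math.log2(n_sequences)

# A perfectly conserved DNA position carries 2 bits; a uniform one carries 0.
toy_motif = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
print(motif_information(toy_motif))
print(round(min_bits_for_database(100000), 1))  # 16.6, as in the example above
```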

Multilevel Consensus Sequence

The multilevel consensus sequence corresponding to the motif is an aid in remembering and understanding the motif. It is calculated from the motif letter-probability matrix as follows. Separately for each column of the motif, the letters in the alphabet are sorted in decreasing order by the probability with which they are expected to occur in that position of motif occurrences. The sorted letters are then printed vertically with the most probable letter on top. Only letters with probabilities of 0.2 or higher at that position in the motif are printed. As an example, the multilevel consensus sequence of motif 2 in the sample output is:

  
      Multilevel       LITGAASGIG
      consensus         V  GS    
      sequence              G    
      

This multilevel consensus sequence says several things about the motif. First, the most likely form of the motif can be read from the top line as LITGAASGIG. Second, that only letter L has probability more than 0.2 in position 1 of the motif, both I and V have probability greater than 0.2 in position 2, etc. Third, a rough approximation of the motif can be made by converting the multilevel consensus sequence into the Prosite signature L-[IV]-T-G-[AG]-[ASG]-S-G-I-G. The multilevel consensus sequence is printed so that each column lines up with the same column in the simplified motif and information content diagrams above it.
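The procedure described above can be sketched in a few lines of Python. The function names and the toy DNA matrix are hypothetical; the 0.2 threshold and the ordering rule are taken from the description, while the use of "x" for a position with no letter at or above the threshold is an assumption.

```python
def ranked_letters(row, alphabet, threshold=0.2):
    """Letters of one motif position, most probable first, at or above threshold."""
    ranked = sorted(zip(row, alphabet), key=lambda pair: -pair[0])
    return [letter for p, letter in ranked if p >= threshold]

def multilevel_consensus(matrix, alphabet, threshold=0.2):
    """Return the consensus as text rows, with the most probable letters on top."""
    columns = [ranked_letters(row, alphabet, threshold) for row in matrix]
    depth = max((len(col) for col in columns), default=0)
    return ["".join(col[level] if level < len(col) else " " for col in columns)
            for level in range(depth)]

def prosite_signature(matrix, alphabet, threshold=0.2):
    """Rough Prosite-style signature built from the same ranked letters."""
    parts = []
    for row in matrix:
        letters = ranked_letters(row, alphabet, threshold)
        if not letters:
            parts.append("x")  # assumption: no letter reaches the threshold
        elif len(letters) == 1:
            parts.append(letters[0])
        else:
            parts.append("[" + "".join(letters) + "]")
    return "-".join(parts)

# Hypothetical three-position DNA motif (rows are positions, columns are A, C, G, T).
toy = [[0.90, 0.03, 0.04, 0.03], [0.50, 0.10, 0.30, 0.10], [0.10, 0.10, 0.10, 0.70]]
print("\n".join(multilevel_consensus(toy, "ACGT")))
print(prosite_signature(toy, "ACGT"))
```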

Possible Examples of the Motif

As a further aid in understanding the motif, MEME displays a list of possible occurrences of the motif in the training set. This list is made by converting the motif letter-probability matrix into a position-dependent scoring matrix (log-odds matrix) and using that to compute a match score between each position in the training set and the motif. All positions which score above a threshold score are listed. (The threshold score is chosen by MEME such that the expected number of non-motif positions listed in error will equal the number of actual motif positions not listed.) The format of the list is sequence name, starting position of the (putative) occurrence, match score of the position, and the actual sequence including the ten positions before and after the motif occurrence (`site').

Position-dependent Scoring Matrix

The position-dependent scoring matrix corresponding to the motif is printed for use by database search programs such as MAST. This matrix is a log-odds matrix calculated by taking the log (base 2) of the ratio p/f at each position in the motif where p is the probability of a particular letter at that position in the motif, and f is the average frequency of that letter in the training set. The scoring matrix is printed "sideways"--columns correspond to the letters in the alphabet (in the same order as shown in the simplified motif) and rows correspond to the positions of the motif, position one first. The scoring matrix is preceded by a line starting with "log-odds matrix:" and containing the length of the alphabet, width of the motif, number of characters in the training set and the scoring threshold used in the list of possible motif examples.
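The log-odds calculation, and its use for listing possible motif occurrences, can be sketched as follows. The log2(p/f) formula comes from the description above; treating the match score as the sum of the matrix entries along a window, and reporting 1-based start positions, are assumptions. The function names and toy data are hypothetical.

```python
import math

def log_odds_matrix(matrix, background):
    """Entry = log2(p / f); a probability of zero is mapped to -infinity."""
    return [[math.log2(p / f) if p > 0 else float("-inf")
             for p, f in zip(row, background)]
            for row in matrix]

def scan(sequence, alphabet, scores, threshold):
    """Return (start, score) for every window scoring above the threshold."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    width = len(scores)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Assumed match score: sum of log-odds entries along the window.
        score = sum(scores[pos][index[ch]] for pos, ch in enumerate(window))
        if score > threshold:
            hits.append((start + 1, score))  # 1-based start (assumption)
    return hits

# Hypothetical two-position motif and uniform background frequencies.
motif = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
scores = log_odds_matrix(motif, [0.25, 0.25, 0.25, 0.25])
print(scan("GGATCC", "ACGT", scores, threshold=0.0))
```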

Motif Letter-probability Matrix

The motif itself is a position-dependent letter-probability matrix giving, for each position in the pattern, the probabilities of each possible letter occurring there. The letter-probability matrix is printed "sideways"--columns correspond to the letters in the alphabet (in the same order as shown in the simplified motif) and rows correspond to the positions of the motif, position one first. The motif is preceded by a line starting with "letter-probability matrix:" and containing the length of the alphabet, width of the motif and number of characters in the training set.

USAGE

meme [-p <np>] <datafile> [optional arguments ...]

[-p <np>]
use parallel version with <np> processors; the default is the serial version

<datafile>
file containing sequences in FASTA format

[-protein]
assume sequences use IUPAC protein alphabet

[-dna]
assume sequences use DNA alphabet

[-alph <alphabet>]
a string of letters in quotes

[-mod oops|zoops|tcm]
motif distribution

[-pal]
(DNA) allow palindromes

[-pal_only]
(DNA) force palindromes

[-noshorten]
do not allow motifs shorter than <minw>

[-nsites <nsites>]
expected number of sites for each motif

[-minsites <minsites>]
minimum number of sites for each motif

[-maxsites <maxsites>]
maximum number of sites for each motif

[-w <w>]
starting motif width to try

[-minw <minw>]
minimum starting motif width to try

[-maxw <maxw>]
maximum starting motif width to try

[-nmotifs <nmotifs>]
maximum number of motifs to find

[-prior dirichlet|dmix|mega|megap|addone]
type of prior to use

[-brief]
brief output--do not print documentation

[-b <b>]
strength of the prior

[-spmap uni|pam]
starting point seq to theta mapping type

[-spfuzz <spfuzz>]
fuzziness of sequence to theta mapping

[-maxiter <maxiter>]
maximum EM iterations to run

[-distance <distance>]
EM convergence criterion

[-cons <cons>]
consensus sequence to start EM from

[-chi <chi>]
maximum motif LRT significance level

[-adj none|bon|root]
type of LRT adjustment

[-maxsize <maxsize>]
maximum dataset size in characters

[-nostatus]
do not print progress reports

[-c53]
(DNA) use 5 to 3 complementary strand as well

[-c35]
(DNA) use 3 to 5 complementary strand as well

[-w35]
(DNA) use 3 to 5 main strand as well

REQUIRED ARGUMENTS:

<dataset>
The training set of sequences in Pearson/FASTA format. If the name `stdin' is given, MEME reads from standard input. Sequences may be in uppercase, lowercase, or both.

OPTIONAL ARGUMENTS:

MEME has a large number of optional inputs that can be used to fine-tune its behavior. To make these easier to understand they are divided into the following categories:

ALPHABET
control the alphabet for the motifs (patterns) that MEME will search for

DISTRIBUTION
control how MEME assumes the occurrences of the motifs are distributed throughout the training set sequences

SEARCH
control how MEME searches for motifs

SYSTEM
the -p argument causes a version of MEME compiled for a parallel CPU architecture to be run

In what follows, <n> is an integer, <a> is a decimal number, and <string> is a string of characters.

ALPHABET

The default alphabet is the IUPAC protein alphabet.

-protein
Use the standard IUPAC protein alphabet: ACDEFGHIKLMNPQRSTVWY
Conversions:
The following conversions from ambiguous to unambiguous letter codes are made automatically by MEME for protein sequences:
B --> D (Asp, Asn to Asp)
U --> C (selenocysteine to cysteine)
X --> C (unknown to cysteine)
Z --> E (Glu, Gln to Glu)

-dna
Use the standard DNA alphabet: ACGT. The following conversions from ambiguous to unambiguous letter codes are made automatically by MEME for DNA sequences:
B --> C (GTC to C)
D --> G (GAT to G)
H --> A (ACT to A)
K --> T (GT to T)
M --> A (AC to A)
N --> C (any to C)
R --> A (GA to A)
S --> G (GC to G)
U --> T (uridine to T)
V --> G (GCA to G)
W --> T (AT to T)
Y --> C (TC to C)
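For pre-processing sequences outside MEME, the two conversion tables above can be written down as lookup tables. This is only a sketch that mirrors the tables as listed; the names `PROTEIN_CONVERSIONS`, `DNA_CONVERSIONS` and `to_unambiguous` are hypothetical, and converting to uppercase first is an assumption.

```python
# Conversion tables copied from the lists above.
PROTEIN_CONVERSIONS = {"B": "D", "U": "C", "X": "C", "Z": "E"}
DNA_CONVERSIONS = {
    "B": "C", "D": "G", "H": "A", "K": "T", "M": "A", "N": "C",
    "R": "A", "S": "G", "U": "T", "V": "G", "W": "T", "Y": "C",
}

def to_unambiguous(sequence, conversions):
    """Replace ambiguous letter codes after converting to uppercase."""
    return sequence.upper().translate(str.maketrans(conversions))

print(to_unambiguous("acgtnry", DNA_CONVERSIONS))    # ACGTCAC
print(to_unambiguous("MBXZU", PROTEIN_CONVERSIONS))  # MDCEC
```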

-alph <string>
Use <string> as the alphabet. <string> should contain all the characters used in the sequences in <dataset>. MEME does NOT understand characters that stand for several different characters in the sequence alphabet.

DISTRIBUTION

If you know how occurrences of motifs are distributed in the training set sequences, you can specify it with the following optional switches. The default distribution of motif occurrences is assumed to be zero or one occurrence per sequence.

-mod <string>
The type of distribution to assume.

oops
One Occurrence Per Sequence
MEME assumes that each sequence in the dataset contains exactly one occurrence of each motif. This option is the fastest and most sensitive but the motifs returned by MEME may be "blurry" if any of the sequences is missing them.

zoops
Zero or One Occurrence Per Sequence
MEME assumes that each sequence may contain at most one occurrence of each motif. This option is useful when you suspect that some motifs may be missing from some of the sequences. In that case, the motifs found will be more accurate than using the first option. This option takes more computer time than the first option (about twice as much) and is slightly less sensitive to weak motifs present in all of the sequences.

tcm
Two-Component Mixture
MEME assumes each sequence may contain any number of non-overlapping occurrences of each motif. This option is useful when you suspect that motifs repeat multiple times within a single sequence. In that case, the motifs found will be much more accurate than using one of the other options. This option can also be used to discover repeats within a single sequence. This option takes much more computer time than the first option (about ten times as much) and is somewhat less sensitive than the other two options to weak motifs which do not repeat within a single sequence.

SEARCH

A) MOTIF WIDTH

-w <n>
-minw <n>
-maxw <n>
-noshorten
The width of the motif(s) to search for. If -w is given, only that width is tried. Otherwise, widths between -minw and -maxw are tried and the most statistically significant one is chosen for the motif. MEME tries to shorten motifs unless -noshorten is given.
Default:
-minw 8, -maxw 60 (defined in user.h). With the default minw and maxw, MEME tests motifs with initial widths of 8, 11, 15, 21, 29, 41 and 57.

Note: If -maxw <n> or -w <n> is greater than the length of the shortest sequence in the dataset, <n> is reset by MEME to the length of that shortest sequence.
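The default initial widths listed above (8, 11, 15, 21, 29, 41, 57) are consistent with repeatedly multiplying the previous width by the square root of 2 and truncating. That pattern is only inferred from the listed numbers and is not a statement about MEME's source code; the function `initial_widths` below is a hypothetical sketch that reproduces the list under that assumption.

```python
import math

def initial_widths(minw=8, maxw=60):
    """Widths from minw up to maxw, growing by a factor of sqrt(2), truncated."""
    widths = []
    w = minw
    while w <= maxw:
        widths.append(w)
        # Assumed growth rule; max() guarantees progress for very small widths.
        w = max(w + 1, int(w * math.sqrt(2)))
    return widths

print(initial_widths())  # [8, 11, 15, 21, 29, 41, 57]
```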

B) NUMBER OF MOTIF OCCURRENCES
-nsites <n>
-minsites <n>
-maxsites <n>
The (expected) number of occurrences of each motif. If -nsites is given, only that number of occurrences is tried. Otherwise, numbers of occurrences between -minsites and -maxsites are tried as initial guesses for the number of motif occurrences. These switches are ignored if mod = oops.
Default:
-minsites sqrt(number sequences)
-maxsites 1/(L-2+1) (zoops)
-maxsites (size of dataset)/(2w) (tcm)
C) NUMBER OF MOTIFS
-nmotifs <n>
The number of *different* motifs to search for. MEME will search for and output <n> motifs.
Default: 1
-chi <a>
Quit looking for motifs if objective function falls below <a>.
Default: 1
(so MEME never quits before -nmotifs <n> have been found.)
D) DNA PALINDROMES, STRAND AND DIRECTION
-pal
-pal_only
Choosing -pal causes MEME to look for palindromes in DNA datasets. MEME automatically decides if a motif appears to be a DNA palindrome or not. If MEME decides that a motif is a palindrome, it averages the letter frequencies in corresponding columns together. For instance, if the width of the motif is 10, columns 1 and 10, 2 and 9, 3 and 8, etc., are averaged together. The averaging combines the frequency of A in one column with T in the other, and the frequency of C in one column with G in the other. Choosing -pal_only causes MEME to look only for DNA palindromes. If neither option is chosen, MEME does not search for DNA palindromes.
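The column averaging for palindromes can be illustrated with a short sketch. It follows the pairing described above (column i with column w+1-i, A with T, C with G); the function name and the toy matrix are hypothetical, and the sketch does not reproduce how MEME decides whether a motif is a palindrome.

```python
def average_palindrome(matrix, alphabet="ACGT"):
    """Average each position with the complement of its mirror position."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    index = {letter: i for i, letter in enumerate(alphabet)}
    width = len(matrix)
    averaged = []
    for pos, row in enumerate(matrix):
        mirror = matrix[width - 1 - pos]  # e.g. position 1 pairs with position w
        averaged.append([
            (row[index[letter]] + mirror[index[complement[letter]]]) / 2
            for letter in alphabet
        ])
    return averaged

# Hypothetical two-position motif (rows are positions, columns are A, C, G, T).
toy = [[0.80, 0.10, 0.05, 0.05], [0.10, 0.10, 0.10, 0.70]]
for row in average_palindrome(toy):
    print([round(p, 3) for p in row])
```

After averaging, the frequency of A in the first position equals the frequency of T in the last position, and each position still sums to one.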

-c53
include the complement strand in the 5' to 3' direction
-c35
include the complement strand in the 3' to 5' direction
-w35
include the main strand in the 3' to 5' direction

By default MEME looks for DNA motifs only in the left-to-right direction on the sequences in the dataset. You can specify that any or all of the other three strand directions also be included. For example, to search for motifs on both the main strand and the inverse complement strand, use the -c53 switch on the command line.
Note: these switches may only be used if mod = oops.
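One way to picture the four strand directions is to write out the string that would be read in each case. In the sketch below, -c53 is taken to be the inverse (reverse) complement, as stated above; reading -c35 as the complement without reversal and -w35 as the reversed main strand is an interpretation of the switch descriptions, not something confirmed by MEME's code. The function name is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def strand_views(sequence):
    """Return the sequence as read under each strand/direction option."""
    seq = sequence.upper()
    return {
        "main 5'->3'": seq,                       # default, left to right
        "-c53": seq.translate(COMPLEMENT)[::-1],  # inverse complement
        "-c35": seq.translate(COMPLEMENT),        # complement, not reversed (assumed)
        "-w35": seq[::-1],                        # main strand reversed (assumed)
    }

for name, view in strand_views("AACG").items():
    print(name, view)
```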
E) EM ALGORITHM
-maxiter <n>
The number of iterations of EM to run from any starting point. EM is run for <n> iterations or until convergence (see -distance, below) from each starting point.
Default: 50

-distance <a>
The convergence criterion. MEME stops iterating EM when the change in the motif frequency matrix is less than <a>. (Change is the euclidean distance between two successive frequency matrices.)
Default: 0.001
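The interaction between -maxiter and -distance can be sketched as a simple loop. The `update` argument stands in for one EM iteration and is purely hypothetical; only the stopping rule (Euclidean distance between successive frequency matrices below <a>, or <n> iterations reached) is taken from the descriptions above.

```python
import math

def euclidean_distance(old, new):
    """Euclidean distance between two matrices of equal shape."""
    return math.sqrt(sum((a - b) ** 2
                         for old_row, new_row in zip(old, new)
                         for a, b in zip(old_row, new_row)))

def run_until_converged(update, start, maxiter=50, distance=0.001):
    """Apply `update` until the matrix changes by less than `distance`."""
    current = start
    for iteration in range(1, maxiter + 1):
        following = update(current)
        if euclidean_distance(current, following) < distance:
            return following, iteration
        current = following
    return current, maxiter

# Stand-in for an EM step: move halfway toward a fixed target matrix.
target = [[1.0, 0.0]]
def halfway(matrix):
    return [[(a + t) / 2 for a, t in zip(row, t_row)]
            for row, t_row in zip(matrix, target)]

final, iterations = run_until_converged(halfway, [[0.5, 0.5]])
print(iterations, final)
```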

-adj <string>
The type of adjustment made to the LRT-based objective function used by MEME.

none - use significance level of LRT
bon - use Bonferroni-like adjustment
root - use n-th root of LRT sig. level (default)

Adjustments are listed in order of favoring *shorter* motif widths. root adjustment favors shortest widths.

-prior <string>
The prior distribution on the model parameters:

dirichlet
simple Dirichlet prior. This is the default for -dna and -alph.
dmix
mixture of Dirichlets prior. This is the default for -protein with -mod oops.
mega
extremely low variance dmix; variance is scaled inversely with the size of the dataset. This is the default for -protein with -mod tcm.
megap
mega for all but last iteration of EM; dmix on last iteration. This is the default for -protein with -mod zoops.
addone
add +1 to each observed count

-b <a>
The strength of the prior on model parameters:

<a> = 0 means use intrinsic strength of prior for prior = dmix. Defaults: 1 if prior = dirichlet, 0 if prior = dmix

-plib <string>
The name of the file containing the Dirichlet prior in the format of file prior30.plib.
F) SELECTING STARTS FOR EM

The default is for MEME to search the dataset for good starts for EM. How the starting points are derived from the dataset is specified by the following switches.

The default type of mapping MEME uses is:
-spmap uni for -dna and -alph <string>
-spmap pam for -protein

-spfuzz <a>
The fuzziness of the mapping. Possible values are greater than 0. Meaning depends on -spmap, see below.

-spmap <string>
The type of mapping function to use.
uni
Use add-<a> prior when converting a substring to an estimate of theta. Default -spfuzz <a>: 0.5
pam
Use columns of PAM <a> matrix when converting a substring to an estimate of theta. Default -spfuzz <a>: 120 (PAM 120)
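To make the `uni' mapping more concrete, the sketch below turns a substring into a starting letter-probability matrix by adding <a> to every letter count at each position and normalizing. This is one plausible reading of "add-<a> prior"; the exact formula used by MEME is an assumption here, and the function name is hypothetical.

```python
def substring_to_theta(substring, alphabet, fuzz=0.5):
    """One row per position: observed letter gets 1 + fuzz, others get fuzz."""
    total = 1.0 + fuzz * len(alphabet)
    theta = []
    for observed in substring.upper():
        theta.append([((1.0 if letter == observed else 0.0) + fuzz) / total
                      for letter in alphabet])
    return theta

# With the default fuzziness of 0.5 and a DNA alphabet, the observed letter
# receives probability 0.5 and each other letter 1/6.
for row in substring_to_theta("AC", "ACGT"):
    print([round(p, 3) for p in row])
```

Larger values of the fuzziness make the starting point less specific to the substring; smaller values make it closer to the substring itself.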
Other types of starting points can be specified using the following switches.

-cons <string>
Override the sampling of starting points and just use a starting point derived from <string>. This is useful when an actual occurrence of a motif is known and can be used as the starting point for finding the motif.

EXAMPLES

The following examples use data files provided in this release of MEME. MEME writes its output to standard output, so you will want to redirect it to a file so that it can be used with MAST.

1) A simple DNA example:

meme crp0.s -dna -mod oops -pal > ex1
MEME looks for a single motif in the file crp0.s which contains DNA sequences in FASTA format. The OOPS model is used so MEME assumes that every sequence contains exactly one occurrence of the motif. The palindrome switch is given so motifs are tested to see if they are palindromes and reported as such if they are. MEME automatically chooses the best width for the motif in this example since no width was specified.

2) A fast DNA example:

meme crp0.s -dna -mod oops -pal -w 20 -noshorten > ex2
This example differs from example 1) in that MEME is told to only consider motifs of width 20. This causes MEME to execute about 10 times faster. The -w and -noshorten switches can also be used with protein datasets if the widths of the motifs are known in advance.

3) A simple protein example:

meme lipocalin.s -mod oops -maxw 20 -nmotifs 2 > ex3
MEME searches for two motifs each of width less than or equal to 20. (Specifying -maxw 20 makes MEME run faster since it does not have to consider motifs longer than 20.) Each motif is assumed to occur in each of the sequences because the OOPS model is specified.

4) Another simple protein example:

meme farntrans5.s -mod tcm -maxw 50 -nmotifs 3 > ex4

MEME searches for three motifs of maximum width 50 using the TCM sequence model which allows each motif to have any number of occurrences in each sequence. This dataset contains motifs with multiple repeats in each sequence.