MAST Output

The output given below has been edited for brevity. Places where output has been removed are indicated by vertical ellipses.

program:   mast
memefile:  exp32.zoops
searching: sprot31
args: exp32.zoops -d sprot31 -c 3 -z 7.5


mast version 1.2 ($Date: 1996/04/18 00:52:07 $)

If you use this program in your research, please cite:
Timothy L. Bailey and Charles Elkan,
"Fitting a mixture model by expectation maximization to discover motifs in
biopolymers",
Proceedings of the Second International Conference on Intelligent Systems for
Molecular Biology, (28-36), AAAI Press, 1994.

Motif widths and expected mean and sd of subsequence scores:
         motif   1  width  13  mu    -40.4111  sigma     9.40683
         motif   2  width   7  mu    -26.6725  sigma     7.23677
         motif   3  width  11  mu    -27.9025  sigma     7.33301


Normalization for MEAN:

         Curve fit to MEAN using 978 length pools

         137 of 43161 scores were rejected

         Initial values of function parameters:
          mu    =   -94.9860
          sigma =    23.9766
          w     =    10.3333

         Normalization equation for MEAN:
          MEAN(x) = u(x-w) + gamma * a(x-w)
          a(x) = sigma / sqrt(2*log(x))
          u(x) = sigma * c(x) + mu
          c(x) = sqrt(2*log(x)) - (log(log(x))+log(4*pi))/(2*sqrt(2*log(x)))
          gamma =     0.5772
          pi    =     3.1415
          mu    =  -106.0534
          sigma =    30.8186
          w     =    10.1755
         and x is the length of the sequence.

         ChiSq:    2060.00      ChiSq/nu:      2.113
         Goodness-of-fit: 6.35e-79


Normalization for SD:

         Curve fit to SD using 49 length pools

         Initial values of function parameters:
          sigma =    17.7931
          w     =    10.1755

         Normalization equation for SD:
          SD(x) = a(x-w) * pi / sqrt(6)
          a(x) = sigma / sqrt(2*log(x))
          pi    =     3.1415
          sigma =    20.2847
          w     =    21.3525
         and x is the length of the sequence.

         ChiSq:     301.07      ChiSq/nu:      6.406
         Goodness-of-fit: 2.4e-36


Length-normalized ZSCORE:

          ZSCORE = ( score - MEAN(x) ) / SD(x)

         and x is the length of the sequence.


        Observed numbers of database sequences achieving
        various ZSCORE values:

Histogram units:   = 171 sequences   : fewer than 171 sequences

   min   cumul  count
  -5.2   43161      4|:
  -4.8   43157      5|:
  -4.2   43152     10|:
  -3.8   43142     20|:
  -3.2   43122     52|:
  -2.8   43070    285|=
  -2.2   42785   1020|=====
  -1.8   41765   2802|================
  -1.2   38963   5648|=================================
  -0.8   33315   8003|==============================================
  -0.2   25312   8542|=================================================
   0.2   16770   7073|=========================================
   0.8    9697   4659|===========================
   1.2    5038   2595|===============
   1.8    2443   1328|=======
   2.2    1115    571|===
   2.8     544    246|=
   3.2     298     90|:
   3.8     208     39|:
   4.2     169     20|:
   4.8     149     12|:
   5.2     137      5|:
   5.8     132      4|:
   6.2     128      4|:
   6.8     124     10|:
   7.2     114      3|:
   7.8     111      5|:
   8.2     106     15|:
   8.8      91     11|:
   9.2      80     18|:
   9.8      62      4|:
  10.2      58      7|:
  10.8      51      5|:
  11.2      46     11|:
  11.8      35     12|:
  12.2      23      8|:
  12.8      15      9|:
  13.2       6      2|:
  13.8       4      3|:
  14.2       1      1|:




The following 112 sequences had ZSCORES at least 7.5 sd above the mean.
Table is sorted by decreasing ZSCORE.

SEQUENCE                 DESCRIPTION                        MAXSUM ZSCORE   LEN

PGDH_HUMAN               15-HYDROXYPROSTAGLANDIN DEHYDRO...   96.0   14.5   266
DHKR_STRCM               MONENSIN POLYKETIDE SYNTHASE PU...   90.8   13.8   261
ACT3_STRCO               PUTATIVE KETOACYL REDUCTASE (EC...   90.5   13.8   261
.
.
.
ADH_DROAD                ALCOHOL DEHYDROGENASE (EC 1.1.1...   48.1    8.4   253
ADH_DROHE                ALCOHOL DEHYDROGENASE (EC 1.1.1...   48.1    8.4   253
ADH_DRODI                ALCOHOL DEHYDROGENASE (EC 1.1.1...   48.1    8.4   253
ADH_DROPL                ALCOHOL DEHYDROGENASE (EC 1.1.1...   48.1    8.4   253
ADH_DROSU                ALCOHOL DEHYDROGENASE (EC 1.1.1...   47.2    8.3   254
MAS1_AGRRA               AGROPINE SYNTHESIS REDUCTASE.        49.7    8.3   476
DHB3_HUMAN               ESTRADIOL 17 BETA-DEHYDROGENASE...   47.1    8.2   310
25KD_SARPE               DEVELOPMENT-SPECIFIC 25 KD PROT...   46.1    8.1   258
YABA_BACNO               HYPOTHETICAL PROTEIN IN AABA 3'...   42.0    8.0   106
NODG_RHIMS               NODULATION PROTEIN G (HOST-SPEC...   43.2    7.8   244
BDH_BOVIN                D-BETA-HYDROXYBUTYRATE DEHYDROG...   41.8    7.8   178
YDFG_SALTY               HYPOTHETICAL PROTEIN IN DCP 3'R...   38.9    7.7    96


Schematic diagrams of spacings of motifs for high-scoring sequences:
    [n]           occurrence of motif n (score >= threshold)
                     motif  1 threshold =  8.102
                     motif  2 threshold =  8.133
                     motif  3 threshold =  8.167
    -spacer-  distance to start of next motif occurrence


SEQUENCE                 SCHEMATIC DIAGRAM

PGDH_HUMAN               5-[1]-64-[3]-57-[2]-109
DHKR_STRCM               6-[1]-62-[3]-64-[2]-98
ACT3_STRCO               6-[1]-62-[3]-64-[2]-98
DHB1_HUMAN               2-[1]-66-[3]-62-[2]-166
.
.
.
ADH_DROPL                81-[3]-57-[2]-97
ADH_DROSU                82-[3]-57-[2]-97
MAS1_AGRRA               245-[1]-59-[3]-148
DHB3_HUMAN               48-[1]-61-[3]-64-[2]-106
25KD_SARPE               82-[3]-57-[2]-101
YABA_BACNO               [1]-59-[3]-23
NODG_RHIMS               6-[1]-132-[2]-86
BDH_BOVIN                9-[1]-19-[3]-126
YDFG_SALTY               [1]-58-[3]-14



Motif locations and their scores for high-scoring sequences:


PGDH_HUMAN               15-HYDROXYPROSTAGLANDIN DEHYDROGENASE (NAD(+)) (EC 1.1.
PGDH_HUMAN               1.141) (PGDH).
PGDH_HUMAN               LENGTH = 266  MAXSUM =  96.0  ZSCORE = 14.49
PGDH_HUMAN                  1      37.21
PGDH_HUMAN                  1      1111111111111
PGDH_HUMAN                  1 MHVNGKVALVTGAAQGIGRAFAEALLLKGAKVALVDWNLEAGVQCKAALD
PGDH_HUMAN                 51                                 34.51
PGDH_HUMAN                 51                                 33333333333
PGDH_HUMAN                 51 EQFEPQKTLFIQCDVADQQQLRDTFRKVVDHFGRLDILVNNAGVNNEKNW
PGDH_HUMAN                101
PGDH_HUMAN                101
PGDH_HUMAN                101 EKTLQINLVSVISGTYLGLDYMSKQNGGEGGIIINMSSLAGLMPVAQQPV
PGDH_HUMAN                151 24.25
PGDH_HUMAN                151 2222222
PGDH_HUMAN                151 YCASKHGIVGFTRSAALAANLMNSGVRLNAICPGFVNTAILESIEKEENM
PGDH_HUMAN                201
PGDH_HUMAN                201
PGDH_HUMAN                201 GQYIEYKDHIKDMIKYYGILDPPLIANGLITLIEDDALNGAIMKITTSKG
PGDH_HUMAN                251
PGDH_HUMAN                251
PGDH_HUMAN                251 IHFQDYDTTPFQAKTQ

.
.
.

YDFG_SALTY               HYPOTHETICAL PROTEIN IN DCP 3'REGION (FRAGMENT).
YDFG_SALTY               LENGTH = 96  MAXSUM =  38.9  ZSCORE =  7.65
YDFG_SALTY                  1 17.85
YDFG_SALTY                  1 1111111111111
YDFG_SALTY                  1 MIVLVTGATAGFGECIARRFVENGHKVIATGRRHERLQALKDELGENVLT
YDFG_SALTY                 51                      25.36
YDFG_SALTY                 51                      33333333333
YDFG_SALTY                 51 AQLDVQPRGHRRDDGLSASQWRDIDVLVNNAGLALGLEPAHKASVE

Explanation of MAST output

Shown above is the output file produced by MAST when it was used to search SWISS-PROT release 31 for occurrences of a group of MEME-discovered motifs.

The first four lines indicate

  • the program (MAST),
  • the name of the file containing the MEME-discovered motifs (exp32.zoops),
  • the name of the database being searched (sprot31), and
  • the arguments to the program:
    • -c 3 means only the first 3 motifs found by MEME are used in the search
    • -z 7.5 means only sequences whose MAXSUM z-score (ZSCORE) is at least 7.5 will be reported

After some lines giving the version of MAST being run and the date on which it was built, the motifs are described. For each motif, the width is given along with estimates of the mean and standard deviation of the scores of random subsequences scored with that motif.

Next, information describing how MAXSUM scores are converted into ZSCORES is given. There are two normalization equations, one for the average MAXSUM score and one for its standard deviation, each as a function of sequence length. The ZSCORE of a sequence of length x is defined as

ZSCORE = (MAXSUM - AVG(x)) / SD(x)

where AVG(x) is the average MAXSUM score for random sequences of length x and SD(x) is the standard deviation of MAXSUM scores for sequences of length x. (In the output itself, AVG(x) is printed as MEAN(x) and the MAXSUM score as "score".)
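As a rough illustration, the normalization equations and fitted parameters printed in the sample output can be evaluated directly. The sketch below is not MAST's code; the function names are invented for this example, and the constants (gamma, pi, mu, sigma, w) are copied from the output above.

```python
import math

GAMMA = 0.5772  # Euler's constant, as printed in the output
PI = 3.1415     # value of pi as printed in the output

def a(x, sigma):
    # a(x) = sigma / sqrt(2*log(x)); requires x > 1
    return sigma / math.sqrt(2 * math.log(x))

def c(x):
    # c(x) = sqrt(2*log(x)) - (log(log(x)) + log(4*pi)) / (2*sqrt(2*log(x)))
    r = math.sqrt(2 * math.log(x))
    return r - (math.log(math.log(x)) + math.log(4 * PI)) / (2 * r)

def mean_maxsum(length, mu=-106.0534, sigma=30.8186, w=10.1755):
    # MEAN(x) = u(x-w) + gamma * a(x-w), with u(x) = sigma * c(x) + mu
    x = length - w
    u = sigma * c(x) + mu
    return u + GAMMA * a(x, sigma)

def sd_maxsum(length, sigma=20.2847, w=21.3525):
    # SD(x) = a(x-w) * pi / sqrt(6)
    return a(length - w, sigma) * PI / math.sqrt(6)

def zscore(maxsum, length):
    # ZSCORE = (score - MEAN(x)) / SD(x), where x is the sequence length
    return (maxsum - mean_maxsum(length)) / sd_maxsum(length)

# PGDH_HUMAN: MAXSUM 96.0, length 266 (reported ZSCORE 14.49)
print(round(zscore(96.0, 266), 2))
# YDFG_SALTY: MAXSUM 38.9, length 96 (reported ZSCORE 7.65)
print(round(zscore(38.9, 96), 2))
```

The computed values should land close to the ZSCOREs reported in the table; small differences are expected because the MAXSUM scores in the output are rounded to one decimal place.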

The MAXSUM scores are assumed to be drawn from a distribution which is the sum of Gaussian Extreme Value random variables. This provides the form of the equations for AVG(x) and SD(x). The values of the free parameters w, mu and sigma in the equations are estimated empirically by doing a chi-square curve fit to the (length, MAXSUM) score pairs observed with the dataset and motifs being used. After an initial curve fit to AVG(x), and using an initial estimate of SD(x), outlier points whose ZSCORE would be more than 5 are removed and the final curve fits are done. These curves for AVG(x) and SD(x) are then used to calculate the ZSCORE for each (length, MAXSUM) pair.

The histogram of numbers of database sequences having different ZSCORES comes next. The first column shows the lowest ZSCORE in the corresponding bin (i.e., the "bottom" of the bin). The second column shows the total number of sequences in that bin or any higher bin, that is, with ZSCORES at or above that value. The third column shows the number of sequences with ZSCORES in that bin of the histogram. This number is also shown visually by the bar to the right of the third column.
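The relationship between the "count" and "cumul" columns can be sketched as a running sum taken from the highest bin downward. The helper names below are hypothetical, and the bar rule (one "=" per full 171 sequences, ":" otherwise) is inferred from the sample histogram rather than taken from MAST itself.

```python
def cumulative_from_counts(counts):
    """For each bin, return the number of sequences in that bin or any higher bin."""
    cumul = []
    running = 0
    for n in reversed(counts):  # start from the highest-ZSCORE bin
        running += n
        cumul.append(running)
    return cumul[::-1]

def bar(count, unit=171):
    """Render a histogram bar: one '=' per full unit, ':' if fewer than one unit."""
    full = count // unit
    return "=" * full if full > 0 else ":"

# Counts of the five highest bins in the sample output (12.2 through 14.2)
top_counts = [8, 9, 2, 3, 1]
print(cumulative_from_counts(top_counts))  # compare with the cumul column: 23, 15, 6, 4, 1
print(bar(1020), bar(90))
```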

Next, the MAXSUM scores and ZSCORES are shown for the sequences whose ZSCORES meet the user-specified threshold. The sequences are sorted in decreasing order of ZSCORE. For each sequence, its name, an abbreviated description from the database, the MAXSUM score, the ZSCORE and the length are printed.

The next section of the output shows schematic diagrams for the sequences. The name of the sequence is shown left of the schematic diagram.

The diagrams show the order and spacing of the motif occurrences in each sequence. Motif occurrences are shown as [n] and spacings as -n-, where "n" is the motif number or the length of the space between motifs, respectively. Motif occurrences are defined as positions in the sequence with log-odds scores at or above the MEME-chosen threshold. (The threshold for each motif is shown at the top of the diagrams.) If motif occurrences would overlap, only the set of non-overlapping occurrences whose total score is highest is shown.
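A schematic diagram can be read back into motif start positions by walking it from left to right. The following is a minimal sketch, assuming the motif widths listed near the top of the sample output (13, 7 and 11); `parse_diagram` is a name made up for this example.

```python
# Motif widths as reported in the "Motif widths" section of the sample output
MOTIF_WIDTHS = {1: 13, 2: 7, 3: 11}

def parse_diagram(diagram, widths=MOTIF_WIDTHS):
    """Return ([(motif, start_position), ...], total_length) for a schematic diagram.

    Start positions are 1-based. Spacers count the residues between occurrences.
    """
    occurrences = []
    consumed = 0  # residues accounted for so far
    for token in diagram.split("-"):
        if token.startswith("["):
            motif = int(token.strip("[]"))
            occurrences.append((motif, consumed + 1))
            consumed += widths[motif]
        else:
            consumed += int(token)
    return occurrences, consumed

occurrences, length = parse_diagram("5-[1]-64-[3]-57-[2]-109")  # PGDH_HUMAN
print(occurrences, length)
```

For PGDH_HUMAN the spacers and motif widths add up to 266, the sequence length reported in the table, and the start positions agree with where the scores appear in the annotated sequence.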

In the final section, the same sequences shown as schematic diagrams are shown annotated with the motif occurrences and their raw scores (in bits) indicated above the actual sequence. The first lines show the name of the sequence and its complete description from the database, followed by its length, MAXSUM and ZSCORE. Then groups of three lines are printed showing

  1. log-odds scores of individual motif occurrences,
  2. positions of the occurrences, and
  3. the actual sequence.
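The three-line layout can be reproduced with a short sketch. This is only an approximation of the format as it appears in the sample output (a 25-character name field, a right-justified 4-character position, 50 residues per group); the function `annotate` and its handling of edge cases, such as a score label that runs across a group boundary, are assumptions made for this example.

```python
def annotate(name, sequence, occurrences, widths, chunk=50):
    """Return annotated lines: for each group, scores, motif positions, sequence.

    occurrences is a list of (motif, start_position, score), start 1-based.
    """
    scores = [" "] * len(sequence)
    marks = [" "] * len(sequence)
    for motif, start, score in occurrences:
        begin = start - 1
        label = f"{score:.2f}"
        scores[begin:begin + len(label)] = list(label)
        marks[begin:begin + widths[motif]] = list(str(motif) * widths[motif])
    lines = []
    for offset in range(0, len(sequence), chunk):
        prefix = f"{name:<25}{offset + 1:>4} "
        for row in (scores, marks, sequence):
            lines.append((prefix + "".join(row[offset:offset + chunk])).rstrip())
    return lines

# YDFG_SALTY data taken from the sample output
YDFG = ("MIVLVTGATAGFGECIARRFVENGHKVIATGRRHERLQALKDELGENVLT"
        "AQLDVQPRGHRRDDGLSASQWRDIDVLVNNAGLALGLEPAHKASVE")
lines = annotate("YDFG_SALTY", YDFG, [(1, 1, 17.85), (3, 72, 25.36)],
                 {1: 13, 2: 7, 3: 11})
print("\n".join(lines))
```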