MAST Output
The output given below has been edited for brevity. Places where output has been removed are indicated by vertical ellipses.
program: mast
memefile: exp32.zoops
searching: sprot31
args: exp32.zoops -d sprot31 -c 3 -z 7.5
mast version 1.2 ($Date: 1996/04/18 00:52:07 $)
If you use this program in your research, please cite:
Timothy L. Bailey and Charles Elkan,
"Fitting a mixture model by expectation maximization to discover motifs in
biopolymers",
Proceedings of the Second International Conference on Intelligent Systems for
Molecular Biology, (28-36), AAAI Press, 1994.
Motif widths and expected mean and sd of subsequence scores:
motif 1 width 13 mu -40.4111 sigma 9.40683
motif 2 width 7 mu -26.6725 sigma 7.23677
motif 3 width 11 mu -27.9025 sigma 7.33301
Normalization for MEAN:
Curve fit to MEAN using 978 length pools
137 of 43161 scores were rejected
Initial values of function parameters:
mu = -94.9860
sigma = 23.9766
w = 10.3333
Normalization equation for MEAN:
MEAN(x) = u(x-w) + gamma * a(x-w)
a(x) = sigma / sqrt(2*log(x))
u(x) = sigma * c(x) + mu
c(x) = sqrt(2*log(x)) - (log(log(x))+log(4*pi))/(2*sqrt(2*log(x)))
gamma = 0.5772
pi = 3.1415
mu = -106.0534
sigma = 30.8186
w = 10.1755
and x is the length of the sequence.
ChiSq: 2060.00 ChiSq/nu: 2.113
Goodness-of-fit: 6.35e-79
Normalization for SD:
Curve fit to SD using 49 length pools
Initial values of function parameters:
sigma = 17.7931
w = 10.1755
Normalization equation for SD:
SD(x) = a(x-w) * pi / sqrt(6)
a(x) = sigma / sqrt(2*log(x))
pi = 3.1415
sigma = 20.2847
w = 21.3525
and x is the length of the sequence.
ChiSq: 301.07 ChiSq/nu: 6.406
Goodness-of-fit: 2.4e-36
Length-normalized ZSCORE:
ZSCORE = ( score - MEAN(x) ) / SD(x)
and x is the length of the sequence.
Observed numbers of database sequences achieving
various ZSCORE values:
Histogram units: = 171 sequences : fewer than 171 sequences
min cumul count
-5.2 43161 4|:
-4.8 43157 5|:
-4.2 43152 10|:
-3.8 43142 20|:
-3.2 43122 52|:
-2.8 43070 285|=
-2.2 42785 1020|=====
-1.8 41765 2802|================
-1.2 38963 5648|=================================
-0.8 33315 8003|==============================================
-0.2 25312 8542|=================================================
0.2 16770 7073|=========================================
0.8 9697 4659|===========================
1.2 5038 2595|===============
1.8 2443 1328|=======
2.2 1115 571|===
2.8 544 246|=
3.2 298 90|:
3.8 208 39|:
4.2 169 20|:
4.8 149 12|:
5.2 137 5|:
5.8 132 4|:
6.2 128 4|:
6.8 124 10|:
7.2 114 3|:
7.8 111 5|:
8.2 106 15|:
8.8 91 11|:
9.2 80 18|:
9.8 62 4|:
10.2 58 7|:
10.8 51 5|:
11.2 46 11|:
11.8 35 12|:
12.2 23 8|:
12.8 15 9|:
13.2 6 2|:
13.8 4 3|:
14.2 1 1|:
The following 112 sequences had ZSCORES at least 7.5 sd above the mean.
Table is sorted by decreasing ZSCORE.
SEQUENCE DESCRIPTION MAXSUM ZSCORE LEN
PGDH_HUMAN 15-HYDROXYPROSTAGLANDIN DEHYDRO... 96.0 14.5 266
DHKR_STRCM MONENSIN POLYKETIDE SYNTHASE PU... 90.8 13.8 261
ACT3_STRCO PUTATIVE KETOACYL REDUCTASE (EC... 90.5 13.8 261
.
.
.
ADH_DROAD ALCOHOL DEHYDROGENASE (EC 1.1.1... 48.1 8.4 253
ADH_DROHE ALCOHOL DEHYDROGENASE (EC 1.1.1... 48.1 8.4 253
ADH_DRODI ALCOHOL DEHYDROGENASE (EC 1.1.1... 48.1 8.4 253
ADH_DROPL ALCOHOL DEHYDROGENASE (EC 1.1.1... 48.1 8.4 253
ADH_DROSU ALCOHOL DEHYDROGENASE (EC 1.1.1... 47.2 8.3 254
MAS1_AGRRA AGROPINE SYNTHESIS REDUCTASE. 49.7 8.3 476
DHB3_HUMAN ESTRADIOL 17 BETA-DEHYDROGENASE... 47.1 8.2 310
25KD_SARPE DEVELOPMENT-SPECIFIC 25 KD PROT... 46.1 8.1 258
YABA_BACNO HYPOTHETICAL PROTEIN IN AABA 3'... 42.0 8.0 106
NODG_RHIMS NODULATION PROTEIN G (HOST-SPEC... 43.2 7.8 244
BDH_BOVIN D-BETA-HYDROXYBUTYRATE DEHYDROG... 41.8 7.8 178
YDFG_SALTY HYPOTHETICAL PROTEIN IN DCP 3'R... 38.9 7.7 96
Schematic diagrams of spacings of motifs for high-scoring sequences:
[n] occurrence of motif n (score >= threshold)
motif 1 threshold = 8.102
motif 2 threshold = 8.133
motif 3 threshold = 8.167
-spacer- distance to start of next motif occurrence
SEQUENCE SCHEMATIC DIAGRAM
PGDH_HUMAN 5-[1]-64-[3]-57-[2]-109
DHKR_STRCM 6-[1]-62-[3]-64-[2]-98
ACT3_STRCO 6-[1]-62-[3]-64-[2]-98
DHB1_HUMAN 2-[1]-66-[3]-62-[2]-166
.
.
.
ADH_DROPL 81-[3]-57-[2]-97
ADH_DROSU 82-[3]-57-[2]-97
MAS1_AGRRA 245-[1]-59-[3]-148
DHB3_HUMAN 48-[1]-61-[3]-64-[2]-106
25KD_SARPE 82-[3]-57-[2]-101
YABA_BACNO [1]-59-[3]-23
NODG_RHIMS 6-[1]-132-[2]-86
BDH_BOVIN 9-[1]-19-[3]-126
YDFG_SALTY [1]-58-[3]-14
Motif locations and their scores for high-scoring sequences:
PGDH_HUMAN 15-HYDROXYPROSTAGLANDIN DEHYDROGENASE (NAD(+)) (EC 1.1.
PGDH_HUMAN 1.141) (PGDH).
PGDH_HUMAN LENGTH = 266 MAXSUM = 96.0 ZSCORE = 14.49
PGDH_HUMAN 1 37.21
PGDH_HUMAN 1 1111111111111
PGDH_HUMAN 1 MHVNGKVALVTGAAQGIGRAFAEALLLKGAKVALVDWNLEAGVQCKAALD
PGDH_HUMAN 51 34.51
PGDH_HUMAN 51 33333333333
PGDH_HUMAN 51 EQFEPQKTLFIQCDVADQQQLRDTFRKVVDHFGRLDILVNNAGVNNEKNW
PGDH_HUMAN 101
PGDH_HUMAN 101
PGDH_HUMAN 101 EKTLQINLVSVISGTYLGLDYMSKQNGGEGGIIINMSSLAGLMPVAQQPV
PGDH_HUMAN 151 24.25
PGDH_HUMAN 151 2222222
PGDH_HUMAN 151 YCASKHGIVGFTRSAALAANLMNSGVRLNAICPGFVNTAILESIEKEENM
PGDH_HUMAN 201
PGDH_HUMAN 201
PGDH_HUMAN 201 GQYIEYKDHIKDMIKYYGILDPPLIANGLITLIEDDALNGAIMKITTSKG
PGDH_HUMAN 251
PGDH_HUMAN 251
PGDH_HUMAN 251 IHFQDYDTTPFQAKTQ
.
.
.
YDFG_SALTY HYPOTHETICAL PROTEIN IN DCP 3'REGION (FRAGMENT).
YDFG_SALTY LENGTH = 96 MAXSUM = 38.9 ZSCORE = 7.65
YDFG_SALTY 1 17.85
YDFG_SALTY 1 1111111111111
YDFG_SALTY 1 MIVLVTGATAGFGECIARRFVENGHKVIATGRRHERLQALKDELGENVLT
YDFG_SALTY 51 25.36
YDFG_SALTY 51 33333333333
YDFG_SALTY 51 AQLDVQPRGHRRDDGLSASQWRDIDVLVNNAGLALGLEPAHKASVE
Explanation of MAST output
The output file produced by MAST when it was used to search for occurrences of a group of MEME-discovered motifs in SWISS-PROT release 31 is shown above.
The first four lines indicate
- the program (MAST),
- the name of the file containing the MEME-discovered motifs (exp32.zoops),
- the name of the database being searched (sprot31), and
- the arguments to the program:
- -c 3 means only the first 3 motifs found by MEME are used in the search
- -z 7.5 means only sequences with MAXSUM z-scores of at least 7.5 will be reported
After some lines giving the version of MAST being run, the date on which it was built and the citation for the program, the motifs are described. For each motif, the width and an estimate of the mean and standard deviation of the scores of random subsequences are given.
Next, information describing how MAXSUM scores are converted into ZSCORES is given. There are two normalization equations, one for the average MAXSUM score (labeled MEAN in the output) and one for its standard deviation (SD), each as a function of sequence length. The ZSCORE of a sequence of length x is defined as
ZSCORE = (MAXSUM - AVG(x)) / SD(x)
where AVG(x) is the average MAXSUM score for random sequences of length x and SD(x) is the standard deviation of MAXSUM scores for sequences of length x.
The MAXSUM scores are assumed to be drawn from a distribution that is the sum of Gaussian extreme value random variables, which determines the form of the equations for AVG(x) and SD(x). The values of the free parameters w, mu and sigma in the equations are estimated empirically by a chi-square curve fit to the (length, MAXSUM) pairs observed with the dataset and motifs being used. After an initial curve fit to AVG(x), and using an initial estimate of SD(x), outlier points whose ZSCORE would be more than 5 are removed (137 of 43161 scores in the output above) and the final curve fits are done. These curves for AVG(x) and SD(x) are then used to calculate the ZSCORE for each (length, MAXSUM) pair.
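As an illustration, the normalization equations printed in the output can be transcribed into a short Python sketch. The constants and fitted parameters are copied from the sample output above; the function names are hypothetical, and this is only a transcription of the printed formulas, not MAST's own code.

```python
import math

# Constants and final fitted parameters, copied from the sample output above.
GAMMA = 0.5772
PI = 3.1415
MEAN_MU, MEAN_SIGMA, MEAN_W = -106.0534, 30.8186, 10.1755
SD_SIGMA, SD_W = 20.2847, 21.3525


def a(x, sigma):
    # a(x) = sigma / sqrt(2*log(x)); natural logarithm, requires x > 1
    return sigma / math.sqrt(2 * math.log(x))


def c(x):
    # c(x) = sqrt(2*log(x)) - (log(log(x)) + log(4*pi)) / (2*sqrt(2*log(x)))
    root = math.sqrt(2 * math.log(x))
    return root - (math.log(math.log(x)) + math.log(4 * PI)) / (2 * root)


def u(x, mu, sigma):
    # u(x) = sigma * c(x) + mu
    return sigma * c(x) + mu


def mean_maxsum(length):
    # MEAN(x) = u(x-w) + gamma * a(x-w)
    x = length - MEAN_W
    return u(x, MEAN_MU, MEAN_SIGMA) + GAMMA * a(x, MEAN_SIGMA)


def sd_maxsum(length):
    # SD(x) = a(x-w) * pi / sqrt(6)
    return a(length - SD_W, SD_SIGMA) * PI / math.sqrt(6)


def zscore(maxsum, length):
    # ZSCORE = (score - MEAN(x)) / SD(x)
    return (maxsum - mean_maxsum(length)) / sd_maxsum(length)


print(zscore(96.0, 266))  # PGDH_HUMAN: MAXSUM 96.0, length 266
print(zscore(38.9, 96))   # YDFG_SALTY: MAXSUM 38.9, length 96
```

For these two sequences the sketch gives values close to the reported ZSCORES of 14.49 and 7.65; small differences are expected because the MAXSUM values printed in the table are rounded to one decimal place.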
The histogram of numbers of database sequences having different ZSCORES comes next. The first column shows the lowest ZSCORE in the corresponding bin (i.e., the "bottom" of the bin). The second column shows the total number of sequences having ZSCORES at or above that value. The third column shows the number of sequences with ZSCORES in that bin of the histogram. This number is also shown visually by the bar to the right of the third column.
Next, the MAXSUM scores and ZSCORES are shown for sequences with ZSCORES at or above the user-specified threshold. The sequences are sorted in decreasing order of ZSCORE. The name of the sequence, an abbreviated description from the database, the MAXSUM score, the ZSCORE and the length of each sequence are printed.
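The relationship between the three histogram columns can be sketched in Python. The function name is hypothetical, and the bar rule (one "=" per full 171 sequences, ":" for smaller counts) is an assumption inferred from the sample output rather than taken from MAST itself.

```python
def histogram_rows(bins, unit):
    """Build (bin_bottom, cumulative, count, bar) rows.

    bins: list of (bin_bottom, count) pairs in increasing order of bin_bottom.
    unit: number of sequences represented by one '=' character.
    """
    rows = []
    # Sequences at or above the current bin bottom; starts with all of them.
    remaining = sum(count for _, count in bins)
    for bottom, count in bins:
        # Assumed bar rule: whole units as '=', anything smaller as ':'.
        bar = "=" * (count // unit) if count >= unit else ":"
        rows.append((bottom, remaining, count, bar))
        remaining -= count
    return rows


# The last five bins of the histogram above.
tail = [(12.2, 8), (12.8, 9), (13.2, 2), (13.8, 3), (14.2, 1)]
for bottom, cumul, count, bar in histogram_rows(tail, unit=171):
    print(f"{bottom:5.1f} {cumul:6d} {count:5d}|{bar}")
```

Run on the last five bins, this reproduces the cumulative counts 23, 15, 6, 4 and 1 shown in the second column of the output.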
The next section of the output shows schematic diagrams for the sequences. The name of the sequence is shown to the left of the schematic diagram.
The diagrams show the order and spacing of the motif occurrences in each sequence. Motif occurrences are shown as [n] and spacings as -n-, where "n" is the motif number or the length of the space between motifs, respectively. Motif occurrences are defined as positions on the sequence with log-odds scores above the MEME-chosen threshold. (The threshold for each motif is shown at the top of the diagrams.) If motif occurrences would overlap, only the set of non-overlapping occurrences whose total score is highest is shown.
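A schematic diagram can be turned back into motif start positions with a short Python sketch. The function name is hypothetical, and the code assumes that each spacer counts the residues between occurrences and that each occurrence spans the full motif width; this reading reproduces the reported lengths of PGDH_HUMAN (266) and YDFG_SALTY (96).

```python
import re

# Motif widths copied from the "Motif widths" lines of the sample output.
MOTIF_WIDTHS = {1: 13, 2: 7, 3: 11}


def parse_schematic(diagram, widths):
    """Return ([(motif, start), ...], total_length) with 1-based starts."""
    position = 0  # number of residues consumed so far
    occurrences = []
    for token in diagram.split("-"):
        match = re.fullmatch(r"\[(\d+)\]", token)
        if match:
            motif = int(match.group(1))
            occurrences.append((motif, position + 1))
            position += widths[motif]  # the occurrence spans the motif width
        else:
            position += int(token)  # spacer length in residues
    return occurrences, position


occurrences, length = parse_schematic("5-[1]-64-[3]-57-[2]-109", MOTIF_WIDTHS)
print(occurrences, length)  # [(1, 6), (3, 83), (2, 151)] 266
```

The start positions agree with the annotated PGDH_HUMAN sequence in the final section of the output, where motif 3 falls in the block beginning at position 51 and motif 2 in the block beginning at position 151.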
In the final section, the sequences from the schematic diagrams are shown again, annotated with the motif occurrences and their raw scores (in bits) above the actual sequence. The first lines show the name of the sequence and its complete description from the database, followed by its length, MAXSUM and ZSCORE. Then groups of three lines are printed showing
- log-odds scores of individual motif occurrences,
- positions of the occurrences, and
- the actual sequence.