Statistical methods for DNA sequence segmentation. (English) Zbl 0960.62121

Summary: This article examines methods, issues and controversies that have arisen over the last decade in the effort to organize sequences of DNA base information into homogeneous segments. An array of different models and techniques have been considered and applied. We demonstrate that most approaches can be embedded into a suitable version of the multiple change-point problem, and we review the various methods in this light. We also propose and discuss a promising local segmentation method, namely, the application of split local polynomial fitting. The genome of bacteriophage \(\lambda\) serves as an example sequence throughout the paper.

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology
