
Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. (English) Zbl 1048.62008

Summary: We describe and develop a close relationship between two problems that have customarily been regarded as distinct: that of maximizing entropy, and that of minimizing worst-case expected loss. Using a formulation grounded in the equilibrium theory of zero-sum games between Decision Maker and Nature, these two problems are shown to be dual to each other, the solution to each providing that to the other. Although F. Topsøe [Kybernetika, Praha 15, 8–27 (1979; Zbl 0399.94007)] described this connection for the Shannon entropy over 20 years ago, it does not appear to be widely known even in that important special case.
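For the Shannon case the duality can be sketched as follows (the notation \(\Gamma\), \(P\), \(q\), \(L\) below is an assumed shorthand, not taken from the paper): writing \(H(P)=-\sum_x P(x)\log P(x)\) for the Shannon entropy and \(L(x,q)=-\log q(x)\) for the logarithmic loss, one has \(H(P)=\inf_q \mathrm{E}_P\,L(X,q)\), and the duality asserts that, under suitable regularity conditions on the set \(\Gamma\) of distributions available to Nature,
\[
\sup_{P\in\Gamma} H(P)\;=\;\sup_{P\in\Gamma}\,\inf_{q}\,\mathrm{E}_P\,L(X,q)\;=\;\inf_{q}\,\sup_{P\in\Gamma}\,\mathrm{E}_P\,L(X,q),
\]
so that the maximum-entropy distribution in \(\Gamma\) is Nature's maximin strategy, while the act attaining the right-hand infimum is the Decision Maker's robust Bayes (minimax) act.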
We here generalize this theory to apply to arbitrary decision problems and loss functions. We indicate how an appropriate generalized definition of entropy can be associated with such a problem, and we show that, subject to certain regularity conditions, the above-mentioned duality continues to apply in this extended context. This simultaneously provides a possible rationale for maximizing entropy and a tool for finding robust Bayes acts. We also describe the essential identity between the problem of maximizing entropy and that of minimizing a related discrepancy or divergence between distributions. This leads to an extension, to arbitrary discrepancies, of a well-known minimax theorem for the case of Kullback–Leibler divergence (the “redundancy-capacity theorem” of information theory).
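As a rough indication of the generalized definitions involved (the precise formulation is in the paper; the notation here is again an assumed shorthand), for a general loss \(L(x,a)\) one may take
\[
H(P)\;:=\;\inf_{a}\,\mathrm{E}_P\,L(X,a),\qquad D(P,a)\;:=\;\mathrm{E}_P\,L(X,a)-H(P)\;\ge\;0,
\]
as the generalized entropy and the associated discrepancy. In the Kullback–Leibler special case the minimax theorem referred to is the redundancy-capacity identity
\[
\sup_{\pi}\,\mathrm{E}_{\theta\sim\pi}\,\mathrm{KL}\bigl(P_\theta\,\|\,P_\pi\bigr)\;=\;\inf_{Q}\,\sup_{\theta}\,\mathrm{KL}\bigl(P_\theta\,\|\,Q\bigr),\qquad P_\pi:=\int P_\theta\,\mathrm{d}\pi(\theta),
\]
equating channel capacity with minimax coding redundancy; the extension described here replaces the Kullback–Leibler divergence by the discrepancy attached to an arbitrary decision problem.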
For the important case of families of distributions having certain mean values specified, we develop simple sufficient conditions and methods for identifying the desired solutions. We use this theory to introduce a new concept of “generalized exponential family” linked to the specific decision problem under consideration, and we demonstrate that this shares many of the properties of standard exponential families.
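For orientation, in the classical Shannon/log-loss case these objects are (notation assumed): the constraint sets \(\Gamma_\tau=\{P:\mathrm{E}_P\,T(X)=\tau\}\) for a statistic \(T\), whose maximum-entropy members, when they exist, form the standard exponential family
\[
p_\beta(x)\;\propto\;\exp\{\beta^{\top}T(x)\}.
\]
The generalized exponential family introduced in the paper plays the analogous role when Shannon entropy is replaced by the generalized entropy of an arbitrary decision problem.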
Finally, we show that the existence of an equilibrium in our game can be rephrased in terms of a “Pythagorean property” of the related divergence, thus generalizing previously announced results for Kullback–Leibler and Bregman divergences [L. M. Bregman, Zh. Vychisl. Mat. Mat. Fiz. 7, 620–631 (1967; Zbl 0186.23807)].
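The classical instance of such a Pythagorean property is Csiszár's inequality for the I-projection [reference [14] below]: if \(P^*\) minimizes \(\mathrm{KL}(P\,\|\,Q)\) over a convex set \(\Gamma\), then
\[
\mathrm{KL}(P\,\|\,Q)\;\ge\;\mathrm{KL}(P\,\|\,P^*)\;+\;\mathrm{KL}(P^*\,\|\,Q)\qquad\text{for all }P\in\Gamma.
\]
Roughly speaking (in the assumed notation above), the result described here says that an equilibrium of the game corresponds to the analogous inequality holding for the generalized divergence, with \(P^*\) the maximum generalized-entropy distribution in \(\Gamma\).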

MSC:

62B10 Statistical aspects of information-theoretic topics
91A40 Other game-theoretic models
62C20 Minimax procedures in statistical decision theory
94A17 Measures of information, entropy
91A80 Applications of game theory

References:

[1] Azoury, K. S. and Warmuth, M. K. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning 43 211–246. · Zbl 0988.68173 · doi:10.1023/A:1010896012157
[2] Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory . Wiley, New York. · Zbl 0387.62011
[3] Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis , 2nd ed. Springer, New York. · Zbl 0572.62008
[4] Berger, J. O. and Bernardo, J. M. (1992). On the development of reference priors (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 35–60. Oxford Univ. Press.
[5] Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference (with discussion). J. Roy. Statist. Soc. Ser. B 41 113–147. · Zbl 0428.62004
[6] Billingsley, P. (1999). Convergence of Probability Measures , 2nd ed. Wiley, New York. · Zbl 0944.60003
[7] Borwein, J. M., Lewis, A. S. and Noll, D. (1996). Maximum entropy reconstruction using derivative information. I. Fisher information and convex duality. Math. Oper. Res. 21 442–468. · Zbl 0884.90121 · doi:10.1287/moor.21.2.442
[8] Brègman, L. M. (1967). The relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. and Math. Phys. 7 200–217. · Zbl 0186.23807
[9] Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review 78 1–3.
[10] Censor, Y. and Zenios, S. A. (1997). Parallel Optimization: Theory, Algorithms and Applications . Oxford Univ. Press. · Zbl 0945.90064
[11] Clarke, B. and Barron, A. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 36 453–471. · Zbl 0709.62008 · doi:10.1109/18.54897
[12] Clarke, B. and Barron, A. (1994). Jeffreys’ prior is asymptotically least favorable under entropy risk. J. Statist. Plann. Inference 41 37–60. · Zbl 0820.62006 · doi:10.1016/0378-3758(94)90153-8
[13] Cover, T. and Thomas, J. A. (1991). Elements of Information Theory . Wiley, New York. · Zbl 0762.94001
[14] Csiszár, I. (1975). \(I\)-divergence geometry of probability distributions and minimization problems. Ann. Probab. 3 146–158. · Zbl 0318.60013 · doi:10.1214/aop/1176996454
[15] Csiszár, I. (1991). Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Statist. 19 2032–2066. · Zbl 0753.62003 · doi:10.1214/aos/1176348385
[16] Davisson, L. D. and Leon-Garcia, A. (1980). A source matching approach to finding minimax codes. IEEE Trans. Inform. Theory 26 166–174. · Zbl 0431.94026 · doi:10.1109/TIT.1980.1056167
[17] Dawid, A. P. (1986). Probability forecasting. Encyclopedia of Statistical Sciences 7 210–218. Wiley, New York.
[18] Dawid, A. P. (1992). Prequential analysis, stochastic complexity and Bayesian inference (with discussion). In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 109–125. Oxford Univ. Press.
[19] Dawid, A. P. (1998). Coherent measures of discrepancy, uncertainty and dependence, with applications to Bayesian predictive experimental design. Technical Report 139, Dept. Statistical Science, Univ. College London.
[20] Dawid, A. P. (2003). Generalized entropy functions and Bregman divergence. Unpublished manuscript.
[21] Dawid, A. P. and Sebastiani, P. (1999). Coherent dispersion criteria for optimal experimental design. Ann. Statist. 27 65–81. · Zbl 0948.62057 · doi:10.1214/aos/1018031101
[22] DeGroot, M. H. (1962). Uncertainty, information and sequential experiments. Ann. Math. Statist. 33 404–419. · Zbl 0151.22803 · doi:10.1214/aoms/1177704567
[23] DeGroot, M. H. (1970). Optimal Statistical Decisions . McGraw-Hill, New York. · Zbl 0225.62006
[24] Della Pietra, S., Della Pietra, V. and Lafferty, J. (2002). Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-109, School of Computer Science, Carnegie Mellon Univ.
[25] Edwards, A. W. F. (1992). Likelihood , expanded ed. Johns Hopkins Univ. Press, Baltimore, MD. · Zbl 0833.62004
[26] Ferguson, T. S. (1967). Mathematical Statistics. A Decision-Theoretic Approach . Academic Press, New York. · Zbl 0153.47602
[27] Gallager, R. G. (1976). Source coding with side information and universal coding. Unpublished manuscript.
[28] Good, I. J. (1952). Rational decisions. J. Roy. Statist. Soc. Ser. B 14 107–114.
[29] Grünwald, P. D. (1998). The minimum description length principle and reasoning under uncertainty. Ph.D. dissertation, ILLC Dissertation Series 1998-03, Univ. Amsterdam.
[30] Grünwald, P. D. and Dawid, A. P. (2002). Game theory, maximum generalized entropy, minimum discrepancy, robust Bayes and Pythagoras. In Proc. 2002 IEEE Information Theory Workshop (ITW 2002) 94–97. IEEE, New York.
[31] Harremoës, P. and Topsøe, F. (2001). Maximum entropy fundamentals. Entropy 3 191–226. Available at www.mdpi.org/entropy/. · Zbl 1006.94007 · doi:10.3390/e3030191
[32] Harremoës, P. and Topsøe, F. (2002). Unified approach to optimization techniques in Shannon theory. In Proc. 2002 IEEE International Symposium on Information Theory 238. IEEE, New York.
[33] Haussler, D. (1997). A general minimax result for relative entropy. IEEE Trans. Inform. Theory 43 1276–1280. · Zbl 0878.94038 · doi:10.1109/18.605594
[34] Jaynes, E. T. (1957a). Information theory and statistical mechanics. I. Phys. Rev. 106 620–630. · Zbl 0084.43701 · doi:10.1103/PhysRev.106.620
[35] Jaynes, E. T. (1957b). Information theory and statistical mechanics. II. Phys. Rev. 108 171–190. · Zbl 0084.43701 · doi:10.1103/PhysRev.108.171
[36] Jaynes, E. T. (1985). Some random observations. Synthese 63 115–138. · doi:10.1007/BF00485957
[37] Jaynes, E. T. (1989). Papers on Probability, Statistics and Statistical Physics , 2nd ed. Kluwer Academic, Dordrecht. · Zbl 0687.60085
[38] Jones, L. K. and Byrne, C. L. (1990). General entropy criteria for inverse problems, with applications to data compression, pattern classification and cluster analysis. IEEE Trans. Inform. Theory 36 23–30. · Zbl 0731.62016 · doi:10.1109/18.50370
[39] Kapur, J. N. and Kesavan, H. (1992). Entropy Optimization Principles with Applications . Academic Press, New York. · Zbl 0718.62007
[40] Kivinen, J. and Warmuth, M. K. (1999). Boosting as entropy projection. In Proc. Twelfth Annual Conference on Computational Learning Theory (COLT’99) 134–144. ACM Press, New York.
[41] Krob, J. and Scholl, H. R. (1997). A minimax result for the Kullback–Leibler Bayes risk. Econ. Qual. Control 12 147–157. · Zbl 0895.62012
[42] Kullback, S. (1959). Information Theory and Statistics . Wiley, New York. · Zbl 0088.10406
[43] Lafferty, J. (1999). Additive models, boosting, and inference for generalized divergences. In Proc. Twelfth Annual Conference on Computational Learning Theory (COLT’99) 125–133. ACM Press, New York.
[44] Lindley, D. V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist. 27 986–1005. · Zbl 0073.14103 · doi:10.1214/aoms/1177728069
[45] Merhav, N. and Feder, M. (1995). A strong version of the redundancy-capacity theorem of universal coding. IEEE Trans. Inform. Theory 41 714–722. · Zbl 0821.94020 · doi:10.1109/18.382017
[46] von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele. Math. Ann. 100 295–320. · JFM 54.0543.02 · doi:10.1007/BF01448847
[47] Noubiap, R. F. and Seidel, W. (2001). An algorithm for calculating \(\Gamma\)-minimax decision rules under generalized moment conditions. Ann. Statist. 29 1094–1116. · Zbl 1041.62005 · doi:10.1214/aos/1013699995
[48] Pinsker, M. S. (1964). Information and Information Stability of Random Variables and Processes . Holden-Day, San Francisco. · Zbl 0125.09202
[49] Posner, E. (1975). Random coding strategies for minimum entropy. IEEE Trans. Inform. Theory 21 388–391. · Zbl 0319.94012 · doi:10.1109/TIT.1975.1055416
[50] Rao, C. R. (1982). Diversity and dissimilarity coefficients: A unified approach. Theoret. Population Biology 21 24–43. · Zbl 0516.92021 · doi:10.1016/0040-5809(82)90004-1
[51] Rényi, A. (1961). On measures of entropy and information. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1 547–561. Univ. California Press, Berkeley. · Zbl 0106.33001
[52] Rockafellar, R. T. (1970). Convex Analysis . Princeton Univ. Press. · Zbl 0193.18401
[53] Ryabko, B. Y. (1979). Coding of a source with unknown but ordered probabilities. Problems Inform. Transmission 15 134–138. · Zbl 0435.94010
[54] Scholl, H. R. (1998). Shannon optimal priors on IID statistical experiments converge weakly to Jeffreys’ prior. Test 7 75–94. · Zbl 1061.62507 · doi:10.1007/BF02565103
[55] Seidenfeld, T. (1986). Entropy and uncertainty. Philos. Sci. 53 467–491. · doi:10.1086/289336
[56] Shimony, A. (1985). The status of the principle of maximum entropy. Synthese 63 35–53. [Reprinted as Chapter 8 of Shimony (1993).] · doi:10.1007/BF00485954
[57] Shimony, A. (1993). Search for a Naturalistic World View 1 . Cambridge Univ. Press.
[58] Skyrms, B. (1985). Maximum entropy inference as a special case of conditionalization. Synthese 63 55–74. · doi:10.1007/BF00485955
[59] Stoer, J. and Witzgall, C. (1970). Convexity and Optimization in Finite Dimensions. I . Springer, Berlin. · Zbl 0203.52203
[60] Stroock, D. W. (1993). Probability Theory, an Analytic View . Cambridge Univ. Press. · Zbl 0925.60004
[61] Topsøe, F. (1979). Information-theoretical optimization techniques. Kybernetika 15 8–27. · Zbl 0399.94007
[62] Topsøe, F. (2001). Basic concepts, identities and inequalities—the toolkit of information theory. Entropy 3 162–190. Available at www.mdpi.org/entropy/. · Zbl 1006.94012 · doi:10.3390/e3030162
[63] Topsøe, F. (2002). Maximum entropy versus minimum risk and applications to some classical discrete distributions. IEEE Trans. Inform. Theory 48 2368–2376. · Zbl 1062.94532 · doi:10.1109/TIT.2002.800479
[64] Uffink, J. (1995). Can the maximum entropy principle be explained as a consistency requirement? Stud. Hist. Philos. Sci. B Stud. Hist. Philos. Modern Phys. 26 223–262. · Zbl 1222.00018 · doi:10.1016/1355-2198(95)00015-1
[65] Uffink, J. (1996). The constraint rule of the maximum entropy principle. Stud. Hist. Philos. Sci. B Stud. Hist. Philos. Modern Phys. 27 47–79. · Zbl 1222.82010 · doi:10.1016/1355-2198(95)00022-4
[66] van Fraassen, B. (1981). A problem for relative information minimizers in probability kinematics. British J. Philos. Sci. 32 375–379. · doi:10.1093/bjps/32.4.375
[67] Vidakovic, B. (2000). Gamma-minimax: A paradigm for conservative robust Bayesians. Robust Bayesian Analysis . Lecture Notes in Statist. 152 241–259. Springer, New York. · Zbl 1281.62040
[68] Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities . Chapman and Hall, London. · Zbl 0732.62004
[69] Xie, Q. and Barron, A. R. (2000). Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Trans. Inform. Theory 46 431–445. · Zbl 0996.94025 · doi:10.1109/18.825803