FEO3350 Information Theory for Statistics and Learning

Information theory, machine learning and artificial intelligence have been overlapping fields during their whole existence as academic disciplines. These areas, in turn, overlap significantly with applied and theoretical statistics.

Arguably the most central concepts in information theory are: entropy, mutual information and relative entropy (Kullback-Leibler divergence). These entities are important also in inference and learning, for example via their manifestation in the evidence lower bound (variational free energy). Entropy and mutual information also play important parts in a class of general bounds to error probability in estimation and decision-making, where the most basic special case is known as Fano's inequality. Relative entropy was introduced in parallel in the statistics and information theory literature, and is a special case of the more general concept of f-divergence. Divergence is in general an important measure of "statistical dissimilarity," and plays an fundamental part in several bounding techniques. A more recent framework that has caught considerable attention is the information bottleneck principle, which in turn has several interesting connections to traditional rate-distortion theory.

This course will explore these, and several other, relations and tools at some depth. The goal is to give PhD students in decision and control, learning, AI, network science, and information theory a solid introduction to how information-theoretic concepts and tools can be applied to problems in statistics, decision and learning well beyond their more traditional use in, for example, communication theory.

The course is registered as FEO3350 and is worth 12 cu's.

Teachers

Mikael Skoglund and Tobias Oechtering

Prerequisites

Required: Solid working knowledge (at the "advanced undergrad level") in analysis, linear algebra and probability

Recommended: Information theory, corresponding to FEO3210; (measure theoretic) probability, corresponding to FEO3230; optimization, corresponding to SF3847

Material

Teaching the course will draw from several different sources. The following is a partial list of recommended textbooks, tutorials, lecture-notes and papers.

[PW] Polyanskiy & Wu, "Information Theory: From coding to learning" (book draft)
[CT] Cover & Thomas, "Elements of Information Theory", Wiley. (can be accessed from Wiley via the KTH library)
[MK] MacKay, "Information Theory, Inference and Learning Algorithms"
[CK] Csiszar & Körner, "Information Theory"
[CS] Csiszar & Shields, "Information Theory and Statistics: A Tutorial"
[CTu] Csiszar & Tusnady, Information Geomentry and Alternating Minimization Procedures, 1984
[C1] Csiszar, "I-Divergence geometry of probability distributions and minimization problems," The Annals of Probability, Feb. 1975
[C2] Csiszar, "Iterative Algorithms with an Information Geometry Background (lecture notes, Renyi Institute)
[D] Duchi, "Information theory and statistics" (lecture notes, Stanford)
[WV] Wu & Verdú, "Rényi information dimension: Fundamental limits of almost lossless analog compression,'' IEEE Trans. on IT Aug. 2010
[BBL] Bousquet, Boucheron & Lugosi, "Introduction to Statistical Learning Theory," Springer 2003
[Wa] Wainwright, "High-dimensional statistics," Cambridge 2019.
[XR] Xu & Raginsky, "Information-theoretic analysis of generalization capability of learning algorithms," in Proc. NIPS 2017
[HDGR] Hellström, Durisi, Guedj & Raginsky, "Generalization bounds: Perspectives from information theory and PAC-Bayes," arXiv:2309.04381, 2023
[RGTS] B. Rodríguez Gálvez, R. Thobaben and M. Skoglund, "An information-theoretic approach to generalization theory," arXiv:2408.13275, 2024
[RWY] G. Raskutti, M. J. Wainwright and B. Yu, "Minimax rates of estimation for high-dimensional linear regression over $\ell_q$ balls," IEEE Trans. on IT, Oct. 2011
[CRT] Candés, Romberg & Tao, "Stable signal recovery from incomplete and inaccurate measurements," Commun. Pure and Applied Mathematics, August 2006
[GP] Z. Goldfeld and Y. Polyanskiy, "The information bottleneck problem and its applications in machine learning," IEEE J. Select. Areas in Information Theory, May 2020

Preliminary Schedule 2023

MS = Mikael Skoglund, TO = Tobias Oechtering

Lecture 1 [MS], Nov. 10 (Fri), 10:00-12:00, HN: Information theory fundamentals: Entropy, mutual information, relative entropy, and f-divergence. Total variation and other distance metrics. Inequalities. [PW,CT]
Lecture 2 [MS], Nov. 17 (Fri), 10:00-12:00, HN: Rate-Distortion theory: Cost versus information. Bounds. The Blahut algorithm. [PW,CT]
Lecture 3 [MS], Nov. 23 (Thu), 10:00-12:00, HN: Limits on information flow and processing: Conditional mutual information and relative entropy. Data processing inequalities. Sufficient statistic and the information bottleneck. Rate-distortion interpretation [PW,CT,GP]
Lecture 4 [MS], Dec. 1 (Fri), 10:00-12:00, HN: Foundations of statistical decision theory: Binary hypothesis testing. Parameter estimation. Bayes and minimax risk. [PW,CT]
Lecture 5 [MS], Dec. 8 (Fri), 10:00-12:00, HN: Information bounds on error probability and risk: Sample complexity. The mutual information method and rate-distortion. Fano inequalities. [PW,D]
Lecture 6 [MS], Dec. 15 (Fri), 10:00-12:00, GD: Learning and generalization: Information bounds on generalization error. VC dimension and complexity. [HDGR,XR,BBL]
Lecture 7 [MS]: Dec. 21 (Thu), 10:00-12:00, HN: Variational methods: Variational characterization of divergence, Donsker-Varadhan [PW]. Variational inference and the ELBO [MK,PW]
Lecture 8 [TO]: Jan. 12 (Fri), 10:00-12:00, HN: Classical estimation theory: Maximum likelihood, Fischer information, information bounds, Cramér-Rao, Hammersley-Chapman-Robbins. [CT,PW,D,S,MK]
Lecture 9 [TO]: Jan. 19 (Fri), 10:00-12:00, HN: Packing, covering, Fano & minimax risk, metric entropy [PW,D,Wa]
Lecture 10 [TO]: Jan. 26 (Fri), 10:00-12:00, HN: Le Cam's method, Assouad's method, mutual information method continued. Density estimation. Functional estimation. [PW,D,Wa]
Lecture 11 [MS]: Feb. 2 (Fri), 10:00-12:00, HN: Dimension compression and denoising: Sparse denoising, compressed sensing, almost lossless analog compression [PW,D,RWY,CRT,WV]
Lecture 12 [TO]: Feb. 8 (Thu), 10:00-12:00, HN: The method of types [CT,CK,CS]
Lecture 13 [TO]: Feb. 15 (Thu), 16:00-18:00, HN: Information theory and large deviations, Stein, Chernoff and Sanov. Total variation and hypothesis testing. [CT,CK,CS,PW]
Lecture 14 [MS]: Feb. 23 (Fri), 10:00-12:00, HN: The geometry of information: Information geometry, information projection, iterative methods, Expectation-Maximization [C1,C2,CTu,CK,CS]

Meeting rooms:

HN = Harry Nyquist, Malvinas väg 10, floor 7
GD = Gustav Dahlander, Teknikringen 31, floor 3

Downloads

Lecture slides and homework problems will be posted here.