Information theory, machine learning and artificial intelligence have
overlapped throughout their existence as academic disciplines. These
areas, in turn, overlap significantly with applied and theoretical
statistics.
Arguably the most central concepts in
information theory are: entropy, mutual information and relative
entropy (Kullback-Leibler divergence). These quantities are also
important in inference and learning, for example via their manifestation
in the evidence lower bound (variational free energy). Entropy and
mutual information also play important parts in a class of general
bounds on error probability in estimation and decision-making, where
the most basic special case is known as Fano's inequality. Relative
entropy was introduced in parallel in the statistics and information
theory literature, and is a special case of the more general concept
of f-divergence. Divergence is in general an important measure
of "statistical dissimilarity," and plays a fundamental part in
several bounding techniques. A more recent framework that has attracted
considerable attention is the information bottleneck principle, which
in turn has several interesting connections to traditional
This course will explore these, and several other, relations and
tools at some depth. The goal is to give PhD students in decision and
control, learning, AI, network science, and information theory a solid
introduction to how information-theoretic concepts and tools can be applied to
problems in statistics, decision and learning well beyond their more
traditional use in, for example, communication theory.
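As a small illustration of the three central quantities named above, the following is a minimal sketch (not course material) that computes entropy, relative entropy and mutual information for discrete distributions. The function names and the example joint distribution (a fair bit sent through a binary symmetric channel with crossover probability 0.1) are assumptions chosen for illustration only.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits of a discrete pmf given as a list."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; infinite if p puts mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def mutual_information(joint):
    """I(X;Y) = D(P_XY || P_X P_Y) in bits, for a joint pmf given as a nested list."""
    px = [sum(row) for row in joint]           # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]     # marginal of Y (column sums)
    flat_joint = [pxy for row in joint for pxy in row]
    flat_prod = [a * b for a in px for b in py]  # same row-major ordering as flat_joint
    return kl_divergence(flat_joint, flat_prod)

# Hypothetical example: X is a fair bit, Y is X flipped with probability 0.1.
joint = [[0.45, 0.05],
         [0.05, 0.45]]
h_x = entropy([0.5, 0.5])
i_xy = mutual_information(joint)
print(f"H(X) = {h_x:.3f} bits, I(X;Y) = {i_xy:.3f} bits")
```

For this channel the mutual information should agree with the closed form 1 - h(0.1), where h is the binary entropy function, which gives a quick sanity check on the sketch.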
The course is registered as FEO3350 and is worth 12 cu's.
Mikael Skoglund and Tobias Oechtering
Required: Solid working knowledge (at the "advanced undergrad level")
in analysis, linear algebra and probability
Recommended: Information theory, corresponding to FEO3210; (measure theoretic) probability, corresponding to FEO3230; optimization, corresponding to SF3847
The course will draw on several different sources. The
following is a partial list of recommended textbooks, tutorials,
lecture notes and papers.
- [CT] Cover & Thomas, "Elements of Information Theory", Wiley. (can be accessed from Wiley via the KTH library)
- [MK] MacKay, "Information Theory, Inference and Learning Algorithms"
- [CK] Csiszár & Körner, "Information Theory"
- [CS] Csiszár & Shields, "Information Theory and Statistics: A
- [CTu] Csiszár & Tusnády, "Information Geometry and Alternating Minimization Procedures," 1984
- [C1] Csiszár, "I-Divergence geometry of probability distributions and minimization problems," The Annals of Probability, Feb. 1975
- [C2] Csiszár, "Iterative Algorithms with an Information Geometry Background" (lecture notes, Rényi Institute)
- [W] Y. Wu, "Information-theoretic methods for high-dimensional statistics" (lecture notes, Yale)
- [PW] Polyanskiy & Wu, "Lecture notes on information theory" (lecture notes, MIT)
- [D] Duchi, "Information theory and statistics" (lecture notes, Stanford)
- [WV] Wu & Verdú, "Rényi information dimension: Fundamental
limits of almost lossless analog compression," IEEE Trans. on IT, Aug. 2010
- [S] Schervish, "Theory of statistics," Springer 1995. (Can be accessed from KTH via Springer Link)
- [V] Vapnik, "The nature of statistical learning theory," Springer 1995. (Can be accessed from KTH via Springer Link)
- [BBL] Bousquet, Boucheron & Lugosi, "Introduction to Statistical Learning Theory," Springer 2003
- [Wa] Wainwright, "High-dimensional statistics," Cambridge 2019.
- [XR] Xu & Raginsky, "Information-theoretic analysis of generalization capability of learning algorithms," in Proc. NIPS 2017 (see also this lecture by Raginsky)
- [RWY] G. Raskutti, M. J. Wainwright and B. Yu, "Minimax
rates of estimation for high-dimensional linear regression
over $\ell_q$ balls," IEEE Trans. on IT, Oct. 2011
- [CRT] Candès, Romberg & Tao, "Stable signal recovery from incomplete and inaccurate measurements," Commun. Pure and Applied Mathematics, August 2006
- [GP] Z. Goldfeld and Y. Polyanskiy, "The information
bottleneck problem and its applications in machine learning," IEEE
J. Select. Areas in Information Theory, May 2020
Preliminary Schedule 2020-21
MS = Mikael Skoglund, TO = Tobias Oechtering
- Lecture 1, November 6 [MS]: Information theory fundamentals: Entropy, mutual information,
relative entropy, and f-divergence. Total variation and
other distance metrics. Inequalities. [CT,PW,W]
- Lecture 2, November 13 [MS]: Rate-Distortion theory: Cost versus information. Bounds. The Blahut
- Lecture 3, February 12 (starting at 10:00, Q2) [MS]: Limits on information flow and processing: Conditional mutual information and relative entropy. Data processing
inequalities. Sufficient statistic and the information
bottleneck. Rate-distortion interpretation [CT,PW,W,GP]
- Lecture 4, February 19 (starting at 15:00, Q2) [MS]: Foundations of statistical decision theory: Parameter estimation. Bayes and minimax risk. Binary hypothesis testing [PW,W,S]
- Lecture 5, February 26 (starting at 10:00, Q2) [MS]: Information bounds on error probability and risk: Sample complexity. The mutual information method and rate-distortion. Fano inequalities. [W,PW,D]
- Lecture 6, April 26 (starting at 10:00 over Zoom) [MS]: Learning and generalization: Information bounds on generalization error. VC dimension and complexity. [XR,BBL,D,V]
- Lecture 7, May 3 (starting at 10:00 over Zoom) [MS]: Variational methods: Variational characterization of divergence, Donsker-Varadhan [PW,W]. Variational inference and the
- Lecture 8, May 11 (10:00 over Zoom) [TO]: Classical estimation theory: Maximum likelihood, Fisher information, information bounds, Cramér-Rao, Hammersley-Chapman-Robbins. [CT,W,PW,D,S,MK]
- Lecture 9, May 17 (14:00 over Zoom) [TO]: Packing, covering, Fano & minimax risk, metric entropy [W,D,Wa]
- Lecture 10, May 24 (10:00 over Zoom) [TO]: Le Cam's method, Assouad's method, mutual information method continued. Density estimation. Functional estimation. [W,D,Wa]
- Lecture 11, May 31 (13:00 over Zoom) [MS]: Dimension compression and denoising: Sparse denoising, compressed sensing, almost lossless analog compression [W,D,RWY,CRT,WV]
- Lecture 12, June 8 (10:00 over Zoom) [TO]: The method of types [CT,CK,CS]
- Lecture 13, June 14 (10:00, Q2) [TO]: Information theory and large deviations, Stein, Chernoff and Sanov. Total variation and hypothesis testing. [CT,CK,CS,PW]
- Lecture 14, June 21 (10:00, Q2) [MS]: The geometry of information: Information geometry, information projection, iterative methods, Expectation-Maximization
The first meeting (on November 6) is held in Room Q2 (Malvinas Väg 10), starting at 9:30
APR 16, 2021: We start again on April 26, over Zoom. We will give at least the next two lectures (6 and 7) over Zoom. The links are posted in the schedule above. For HW problems 5 and 6 (due April 26 and May 3), please scan/photograph/typeset your solutions and send them by email to Mikael: firstname.lastname@example.org. Please also indicate which problems you would have been prepared to present in class.
MAY 17, 2021: Our present assumption is that we will go back to physical meetings starting from Lec 11 (May 31). We will announce here at the latest on May 28 whether Lec 11 will be over Zoom or physical.
MAY 26, 2021: We have decided to also give Lecs 11 and 12 over Zoom. Most likely the final two lectures will be given over Zoom too. The decision will be announced here later.
JUNE 9, 2021: We will give the two final lectures physically, in room Q2. See the schedule above.
Lecture slides and homework problems will be posted here.
lecture 1, homework 1
lecture 2, homework 2
lecture 3, homework 3
lecture 4, homework 4
lecture 5, homework 5
lecture 6, homework 6
lecture 7, homework 7
lecture 8, homework 8
lecture 9, homework 9
lecture 10, homework 10
lecture 11, homework 11
lecture 12, homework 12
lecture 13, homework 13