Information theory, machine learning and artificial intelligence have
overlapped throughout their existence as academic
disciplines. These areas, in turn, overlap significantly with applied
and theoretical statistics.
Arguably the most central concepts in
information theory are entropy, mutual information and relative
entropy (Kullback-Leibler divergence). These quantities are important
also in inference and learning, for example via their manifestation
in the evidence lower bound (variational free energy). Entropy and
mutual information also play important parts in a class of general
bounds on error probability in estimation and decision-making, where
the most basic special case is known as Fano's inequality. Relative
entropy was introduced in parallel in the statistics and information
theory literature, and is a special case of the more general concept
of f-divergence. Divergence is in general an important measure
of "statistical dissimilarity," and plays a fundamental part in
several bounding techniques. A more recent framework that has attracted
considerable attention is the information bottleneck principle, which
in turn has several interesting connections to traditional
rate-distortion theory.
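As a small illustration of the three central quantities named above, the following Python sketch computes entropy, relative entropy and mutual information for discrete distributions. It is not part of the course material: the function names and the binary symmetric channel example are assumptions chosen for illustration, and logarithms are taken base 2, so all values are in bits.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in bits of a discrete distribution given as a list."""
    # Convention: terms with p_i = 0 contribute nothing (0 * log 0 = 0).
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits; infinite if p puts mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total


def mutual_information(joint):
    """I(X;Y) = D(P_XY || P_X x P_Y) in bits, for a joint pmf given as a list of rows."""
    px = [sum(row) for row in joint]            # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (column sums)
    flat_joint = [pxy for row in joint for pxy in row]
    flat_prod = [a * b for a in px for b in py]  # same row-major ordering as flat_joint
    return kl_divergence(flat_joint, flat_prod)


# Hypothetical example: uniform input to a binary symmetric channel
# with crossover probability eps.
eps = 0.1
joint = [
    [0.5 * (1 - eps), 0.5 * eps],
    [0.5 * eps, 0.5 * (1 - eps)],
]
i_xy = mutual_information(joint)
print("H(X)   =", entropy([0.5, 0.5]))
print("I(X;Y) =", i_xy)
```

For this channel the mutual information should agree with one minus the binary entropy of eps, which gives a quick sanity check on the implementation.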
This course will explore these, and several other, relations and
tools at some depth. The goal is to give PhD students in decision and
control, learning, AI, network science, and information theory a solid
introduction to how information-theoretic concepts and tools can be applied to
problems in statistics, decision and learning well beyond their more
traditional use in, for example, communication theory.
The course is registered as FEO3350 and is worth 12 cu's.
Teachers
Mikael Skoglund
and Tobias Oechtering
Prerequisites
Required: Solid working knowledge (at the "advanced undergrad level")
in analysis, linear algebra and probability
Recommended: Information theory, corresponding to FEO3210; (measure-theoretic) probability, corresponding to FEO3230; optimization, corresponding to SF3847
Material
The course will draw on several different sources. The
following is a partial list of recommended textbooks, tutorials,
lecture notes and papers.
 [CT] Cover & Thomas, "Elements of Information Theory"
 [MK] MacKay, "Information Theory, Inference and Learning Algorithms"
 [CK] Csiszár & Körner, "Information Theory"
 [CS] Csiszár & Shields, "Information Theory and Statistics: A
Tutorial"
 [W] Y. Wu, "Information-theoretic methods for high-dimensional statistics" (lecture notes, Yale)
 [PW] Polyanskiy & Wu, "Lecture notes on information theory" (lecture notes, MIT)
 [D] Duchi, "Information theory and statistics" (lecture notes, Stanford)
 [WV] Wu & Verdú, "Rényi information dimension: Fundamental
limits of almost lossless analog compression," IEEE Trans. on IT, Aug. 2010
 [V] Vapnik, "The nature of statistical learning theory," Springer 1995. (Can be accessed from KTH via Springer Link)
 [XR] Xu & Raginsky, "Information-theoretic analysis of generalization capability of learning algorithms," in Proc. NIPS 2017 (see also this lecture by Raginsky)
 [RWY] G. Raskutti, M. J. Wainwright and B. Yu, "Minimax
rates of estimation for high-dimensional linear regression
over $\ell_q$-balls," IEEE Trans. on IT, Oct. 2011
 [CRT] Candès, Romberg & Tao, "Stable signal recovery from incomplete and inaccurate measurements," Commun. Pure and Applied Mathematics, August 2006
 [GP] Z. Goldfeld and Y. Polyanskiy, "The information
bottleneck problem and its applications in machine learning," IEEE
J. Select. Areas in Information Theory, May 2020
Preliminary Schedule 2020
 Lecture 1, November 6: Information theory fundamentals: Entropy, mutual information,
relative entropy, and f-divergence. Total variation and
other distance metrics. Inequalities. [CT,PW,W]
 Lecture 2, November 13: Rate-distortion theory: Cost versus information. Bounds. The Blahut
algorithm. [CT,PW]
 Lecture 3: Limits on information flow and processing: Conditional
mutual information and relative entropy. Data processing
inequalities. Sufficient statistic and the information
bottleneck. Rate-distortion interpretation. [CT,PW,W,GP]
 Lecture 4: Foundations of statistical decision theory: Parameter
estimation. Bayes and minimax risk. Binary hypothesis testing [PW,W]
 Lecture 5: Information bounds on error probability and risk: Sample complexity. The mutual information method and rate-distortion. Fano inequalities. [W,PW,D]
 Lecture 6: Learning and generalization: Information bounds on generalization error. VC dimension and complexity. [XR,D,V]
 Lecture 7: Variational methods: Variational characterization of
divergence, Donsker-Varadhan [PW,W]. Variational inference and the
ELBO [MK]
 Lecture 8: Classical estimation theory: Maximum likelihood, Fisher
information, information bounds, Cramér-Rao, Hammersley-Chapman-Robbins. [CT,W,PW,D,MK]
 Lecture 9: Packing, covering, Fano & minimax risk, metric entropy [W,D]
 Lecture 10: Le Cam's method, Assouad's method, mutual information method
continued. Density estimation. Functional estimation. [W,D]
 Lecture 11: Dimension compression and denoising: Sparse denoising,
compressed sensing, almost lossless analog compression [W,D,RWY,CRT,WV]
 Lecture 12: The method of types [CT,CK,CS]
 Lecture 13: Information theory and large deviations, Stein, Chernoff and
Sanov. Total variation and hypothesis testing. [CT,CK,CS,PW]
 Lecture 14: The geometry of information: Information geometry, information
projection, iterative methods, Expectation-Maximization
[CK,CS,W,PW,D]
The first meeting (on November 6) is held in Room Q2 (Malvinas Väg 10), starting at 9:30
Nov 16: WE ARE PAUSING THE COURSE AFTER LEC 2 BECAUSE OF THE DEVELOPMENT OF THE PANDEMIC, AND THE NEW CAP ON NUMBER OF PEOPLE PER GATHERING
The course is paused until further notice. The assumption is that we can hopefully hold another 12 meetings before Christmas, but we may need to pause until after the Holidays. Check back here for updates. (The new deadline for HW 2 corresponds to the new date for Lec 3, when known.)
Downloads
Lecture slides and homework problems will be posted here.
