Probability
Information Theory
Entropy & Information

Information Theory & Entropy

πŸ’Ύ

What does probability have to do with computer science? In 1948, Claude Shannon realized that probability is the fundamental tool for measuring Information.

The Intuition: Surprise!

Information is the measure of how much "surprise" a piece of data gives you.

  • If I tell you "it's sunny in the Sahara Desert," I have given you zero information. You already knew that.
  • If I tell you "it's snowing in the Sahara Desert," I have given you a massive amount of information. You could not have predicted that.

The less probable an event is, the more information it carries.


1. Information Content (I(x)I(x))

For a single outcome xx with probability P(x)P(x), its information content is:

I(x)=log⁑2(1P(x))=βˆ’log⁑2P(x)I(x) = \log_2 \left( \frac{1}{P(x)} \right) = - \log_2 P(x)

We use log⁑2\log_2 because we measure information in bits. One bit of info corresponds to an event with probability 0.50.5 (like a fair coin flip).


2. Shannon Entropy (HH)

Entropy is the average amount of information produced by a probability distribution. It measures the total uncertainty of a system.

H(X)=E[I(X)]=βˆ’βˆ‘x∈XP(x)log⁑2P(x)H(X) = E[I(X)] = - \sum_{x \in X} P(x) \log_2 P(x)

Why Does This Matter?

  1. Compression: Entropy sets a hard physical limit on how much a file can be compressed. You can never compress a file smaller than its entropy without losing data.
  2. Machine Learning: In Neural Networks, we use Cross-Entropy Loss to measure how different our predicted probability distribution is from the true labels.

3. KL-Divergence: The "Distance" between Distributions

Kullback-Leibler (KL) Divergence measures how much information is lost when we use one distribution QQ to approximate another distribution PP.

DKL(Pβˆ₯Q)=βˆ‘P(x)log⁑(P(x)Q(x))D_{KL}(P \parallel Q) = \sum P(x) \log \left( \frac{P(x)}{Q(x)} \right)

If PP and QQ are identical, the KL-Divergence is zero. The further apart they are, the larger the divergence. This is the heart of training AI models to "match" the distribution of human-generated data.

Test Your Knowledge

Example: Calculating Entropy

Suppose you have a weighted coin where P(Heads)=0.9P(\text{Heads}) = 0.9 and P(Tails)=0.1P(\text{Tails}) = 0.1. Calculate the Entropy of this coin.

View Step-by-Step Solution

Entropy H=βˆ’(0.9log⁑20.9+0.1log⁑20.1)H = - (0.9 \log_2 0.9 + 0.1 \log_2 0.1) Hβ‰ˆβˆ’(0.9Γ—βˆ’0.152+0.1Γ—βˆ’3.32)H \approx - (0.9 \times -0.152 + 0.1 \times -3.32) Hβ‰ˆβˆ’(βˆ’0.137βˆ’0.332)=0.469Β bitsH \approx - (-0.137 - 0.332) = 0.469 \text{ bits}

A fair coin has 1 bit of entropy. Because this coin is highly predictable (90% heads), it carries less information/uncertainty.