Information Theory & Entropy
What does probability have to do with computer science? In 1948, Claude Shannon realized that probability is the fundamental tool for measuring Information.
The Intuition: Surprise!
Information is the measure of how much "surprise" a piece of data gives you.
- If I tell you "it's sunny in the Sahara Desert," I have given you almost zero information. You already expected that.
- If I tell you "it's snowing in the Sahara Desert," I have given you a massive amount of information. You could not have predicted that.
The less probable an event is, the more information it carries.
1. Information Content ($I(x)$)
For a single outcome $x$ with probability $P(x)$, its information content is:
$$I(x) = -\log_2 P(x) = \log_2 \frac{1}{P(x)}$$
We use $\log_2$ because we measure information in bits. One bit of information corresponds to an event with probability $1/2$ (like a fair coin flip).
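As a quick illustration, here is a minimal Python sketch of information content measured in bits. The function name `information_content` is made up for this example and does not come from any library.

```python
import math

def information_content(p: float) -> float:
    """Information content (in bits) of an outcome with probability p, 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # log2(1/p) is the same quantity as -log2(p)
    return math.log2(1.0 / p)

print(information_content(0.5))    # fair coin flip: 1.0 bit
print(information_content(0.125))  # 1-in-8 event: 3.0 bits
print(information_content(1.0))    # certain event: 0.0 bits
```

Halving the probability adds exactly one bit, which is why rare events such as snow in the Sahara score so high.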
2. Shannon Entropy ($H(X)$)
Entropy is the average amount of information produced by a probability distribution. It measures how uncertain we are about an outcome before we observe it:
$$H(X) = -\sum_{x} P(x) \log_2 P(x)$$
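Since entropy is a probability-weighted average of each outcome's information, it can be sketched in a few lines of Python. The helper name `entropy` and the list-of-probabilities input are assumptions made for this example.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: p * log2(1/p) tends to 0 as p -> 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.25] * 4))   # fair four-sided die: 2.0 bits
print(entropy([1.0, 0.0]))   # no uncertainty: 0.0 bits
```

A uniform distribution gives the highest entropy for a given number of outcomes, and a distribution with one certain outcome gives zero.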
Why Does This Matter?
- Compression: Entropy sets a hard theoretical limit on lossless compression. On average, no lossless code can use fewer bits per symbol than the entropy of the source that produced the data.
- Machine Learning: In Neural Networks, we use Cross-Entropy Loss to measure how different our predicted probability distribution is from the true labels.
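To make the machine learning point concrete, here is a rough Python sketch of cross-entropy between a one-hot label and two predictions. The function name and the example probabilities are invented for illustration. It uses base-2 logs to stay in bits, whereas deep learning libraries typically use the natural log.

```python
import math

def cross_entropy(true_probs, predicted_probs):
    """Cross-entropy in bits: -sum over x of p(x) * log2 q(x).

    Assumes predicted_probs is strictly positive wherever true_probs is positive.
    """
    return -sum(
        p * math.log2(q)
        for p, q in zip(true_probs, predicted_probs)
        if p > 0
    )

label = [0.0, 1.0, 0.0]          # one-hot true label: class 1
confident = [0.05, 0.90, 0.05]   # puts most of its mass on the right class
unsure = [0.30, 0.40, 0.30]      # spreads its mass across all classes

# The confident, correct prediction gets the lower loss.
print(cross_entropy(label, confident) < cross_entropy(label, unsure))  # True
```

With a one-hot label the loss reduces to the information content of the true class under the model's prediction, so training pushes that predicted probability toward 1.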
3. KL-Divergence: The "Distance" between Distributions
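Kullback-Leibler (KL) Divergence measures how much information is lost when we use one distribution $Q$ to approximate another distribution $P$:
$$D_{KL}(P \parallel Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$$
If $P$ and $Q$ are identical, the KL-Divergence is zero. The further apart they are, the larger the divergence. It is not a true distance, though, because $D_{KL}(P \parallel Q)$ and $D_{KL}(Q \parallel P)$ generally differ. This is the heart of training AI models to "match" the distribution of human-generated data.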
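A small Python sketch can make this behaviour visible. The function name `kl_divergence` and the two example distributions are assumptions for illustration, and the result is in bits because the logs are base 2.

```python
import math

def kl_divergence(p, q):
    """KL divergence of q from p, in bits, for two discrete distributions over the same outcomes."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue          # outcomes the first distribution never produces contribute nothing
        if q_i == 0:
            return math.inf   # the approximation rules out an outcome that can actually occur
        total += p_i * math.log2(p_i / q_i)
    return total

biased = [0.9, 0.1]
fair = [0.5, 0.5]

print(kl_divergence(biased, biased))  # identical distributions: 0.0
print(kl_divergence(biased, fair))    # about 0.531 bits
print(kl_divergence(fair, biased))    # about 0.737 bits: swapping the arguments changes the result
```

The last two lines differ, which is why "distance" is in quotation marks: KL divergence is not symmetric.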
Test Your Knowledge
Example: Calculating Entropy
Suppose you have a weighted coin where $P(\text{Heads}) = 0.9$ and $P(\text{Tails}) = 0.1$. Calculate the entropy of this coin.
Solution
$$H(X) = -\left(0.9 \log_2 0.9 + 0.1 \log_2 0.1\right) \approx 0.137 + 0.332 = 0.469 \text{ bits}$$
A fair coin has 1 bit of entropy. Because this coin is highly predictable (90% heads), it carries less information/uncertainty.
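To see how the entropy of a coin falls as it becomes more predictable, one could tabulate it for a few biases. The function name `binary_entropy` and the chosen probabilities are assumptions for this sketch.

```python
import math

def binary_entropy(p_heads: float) -> float:
    """Entropy in bits of a coin that lands heads with probability p_heads."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # a zero-probability side contributes nothing
            total += p * math.log2(1.0 / p)
    return total

# Entropy peaks at 1 bit for the fair coin and shrinks as the coin becomes more biased.
for p_heads in (0.5, 0.7, 0.9, 0.99):
    print(f"P(Heads) = {p_heads:.2f} -> {binary_entropy(p_heads):.3f} bits")
```

The row for 0.90 matches the hand calculation of roughly 0.469 bits.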