Shannon's Entropy Formula
Reference for Shannon entropy H(X) = -Σ p(x) log₂ p(x), measuring information in bits.
Covers data compression, cryptography, and feature selection.
The Formula
H(X) = -Σ p(x) log₂ p(x)

Shannon's entropy measures the average amount of information (in bits) per symbol in a message. Higher entropy means more unpredictability and more bits needed, on average, to encode the data.
Variables
| Symbol | Meaning |
|---|---|
| H | Entropy (measured in bits when using log base 2) |
| p(x) | Probability of each possible symbol or outcome |
| Σ | Sum over all possible symbols |
| log₂ | Logarithm base 2 |
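As a minimal Python sketch of the formula (the function name `shannon_entropy` and the optional `base` argument are illustrative, not part of the original reference):

```python
import math

def shannon_entropy(probs, base=2.0):
    """H = -Σ p(x) log_base p(x) over a probability distribution.

    base=2 gives bits, math.e gives nats, 10 gives hartleys (dits).
    Zero-probability outcomes contribute nothing, since p * log(p) -> 0 as p -> 0.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)
```

A distribution that puts all its mass on one outcome returns 0, and a uniform distribution over n outcomes returns log₂ n, the maximum possible entropy.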
Example 1
Find the entropy of a fair coin flip
Two outcomes: Heads (p = 0.5), Tails (p = 0.5)
H = -(0.5 × log₂(0.5) + 0.5 × log₂(0.5))
H = -(0.5 × (-1) + 0.5 × (-1))
H = 1 bit (maximum entropy for two outcomes)
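A quick check of this result in Python (a hypothetical snippet, not part of the original example):

```python
import math

# Fair coin: p(heads) = p(tails) = 0.5
H = -(0.5 * math.log2(0.5) + 0.5 * math.log2(0.5))
print(H)  # 1.0 bit
```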
Example 2
A source emits A (70%), B (20%), C (10%). Find the entropy.
H = -(0.7 × log₂(0.7) + 0.2 × log₂(0.2) + 0.1 × log₂(0.1))
H = -(0.7 × (-0.515) + 0.2 × (-2.322) + 0.1 × (-3.322))
H = -(−0.360 − 0.464 − 0.332)
H ≈ 1.157 bits per symbol
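The same computation as a short Python check (an assumed snippet, using the probabilities given in the example):

```python
import math

# Source probabilities: A = 0.7, B = 0.2, C = 0.1
probs = [0.7, 0.2, 0.1]
H = -sum(p * math.log2(p) for p in probs)
print(round(H, 3))  # 1.157 bits per symbol
```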
When to Use It
Use Shannon's entropy when:
- Measuring the information content of a data source
- Designing efficient data compression algorithms
- Evaluating the randomness or predictability of data
- Building decision trees in machine learning (information gain)
Key Notes
- Formula: H = −Σ p(x) log₂ p(x): Sum over all possible outcomes x. The log base 2 gives entropy in bits. Using natural log gives nats; log base 10 gives hartleys (dits).
- Maximum entropy means maximum uncertainty: Entropy is maximized when all outcomes are equally likely (uniform distribution). A fair coin (H = 1 bit) has more entropy than a biased coin.
- Zero entropy means certainty: If one outcome has probability 1 and all others 0, entropy is 0 bits — there is no uncertainty at all.
- Foundation of data compression: Shannon's source coding theorem shows that no lossless compression scheme can, on average, encode a source in fewer bits per symbol than its entropy rate. Lossless formats such as ZIP approach this limit; lossy formats like MP3 and JPEG first discard information, then entropy-code what remains.
- Used in machine learning: Decision trees use information gain (the reduction in entropy from a split) to choose which feature to split on; a sketch follows this list. Cross-entropy is the standard loss function for classification models.
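A minimal sketch of information gain for a decision-tree split, assuming toy labels and hypothetical helper names (`entropy`, `information_gain`) that are not from the original:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the weighted entropy of the child groups."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Hypothetical split: 10 labels, and a candidate feature separates them into two groups
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4
print(information_gain(parent, [left, right]))  # ≈ 0.278 bits
```

The tree picks the split whose child groups reduce entropy the most; here the parent has 1 bit of entropy, the split leaves about 0.722 bits of weighted child entropy, and the gain is roughly 0.278 bits.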