Intro to Information Theory: Claude Shannon, Entropy, Redundancy, Data Compression & Bits
Summary of Previous Video:
In the previous video I discussed several definitions of information and I mainly concentrated on Gregory Bateson’s definition, which describes information as “a difference that makes a difference”. I modified his definition to information is a “perceived difference that can make a difference” and discussed how information has the potential to be used as the basic currency for describing reality.
Reality is a subjective experience, it is perception, it is qualia. In this way, information represents the connection between the observer and the observed, it encapsulates the link between the subject and the object, the relationship between what we conceptualise as the external world and what it feels like to experience this external world from within.
Therefore, using information to describe reality also embodies the recognition that, in many contexts, it doesn’t make sense anymore to speak of things in themselves, but that all we can possibly describe are perceived distinctions, perceived properties, perceived patterns and regularities in Nature.
Information has the potential to be assigned a meaning and therefore it has the potential to become knowledge. Meaning is subjective so information doesn’t really have objective intrinsic meaning.
Introduction to Information Theory:
Now, you may be wondering what science has to say about all this. How does science currently define information? Has science been able to objectively quantify information? The answer to these questions lies at the heart of Information Theory.
Information Theory is a branch of mathematics and computer science which studies the quantification of information. As you have probably realised by now, the concept of information can be defined in many different ways. Clearly, if we want to quantify information, we need to use definitions which are as objective as possible.
While subjectivity can never be completely removed from the equation (reality is, after all, always perceived and interpreted in a subjective manner) we will now explore a definition of information that is much more technical and objective than the definitions we discussed in the previous video.
This is Claude Shannon, an American mathematician and electronic engineer who is now considered the "Father of Information Theory". While working at Bell Laboratories, he formulated a theory which aimed to quantify the communication of information.
Shannon's article titled "A Mathematical Theory of Communication", published in 1948, as well as his book "The Mathematical Theory of Communication", co-written with mathematician Warren Weaver and published in 1949, truly revolutionised science.
Shannon's theory tackled the problem of how to transmit information most efficiently through a given channel as well as many other practical issues such as how to make communication more secure (for instance, how to tackle unauthorised eavesdropping).
Thanks to Shannon's ideas on signal processing, data compression, as well as data storage and communication, useful applications have been found in many different areas. For instance, lossless data compression is used in ZIP files, while lossy data compression is used in other types of files such as MP3s or JPGs. Other important applications of Information Theory are within the fields of cryptography, thermal physics, neurobiology or quantum computing, to name just a few.
It is important to realise that Shannon's Theory was created to find ways to optimise the physical encoding of information, to find fundamental limits on signal processing operations. Because of this, Shannon's definition of information was originally associated with the actual symbols which are used to encode a message (for instance, the letters or words contained in a message), and not intended to relate to the actual interpretation, assigned meaning or importance of the message.
As we have seen in the previous video, information itself does not really have intrinsic objective meaning. What Shannon's Theory aimed to tackle were practical issues completely unrelated to qualitative properties such as meaningfulness, but more related to the actual encoding used in the transmission of a particular message.
Therefore, what we are dealing with here is a very different definition of information than those we have discussed so far. Shannon defines information as a purely quantitative measure of communication exchanges. As we will see, Shannon's definition represents a way to measure the amount of information that can potentially be gained when one learns of the outcome of a random process. And it is precisely this probabilistic nature of Shannon's definition that turns out to be a very useful tool; one that physicists (and other science disciplines) now use in many different areas of research.
Shannon's information is in fact known as Shannon's entropy (Legend says that it was the mathematician John von Neumann who suggested that Shannon use this term, instead of information). In general, I will refer to Shannon's definition as Shannon's entropy, information entropy or Shannon's information, to avoid confusion with other definitions of information or with the concept of thermodynamical entropy.
Shannon's Entropy Formula – Uncertainty – Information Gain & Information Content:
So what is Shannon's definition then? Well, the amount of information contained in a given message is defined as the negative of a certain sum of probabilities. Don't worry, we are not going to deal with the actual details of the equation here, but instead I will give you a flavour of what it means. Suppose we have a probabilistic process which has a certain number of possible outcomes, each with a different probability of occurring.
Let's call the total number of possible outcomes N and the probabilities of each outcome p(1), p(2), p(3), ….., p(N). For instance, let's say that the probabilistic process we are dealing with is the throw of a coin. In this case, the total number of possible outcomes is 2, so N=2 (that is, heads and tails – let's call heads 1 and tails 2). The probabilities associated with each of these possible outcomes, assuming a fair coin, is ½. So we have p(1)= ½ and p(2)= ½ (which is the same as saying each possible outcome, heads or tails, has a 50% probability of occurring).
Shannon showed that the entropy (designated by the letter H) is equivalent to the potential information gain once the experimenter learns the outcome of the experiment, and is given by the following formula:
This formula implies that the more entropy a system has, the more information we can potentially gain once we know the outcome of the experiment. Shannon's entropy can be thought of as a way to quantify the potential reduction in our uncertainty once we have learnt the outcome of the probabilistic process.
Don't worry if you feel a bit confused right now. I'll explain the concept of Shannon's entropy for you in very easy terms. Let's go back to the coin example:
The coin is fair, so the probability of heads is the same as the probability of tails (that is, ½ each). Let's consider the event of throwing the coin. Plugging the given probabilities into the equation gives us a Shannon entropy, that is, an information content of one bit, because there are two possible outcomes and each has equal probability. Once we have thrown the coin and we have learnt its outcome, we can say that we have gained one bit of information, or alternatively, we can say that our uncertainty has been reduced by one bit.
Now imagine you have a coin which has two heads. In this case, N=1, that is, there is only one possible outcome. The likelihood of obtaining heads is therefore equal to 1 (a probability equal to 1 means absolute certainty, 100%). Since the uncertainty is zero in this case, Shannon's entropy is zero, and so is the information content. There is no longer the presence of two different alternatives here. The information we gain after throwing the coin is therefore, zero. Look at it this way: we already knew with certainty what was going to happen in advance, so there is no potential gain in information after learning the outcome.
Another way of describing Shannon's entropy is to say that it represents the amount of information the experimenter lacks prior to learning the outcome of a probabilistic process.
Hence, according to Shannon's formula, a message's entropy is maximised when the occurrence of each of its individual parts is equally probable. What this means is that we will gain the largest amount of Shannon's information when dealing with systems whose individual possible outcomes are equally likely to occur (for instance, throwing a fair coin or rolling a fair die, both systems having a set of possible outcomes which are all equally likely to occur).
Shannon's entropy is a measure of the potential reduction in uncertainty in the receiver's knowledge. We can see the process of gaining information as equivalent to the process of losing uncertainty. You may be wondering what all this has to do with the actual content and encoding of a message, since so far we have only been talking of coins. Here's another example, which illustrates the usefulness of Shannon's formula when it comes to written language.
Entropy per Character – Written English Language:
Can we estimate the information entropy of the written English language? Consider starting with one particular letter which is picked at random. Knowing this first letter, you then want to estimate the probability of getting another particular letter after that one, and the probability of getting another letter after that first and second one, and so on. Knowing these probabilities is what we need in order to calculate the information entropy associated with the English text.
If we assume we are dealing with 27 characters (that is, 26 letters plus space), and that all of these characters are equally probable, then we have an information entropy of about 4.8 bits per character. But we know that the characters are not equally probable; for instance, the letter E has the highest frequency of occurrence, while the letter Z has the lowest. This is related to the concept of redundancy, which is nothing more than the number of constraints imposed on the text of the English language: for example, the letter Q is always followed by U, and we also have rules such as "I before E except after C", and so on.
There are various methods for calculating the information entropy of the written English language. For instance, Shannon's methods – which take into account many factors, including redundancy and contextuality for instance – give the English language text an information entropy of between 0.6 and 1.3 bits per character.
So, if we compare this with the previous result of around 4.8 bits per character, we can see that the constraints imposed by factors such as redundancy have the overall effect of reducing the entropy per character. What this means is that finding the amount of redundancy in a language can help us find the minimum amount of information needed to encode a message. And of course, this leads us to the important concept of data compression.
Data Compression – Redundancy:
In information theory, the redundancy in a message is the number of bits used to encode it minus the number of bits of Shannon’s information contained in the message.
Redundancy in a message is related to the extent to which it is possible to compress it. What lossless data compression does is reduce the number of bits used to encode a message by identifying and eliminating statistical redundancy . Hence:
The more redundancy there is in a message, the more predictability we have à that means less entropy per encoded symbol à hence the higher the compressibility
When we compress data, we extract redundancy. When we compress a message, what we do is encode the same amount of Shannon’s information by using less bits. Hence, we end up having more Shannon’s information per encoded symbol, more Shannon’s information per bit of encoded message. A compressed message is less predictable, since the repeated patterns have been eliminated; the redundancy has been removed.
In fact, Shannon’s entropy represents a lower limit for lossless data compression: the minimum amount of bits that can be used to encode a message without loss. Shannon's source coding theorem states that a lossless data compression scheme cannot compress messages, on average, to have more than one bit of Shannon’s information per bit of encoded message.
Calculating the redundancy and the information entropy of the English language has therefore many practical applications. For instance, ASCII codes (which are codes that represent text in computers, communications equipment, and other devices that use text) allocate exactly 8 bits per character.
But this is very inefficient when we consider Shannon's and other similar calculations which, as we have seen, give us an information entropy of around 1 bit per character. Put another way, a smaller amount of bits can be used to store the same text. What this means, is that, in theory, there exists a compression scheme which is 8 times more effective than ASCII.
Let’s use an example to see how lossless data compression works. We will use Huffman coding, an algorithm developed by electrical engineer David Huffman in 1952. Huffman coding is a variable length code which assigns codes using the estimated probability of occurrence of each source symbol. For instance, let’s say we take as symbols all the letters of the English alphabet plus space. Huffman coding assigns codes with varying length depending on the frequency of occurrence of each symbol. Just as with Morse code, the most frequent symbols are assigned the shortest codes and the less frequent symbols are assigned the longest codes.
Let’s take a piece of text written in English, which is long enough so that we can approximate our calculations by using standard frequency tables for the letters of the written English language. The most frequent symbols, such as space and the letter e will be assigned the shortest codes, while the least frequent symbols, such as the letters q and z will be assigned the longest codes. Applying the Huffman algorithm using standard frequency tables, we obtain the Huffman codes given on this table. As you can see, the lengths of the codes vary from 3 to 11 bits, depending on the character.
We are going to use this short passage, from Brave New World, by Aldous Huxley:
“All right then," said the savage defiantly, I'm claiming the right to be unhappy."
"Not to mention the right to grow old and ugly and impotent; the right to have syphilis and cancer; the right to have too little to eat, the right to be lousy; the right to live in constant apprehension of what may happen tomorrow; the right to catch typhoid; the right to be tortured by unspeakable pains of every kind."
There was a long silence.
"I claim them all," said the Savage at last.”
Excluding punctuation and just counting letters and spaces, we have a total of 462 characters to encode. If we encode this piece of text using ASCII, we will need to use a total of 3,696 bits (8 bits per character). However, if we use Huffman code, we will only need to use a total of 1,883 bits (an average of about 4.1 bits per character).
So we see that by using Huffman encoding instead of ASCII, we can store the same amount of information, but twice as effectively. Although Huffman coding is more efficient, we are still far from reaching the limit given by Shannon’s entropy, which as we saw earlier on, can be approximated to be around 1 bit per character.
By using Huffman’s code, we have managed to extract a certain amount of redundancy, but we are still far from the limit where all the redundancy would have been extracted: that limit is Shannon’s entropy.
There are however many other compression techniques; for instance, there is a technique called arithmetic coding, which can extract a lot more redundancy than Huffman coding, and hence it can create compressed messages where the average number of bits per character is much closer to Shannon’s entropy.
So by using this particular example, we have seen how the concept of Shannon's entropy, in this case calculated from the probabilities of occurrence associated with the letters belonging to the words of a particular language, has very important applications; data compression being one of them. Summarising:
Shannon's entropy is a measure of uncertainty, of unpredictability, and also a measure of information content, of potential information gain.
Shannon’s entropy can also represent a lower limit for lossless data compression: the minimum amount of bits that can be used to encode a message without loss.
Also note that with this definition, more information content has nothing to do with its quality. So in this sense, a larger amount of Shannon's entropy does not necessarily imply a better quality of its content (an example of two subjective concepts which could be linked to quality are meaningfulness or importance).
I will expand on this topic in the following video, when we discuss Norbert Wiener’s ideas.
Shannon's Bit vs Storage Bit:
Now, there is one point that needs clarifying. In this past section of the video, while discussing concepts such as compression and redundancy, we have actually been talking about different kinds of bits, that is, bits which represent different concepts.
You may recall that in the previous video we defined the bit as a variable which can have two possible values, which we represent by the digits 0 and 1. This is the most popular definition, one that is usually associated with the storage or transmission of encoded data. In this way, one bit is the capacity of a system which can exist in only two states.
In information theory, however, the bit can be defined in a different way. As we have seen, it can be a unit of measurement for Shannon’s information. In this context, one bit is defined as the uncertainty associated with a binary random variable that can be in one of two possible states with equal probability.
Put another way, one Shannon bit is the amount of entropy that is present in the selection of two equally probable options, it is the information that is gained when the value of this variable becomes known. Remember, this is exactly what we showed earlier on when we applied Shannon’s formula to the throw of a fair coin.
Well, it turns out that the first time the word “bit” appeared in a written document was precisely in Shannon’s ground-breaking paper “A mathematical theory of communication“. In it, Shannon clearly states that it was mathematician John Tukey, a fellow Bell Labs researcher, who coined the term “bit”, short for binary digit.
While Tukey’s binary digit is a unit related to the encoding of data, Shannon’s bit is a unit that is related to uncertainty, unpredictability. In one Tukey bit of encoded data there is often less than one bit of Shannon’s information. Why?
Well, Shannon’s entropy is maximised when all possible states are equally likely, and one Shannon bit is defined in relation to a system whose states are equally likely. Therefore, in a binary system, when the two possible states are not equally likely, such as the toss of a biased coin, Shannon’s information is less than 1 bit.
Recall Shannon's source coding theorem, which states that a lossless data compression scheme cannot compress messages, on average, to have more than one bit of Shannon’s information per bit of encoded message. Well, now you know that these bits are different kinds of bits, in the sense that they represent different concepts. The latter refers to the encoding, storage or transmission of binary data, redundant or not, whereas the former refers to amounts of Shannon’s information, as defined by his entropy equation. An encoded bit of data contains, at most, one bit of Shannon’s information, usually a lot less.
In information theory, therefore, it is important to understand the distinction between encoded binary data and information. In this context, the word information is used in reference to Shannon’s entropy, a much more abstract concept that does not equate with the actual encoding of zeros and ones.
Now, so far we have discussed the important concepts of entropy, uncertainty, information and redundancy… and we have seen how Shannon’s Source Coding Theorem provides us with a theoretical lower bound for data compression. Hopefully you now understand the relationship between entropy, uncertainty reduction, information gain, information content, redundancy and compression.
We have covered the most important theoretical aspects of Information Theory which establish the limits to how efficiently a particular message can be encoded.
But what does Shannon have to say about sending a message not only efficiently but also reliably though a given channel? How does one transmit information reliably in the presence of noise? That is, in the presence of physical disturbances which can introduce random errors in the encoding of a particular message, what are the theoretical limits to detecting and correcting these errors? This leads us to the very important topic of Error-Detection and Correction.
… to be continued 🙂
Coming up in Part 2b:
– Shannon’s Noisy Channel Coding Theorem
– Adding redundancy bits to a message (called parity or check bits) to detect and correct errors
– Error-correcting codes – What are Hamming codes?
– Example using a Hamming code
– What are doubly-even self-dual linear binary error-correcting block codes?
– James Gates discovery: error-correcting codes found buried inside the fundamental equations of physics?
– Little side journey: Cosmological and biological evolution. The Universe as a self-organised system. How does Nature store information? Nature’s use of redundancy and error-correcting codes. Information, the Laws of Physics and unknown organising principles: fixed or evolving? Biology, DNA and information.
References – Further Information:
Entropy & Information Gain Lecture – Nick Hawes (very good at clarifying difference between individual information of each outcome and the expectation value of information, that is, Shannon’s entropy)