I felt like I finally understood Shannon entropy when I realized that it's a subjective quantity -- a property of the observer, not the observed.
The entropy of a variable X is the amount of information required to drive the observer's uncertainty about the value of X to zero. As a corollary, your uncertainty and mine about the value of the same variable X could be different. This is trivially true, as we could each have received different information about X. H(X) should be H_{observer}(X), or even better, H_{observer, time}(X).
As clear as Shannon's work is in other respects, he glosses over this.
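To make the "amount of information required" part concrete, here is a minimal sketch (plain Python; the distributions are made up for illustration) -- two observers with different information about the same X assign different distributions, and hence compute different entropies:

    import math

    def shannon_entropy(probs):
        # Entropy in bits of a discrete distribution: H = -sum(p * log2(p)).
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Two observers, same X, different information, different distributions.
    prior = [0.25, 0.25, 0.25, 0.25]       # knows nothing: 2 bits of uncertainty
    informed = [0.97, 0.01, 0.01, 0.01]    # has seen strong evidence: ~0.24 bits

    print(shannon_entropy(prior))      # 2.0
    print(shannon_entropy(informed))   # ~0.24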
This doesn't really make entropy itself observer dependent. (Shannon) entropy is a property of a distribution. It's just that when you're measuring different observers' beliefs, you're looking at different distributions (which can have different entropies the same way they can have different means, variances, etc).
Right but in chemistry class the way it’s taught via Gibbs free energy etc. makes it seem as if it’s an intrinsic property.
Entropy in physics is usually the Shannon entropy of the probability distribution over system microstates given known temperature and pressure. If the system is in equilibrium then this is objective.
Entropy in Physics is usually either the Boltzmann or the Gibbs entropy, and both of those men were dead before Shannon was born.
That's not a problem, as the GP's post is trying to state a mathematical relation, not a historical attribution. Often newer concepts shed light on older ones. As Baez's article says, Gibbs entropy is Shannon's entropy of an associated distribution (multiplied by the constant k).
It is a problem because all three come with baggage. Almost none of the things discussed in this thread are valid when discussing actual physical entropy, even though the equations are superficially similar. And then there are lots of people being confidently wrong because they assume that it’s just one concept. It really is not.
Don't see how the connection is superficial. Even the classical macroscopic definition of entropy as ΔS = ∫ dQ/T can be derived from the information theory perspective, as Baez shows in the article (using entropy-maximizing distributions and Lagrange multipliers). If you have a more specific critique, it would be good to discuss.
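The shape of that derivation, roughly (this is the standard maximum-entropy textbook route, not a transcription of Baez's article):

    \max_{\{p_i\}} \; S = -k \sum_i p_i \ln p_i
    \quad \text{subject to} \quad \sum_i p_i = 1, \qquad \sum_i p_i E_i = \langle E \rangle .

    \text{Lagrange multipliers give the canonical distribution}\quad
    p_i = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_i e^{-\beta E_i}, \qquad \beta = \frac{1}{kT},

    \text{and substituting back gives } dS = \frac{\delta Q}{T} \text{ for quasi-static changes, i.e. } \Delta S = \int \frac{dQ}{T}.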
In classical physics there is no real objective randomness. Particles have a defined position and momentum, and those evolve deterministically. If you somehow learned these, then the Shannon entropy is zero. If entropy is zero then all kinds of things break down.
So now you are forced to consider e.g. temperature an impossibility without quantum-derived randomness, even though temperature does not really seem to be a quantum thing.
Which we don’t know precisely. Entropy is about not knowing.
Minus infinity. Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space. (You need an appropriate extension of Shannon’s entropy to continuous distributions.)
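Concretely, using the differential (continuous) entropy and a distribution that is uniform over a phase-space region of volume V (a sketch; units and the factor of k are omitted):

    h(p) = -\int p(x)\, \ln p(x)\, dx, \qquad
    p \text{ uniform on a region of volume } V \;\Rightarrow\; h = \ln V,

    \text{so as the distribution collapses onto a single known microstate } (V \to 0), \quad h \to -\infty .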
Or you may study statistical mechanics :-)
No, it is not about not knowing. This is an instance where the intuition from Shannon’s entropy does not translate to statistical Physics.
It is about the number of possible microstates, which is completely different. In Physics, entropy is a property of a bit of matter; it is not related to the observer or their knowledge. We can measure the enthalpy change of a material sample and work out its entropy without knowing a thing about its structure.
No, 0. In this case, there is a single state with p = 1, and S = -k Σ p ln(p) = 0.
This is the same if you consider the phase space because then it is reduced to a single point (you need a bit of distribution theory to prove it rigorously but it is somewhat intuitive).
The probability p of a microstate is always between 0 and 1, therefore p ln(p) is never positive and S is never negative.
You get the same using Boltzmann’s approach, in which case Ω = 1 and S = k ln(Ω) is also 0.
Gibbs’ entropy.
Indeed.
Enthalpy is also dependent on your choice of state variables, which is in turn dictated by which observables you want to make predictions about: whether two microstates are distinguishable, and thus whether they are part of the same macrostate, depends on the tools you have for distinguishing them.
Conditional on the known macrostate. Because we don’t know the precise microstate - only which microstates are possible.
If your reasoning is that « experimental entropy can be measured so it’s not about that » then it’s not about macrostates and microstates either!
Entropy is a macroscopic variable, and if you allow microscopic information, strange things can happen! You can move from a high-entropy macrostate to a low-entropy macrostate if you choose the initial microstate carefully. But this is not a reliable process which you can reproduce experimentally, i.e. it is not a thermodynamic process.
A thermodynamic process P is something which takes a macrostate A to a macrostate B, independent of which microstate a0, a1, a2, ... in A you started off with. If the process depended on the microstate, then it wouldn't be something we would recognize, since we are looking from the macro perspective.
That's actually the normal view; saying that information and stat-mech entropy are the same is the outlier position, most popularized by Jaynes.
If information-theoretical and statistical mechanics entropies are NOT the same (or at least, deeply connected) then what stops us from having a little guy[0] sort all the particles in a gas to extract more energy from them?
[0] https://en.wikipedia.org/wiki/Maxwell%27s_demon
Sounds like a non-sequitur to me; what are you implying about the Maxwell's demon thought experiment vs the comparison between Shannon and stat-mech entropy?
Entropy is a property of a distribution, but since math does sometimes get applied, we also attach distributions to things (e.g. the entropy of a random number generator, the entropy of a gas, ...). Then when we talk about the entropy of those things, those entropies are indeed subjective, because different subjects will attach different probability distributions to that system depending on their information about that system.
Some probability distributions are objective. The probability that my random number generator gives me a certain number is given by a certain formula. Describing it with another distribution would be wrong.
Another example, if you have an electron in a superposition of half spin-up and half spin-down, then the probability to measure up is objectively 50%.
Another example, GPT-2 is a probability distribution on sequences of integers. You can download this probability distribution. It doesn't represent anyone's beliefs. The distribution has a certain entropy. That entropy is an objective property of the distribution.
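A minimal sketch of that point (assuming the Hugging Face transformers and torch packages; the prompt is arbitrary) -- the next-token distribution, and therefore its entropy, is pinned down by the downloaded weights alone, not by anyone's beliefs:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    inputs = tokenizer("The entropy of this sentence is", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]          # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    entropy_bits = -(probs * torch.log2(probs.clamp_min(1e-12))).sum()
    print(entropy_bits.item())   # a fixed number, determined by the weights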
Of those, the quantum superposition is the only one that has a chance at being considered objective, and it's still only "objective" in the sense that (as far as we know) your description provided as much information as anyone can possibly have about it, so nobody can have a more-informed opinion and all subjects agree.
The others are both partial-information problems which are very sensitive to knowing certain hidden-state information. Your random number generator gives you a number that you didn't expect, and for which a formula describes your best guess based on available incomplete information, but the computer program that generated it knew which one to choose and would not have picked any other. Anyone who knew the hidden state of the RNG would also have assigned a different probability to that number being chosen.
A more plausible way to argue for objectiveness is to say that some probability distributions are objectively more rational than others given the same information. E.g. when seeing a symmetrical die it would be irrational to give 5 a higher probability than the others. Or it seems irrational to believe that the sun will explode tomorrow.
You might have some probability distribution in your head for what will come out of GPT-2 on your machine at a certain time, based on your knowledge of the random seed. But that is not the GPT-2 probability distribution, which is objectively defined by model weights that you can download, and which does not correspond to anyone’s beliefs.
The probability distribution is subjective in both cases -- because, once again, it depends on an observer observing events in order to build a probability distribution.
E.g. your random number generator generates 1, 5, 7, 8, 3 when you run it. It generates 4, 8, 8, 2, 5 when I run it. I.e. we have received different information about the random number generator and built different subjective probability distributions. The entropy of our probability distributions is high because we have so little information: we can't yet be confident that our samples are representative.
If we continue running our random number generator for a while, we will gather more information, thus reducing entropy, and our probability distributions will both start converging towards an objective "truth." If we ran our random number generators for a theoretically infinite amount of time, we would have reduced entropy to 0 and would have a perfect and objective probability distribution.
But this is impossible.
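A toy illustration of the convergence part (numpy; the generator's "true" distribution is made up) -- the gap between the empirical distribution and the generator's actual distribution shrinks as more samples come in:

    import numpy as np

    rng = np.random.default_rng(0)
    true_p = np.array([0.1, 0.2, 0.3, 0.4])   # the generator's actual distribution (made up)

    def kl_bits(p, q):
        # KL divergence D(p || q) in bits, skipping empty bins in p.
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

    for n in (10, 100, 10_000, 1_000_000):
        samples = rng.choice(len(true_p), size=n, p=true_p)
        empirical = np.bincount(samples, minlength=len(true_p)) / n
        print(n, kl_bits(empirical, true_p))   # shrinks toward 0 as n grows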
Would you say that all claims about the world are subjective, because they have to be based on someone’s observations?
For example my cat weighs 13 pounds. That seems objective, in the sense that if two people disagree, only one can be right. But the claim is based on my observations. I think your logic leads us to deny that anything is objective.
"Entropy is a property of matter that measures the degree of randomization or disorder at the microscopic level", at least when considering the second law.
Right, but the very interesting thing is it turns out that what's random to me might not be random to you! And the reason that "microscopic" is included is because that's a shorthand for "information you probably don't have about a system, because your eyes aren't that good, or even if they are, your brain ignored the fine details anyway."
Yeah but distributions are just the accounting tools to keep track of your entropy. If you are missing one bit of information about a system, your understanding of the system is some distribution with one bit of entropy. Like the original comment said, the entropy is the number of bits needed to fill in the unknowns and bring the uncertainty down to zero. Your coin flips may be unknown in advance to you, and thus you model it as a 50/50 distribution, but in a deterministic universe the bits were present all along.
What's often lost in the discussions about whether entropy is subjective or objective is that, if you dig a little deeper, information theory gives you powerful tools for relating the objective and the subjective.
Consider cross entropy of two distributions H[p, q] = -Σ p_i log q_i. For example maybe p is the real frequency distribution over outcomes from rolling some dice, and q is your belief distribution. You can see the p_i as representing the objective probabilities (sampled by actually rolling the dice) and the q_i as your subjective probabilities. The cross entropy is measuring something like how surprised you are on average when you observe an outcome.
The interesting thing is that H[p, p] <= H[p, q], which means that if your belief distribution is wrong, your cross entropy will be higher than it would be if you had the right beliefs, q=p. This is guaranteed by the concavity of the logarithm. This gives you a way to compare beliefs: whichever q gets the lowest H[p,q] is closer to the truth.
You can even break cross entropy into two parts, corresponding to two kinds of uncertainty: H[p, q] = H[p] + D[p||q]. The first term is the entropy of p and it is the aleatoric uncertainty, the inherent randomness in the phenomenon you are trying to model. The second term is the KL divergence and it tells you how much additional uncertainty you have as the result of having wrong beliefs, which you could call epistemic uncertainty.
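A quick numerical sketch of that decomposition (numpy; the distributions are made up):

    import numpy as np

    p = np.array([0.5, 0.25, 0.25])   # "true" outcome frequencies
    q = np.array([0.6, 0.2, 0.2])     # your belief distribution

    H_p  = -np.sum(p * np.log2(p))        # entropy of p: aleatoric uncertainty
    H_pq = -np.sum(p * np.log2(q))        # cross entropy: your average surprise
    D_pq =  np.sum(p * np.log2(p / q))    # KL divergence: cost of wrong beliefs

    print(H_p, H_pq, D_pq)
    print(np.isclose(H_pq, H_p + D_pq))   # True: H[p, q] = H[p] + D[p || q]
    print(H_pq >= H_p)                    # True: q != p can only increase surprise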
Thanks, that's an interesting perspective. It also highlights one of the weak points in the concept, I think, which is that this is only a tool for updating beliefs to the extent that the underlying probability space ("ontology" in this analogy) can actually "model" the phenomenon correctly!
It doesn't seem to shed much light on when or how you could update the underlying probability space itself (or when to change your ontology in the belief setting).
I think what you're getting at is the construction of the sample space - the space of outcomes over which we define the probability measure (e.g. {H,T} for a coin, or {1,2,3,4,5,6} for a die).
Let's consider two possibilities:
1. Our sample space is "incomplete"
2. Our sample space is too "coarse"
Let's discuss 1 first. Imagine I have a special die that has a hidden binary state which I can control, which forces the die to come up either even or odd. If your sample space is only which side faces up, and I randomize the hidden state appropriately, it appears like a normal die. If your sample space is enlarged to include the hidden state, the entropy of each roll is reduced by one bit. You will not be able to distinguish between a truly random die and a die with a hidden state if your sample space is incomplete. Is this the point you were making?
On 2: Now let's imagine I can only observe whether the die comes up even or odd. This is a coarse-graining of the sample space (we get strictly less information - or, we only get some "macro" information). Of course, a coarse-grained sample space is necessarily an incomplete one! We can imagine comparing the outcomes of a normal die to one that rolls an even or odd number with equal probability, except that within each parity it cycles through the microstates deterministically: an equal chance of {odd, even}, but given that outcome, it always moves to the next face in the sequence {(1->3->5), (2->4->6)}.
Incomplete or coarse sample spaces can indeed prevent us from inferring the underlying dynamics. Many processes can have the same apparent entropy on our sample space from radically different underlying processes.
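A toy sketch of case 2 (plain Python; the cycling rule is the one described above) -- the two processes are indistinguishable on the coarse even/odd sample space, and even their per-roll face frequencies match; the difference only shows up in the dynamics, where the cycling die's next face has just 1 bit of uncertainty given the history instead of ~2.58:

    import math
    import random
    from collections import Counter

    random.seed(0)

    def fair_die(n):
        return [random.randint(1, 6) for _ in range(n)]

    def cycling_die(n):
        # Parity is chosen fairly each roll, but within each parity the faces
        # cycle deterministically: 1 -> 3 -> 5 -> 1 and 2 -> 4 -> 6 -> 2.
        odd, even = [1, 3, 5], [2, 4, 6]
        i = j = 0
        out = []
        for _ in range(n):
            if random.random() < 0.5:
                out.append(odd[i]); i = (i + 1) % 3
            else:
                out.append(even[j]); j = (j + 1) % 3
        return out

    def empirical_entropy(xs):
        counts, n = Counter(xs), len(xs)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    for rolls in (fair_die(100_000), cycling_die(100_000)):
        parity = ["even" if r % 2 == 0 else "odd" for r in rolls]
        print(empirical_entropy(parity), empirical_entropy(rolls))
        # ~1.0 bit and ~2.58 bits for both processes: the marginals agree,
        # so no single-roll statistic on this sample space separates them.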
Right, this is exactly what I'm getting at - learning a distribution over a fixed sample space can be done with Bayesian methods, or entropy-based methods like the OP suggested, but I'm wondering if there are methods that can automatically adjust the sample space as well.
For well-defined mathematical problems like dice rolling and fixed classical mechanics scenarios and such, you don't need this I guess, but for any real-world problem I imagine half the problem is figuring out a good sample space to begin with. This kind of thing must have been studied already, I just don't know what to look for!
There are some analogies to algorithms like NEAT, which automatically evolves a neural network architecture while training. But that's obviously a very different context.
We could discuss completeness of the sample space, and we can also discuss completeness of the hypothesis space.
In Solomonoff Induction, which purports to be a theory of universal inductive inference, the "complete hypothesis space" consists of all computable programs (note that all current theories of physics are computable, so this hypothesis space is very general). Then induction is performed by keeping all programs consistent with the observations, weighted by two terms: the program's prior likelihood, and the probability that program assigns to the observations (the programs can be deterministic and assign probability 1).
The "prior likelihood" in Solomonoff Induction is the program's complexity (well, 2^(-Complexity), where the complexity is the length of the shortest representation of that program.
Altogether, the procedure looks like: maintain a belief which is a mixture of all programs consistent with the observations, weighted by their complexity and the likelihood they assign to the data. Of course, this procedure is still limited by the sample/observation space!
That's our best formal theory of induction in a nutshell.
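A toy sketch of that procedure in Python, with a tiny hand-picked hypothesis space and invented complexities standing in for "all computable programs" (the real thing is uncomputable):

    # Hypotheses: name -> (complexity in bits, deterministic predictor history -> next bit).
    hypotheses = {
        "all zeros":            (2, lambda h: 0),
        "alternating":          (3, lambda h: len(h) % 2),
        "repeat 0,0,1":         (5, lambda h: [0, 0, 1][len(h) % 3]),
        "repeat 0,0,1,0,0,1,1": (9, lambda h: [0, 0, 1, 0, 0, 1, 1][len(h) % 7]),
    }

    observations = [0, 0, 1, 0, 0, 1]

    # Keep every program consistent with the data, weighted by 2^(-complexity).
    # (Deterministic programs assign likelihood 1 if consistent, 0 otherwise.)
    weights = {name: 2.0 ** (-k)
               for name, (k, predict) in hypotheses.items()
               if all(predict(observations[:i]) == x
                      for i, x in enumerate(observations))}

    total = sum(weights.values())
    posterior = {name: w / total for name, w in weights.items()}
    print(posterior)
    # {'repeat 0,0,1': ~0.94, 'repeat 0,0,1,0,0,1,1': ~0.06}:
    # the simpler consistent program dominates the prediction for the next bit.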
This kind of thinking will lead you to ideas like algorithmic probability, where distributions are defined using universal Turing machines that could model anything.
Amazing! I had actually heard about Solomonoff induction before but my brain didn't make the connection. Thanks for the shortcut =)
Couldn't you just add a control (PID/Kalman filter/etc) to coverage on a stability of some local "most" truth?
Could you elaborate? To be honest I have no idea what that means.
You can sort of do this over a suitably large (or infinite) family of models all mixed, but from an epistemological POV that’s pretty unsatisfying.
From a practical POV it’s pretty useful and common (if you allow it to describe non- and semi-parametric models too).
Trivial example: if you know the seed of a pseudo-random number generator, a sequence generated by it has very low entropy.
But if you don't know the seed, the entropy is very high.
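A minimal sketch of that contrast (plain Python; the seed, range, and seed-space size are arbitrary):

    import math
    import random

    # Knowing the seed: the whole sequence is determined -- zero bits of entropy for you.
    random.seed(42)
    print([random.randint(0, 9) for _ in range(5)])   # fully predictable given seed=42

    # Not knowing the seed: if all you know is "one of N equally likely seeds",
    # your uncertainty about the output is at most log2(N) bits, however long the sequence.
    N = 2 ** 32
    print(math.log2(N))   # 32.0 bits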
Theoretically, it's still only the entropy of the seed-space + time-space it could have been running in, right?
Shannon's entropy is a property of the source-channel-receiver system.
Can you explain this in more detail?
Entropy is calculated as a function of a probability distribution over possible messages or symbols. The sender might have a distribution P over possible symbols, and the receiver might have another distribution Q over possible symbols. Then the "true" distribution over possible symbols might be another distribution yet, call it R. The mismatch between these is what leads to various inefficiencies in coding, decoding, etc [1]. But both P and Q are beliefs about R -- that is, they are properties of observers.
[1] https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Co...
Baez has a video (accompanying, imho), with slides
https://m.youtube.com/watch?v=5phJVSWdWg4&t=17m
He illustrates the derivation of Shannon entropy with pictures of trees
https://archive.is/9vnVq
Shannon entropy is subjective for Bayesians and objective for frequentists.
To shorten this for you with my own (identical) understanding: "entropy is just the name for the bits you don't have".
Entropy + Information = Total bits in a complete description.
It's an objective quantity, but you have to be very precise in stating what the quantity describes.
Unbroken egg? Low entropy. There's only one way the egg can exist in an unbroken state, and that's it. You could represent the state of the egg with a single bit.
Broken egg? High entropy. There are an arbitrarily-large number of ways that the pieces of a broken egg could land.
A list of the locations and orientations of each piece of the broken egg, sorted by latitude, longitude, and compass bearing? Low entropy again; for any given instance of a broken egg, there's only one way that list can be written.
Zip up the list you made? High entropy again; the data in the .zip file is effectively random, and cannot be compressed significantly further. Until you unzip it again...
Likewise, if you had to transmit the (uncompressed) list over a bandwidth-limited channel. The person receiving the data can make no assumptions about its contents, so it might as well be random even though it has structure. Its entropy is effectively high again.
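A quick sketch of the zip point (Python's zlib; the "list" is fabricated, and single-byte frequencies only capture part of the structure):

    import math
    import zlib
    from collections import Counter

    def byte_entropy(data):
        # Empirical entropy of the byte-frequency distribution, in bits per byte.
        counts, n = Counter(data), len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    # A highly structured, repetitive "list of piece locations"...
    report = ("piece 17: lat 51.5072, lon -0.1276, bearing 083\n" * 1000).encode()
    packed = zlib.compress(report)

    print(len(report), byte_entropy(report))       # big, and well below 8 bits/byte
    print(len(packed), byte_entropy(packed))       # tiny, and the bytes look nearly random
    print(byte_entropy(zlib.decompress(packed)))   # unzip it, and the structure is back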