**Big data**

A petabyte is a lot of information. But how many petabytes does it take to *completely describe one gram of water?*

Let's see:

A **bit** is the information in one binary decision: a no or yes, a 0 or 1.

• 5 bits: approximate information in one letter of the Roman alphabet.

A **byte** is 8 bits.

A **kilobyte** is about a thousand bytes (actually 1024 of them).

• 2 kilobytes: a typewritten page.

• 100 kilobytes: a low-resolution photograph.

A **megabyte** is about a million bytes.

• 1 megabyte: a small novel or a 3.5 inch floppy disk.

• 2 megabytes: a high-resolution photograph.

• 5 megabytes: the complete works of Shakespeare.

• 500 megabytes: a CD-ROM.

A **gigabyte** is about a billion bytes.

• 1.25 gigabytes: the human genome, or a pickup truck full of books.

• 20 gigabytes: a good collection of the works of Beethoven.

• 100 gigabytes: a library floor of academic journals.

A **terabyte** is about a trillion bytes.

• 2 terabytes: an academic research library.

• 6 terabytes: all academic journals printed in 2002.

• 10 terabytes: the print collections of the U.S. Library of Congress.

• 40 terabytes: all books printed in 2002.

• 60 terabytes: all audio CDs released in 2002.

• 80 terabytes: capacity of all floppy discs produced in 2002.

• 140 terabytes: all newspapers printed in 2002.

• 170 terabytes: the searchable part of the World-Wide Web in 2002.

• 250 terabytes: capacity of all zip drives produced in 2002.

A **petabyte** is about 10^15 bytes.

• 1.5 petabytes: all office documents generated in 2002.

• 2 petabytes: all U.S. academic research libraries.

• 6 petabytes: all cinema release films in 2002.

• 90 petabytes: the "Deep Web" in 2002.

• 130 petabytes: capacity of all audio tapes produced in 2002.

• 400 petabytes: all photographs taken in 2002.

• 440 petabytes: all emails sent in 2002.

An **exabyte** is about 10^18 bytes.

• 1.3 exabytes: capacity of all videotapes produced in 2002.

• 2 exabytes: capacity of all hard disks produced in 2002.

• 5 exabytes: all the words ever spoken by human beings.

• 9 exabytes: the genomes of all people living in 2015.

A **zettabyte** is about 10^21 bytes.

• 50 zettabytes: the information needed to completely describe the state of a gram of water at room temperature.

So, the answer is:

It takes about 50,000,000 petabytes to completely describe one gram of water, down to the positions and velocities of the individual subatomic particles...

... limited, of course, by the Heisenberg uncertainty principle! That's what makes the amount of information *finite*.

How can we calculate this? It sounds hard, but it's not if you look up a few numbers.

First of all, the entropy of water! At room temperature (25 degrees Celsius) and normal pressure (1 atmosphere), the entropy of a mole of water is 69.91 joules per kelvin.

To understand this, first you need to know that chemists like moles. By a 'mole', I don't mean that fuzzy creature that ruins your lawn: I mean a certain ridiculously large number of molecules or atoms, invented to deal with the fact that even a tiny little thing is made of lots of atoms. By definition, a **mole** is about the number of atoms in one gram of hydrogen.

This number is about 6.022 × 10^23. People now call it **Avogadro's number**, after a guy named Avogadro. So, a mole of water means 6.022 × 10^23 molecules of water. And since a water molecule is about 18 times heavier than a hydrogen atom, this is about 18 grams of water.

So, if we prefer grams to moles, the entropy of a gram of water is 69.91/18 = 3.88 joules per kelvin. By the way, I don't want to explain why entropy is measured in joules per kelvin: that's another fun story.
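Here is that division as a tiny Python sketch. The constant and function names are just ones I made up for illustration; the two numbers are the ones quoted above.

```python
# Molar entropy of liquid water at 25 degrees Celsius and 1 atmosphere,
# in joules per kelvin per mole (value quoted in the text).
MOLAR_ENTROPY_J_PER_K = 69.91

# Rounded mass of one mole of water, in grams (value quoted in the text).
GRAMS_PER_MOLE = 18.0


def entropy_per_gram(molar_entropy, grams_per_mole):
    """Return the entropy of one gram, in joules per kelvin."""
    return molar_entropy / grams_per_mole


s_gram = entropy_per_gram(MOLAR_ENTROPY_J_PER_K, GRAMS_PER_MOLE)
print(f"{s_gram:.2f} joules per kelvin per gram")
```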

But what does all this have to do with information? Well, Boltzmann, Shannon and others figured out how entropy and information are related, and the formula is pretty simple: one nat of information equals 1.3806 × 10^(-23) joules per kelvin of entropy. This number is called **Boltzmann's constant**.

What's a 'nat' of information? Well, bits of information are a good unit when you're using binary notation (0's and 1's), but trits would be a good unit if you were using base 3, and so on. For physics the most natural unit is a **nat**, where we use base e. So, 'nat' stands for 'natural'.

Don't get in a snit over the fact that we can't actually write numbers using base e — if you do, I'll just say you're nitpicking, or natpicking! The point is, information in the physical world is not binary — so base e turns out to be the best.
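If you'd like to see how these units relate, here is a minimal Python sketch. Switching units is just switching the base of a logarithm, so you divide by the natural log of the base. The helper name `nats_to` is my own invention, not a standard function.

```python
import math


def nats_to(amount_in_nats, base):
    """Convert an amount of information in nats to units that use `base`.

    base=2 gives bits, base=3 gives trits, and so on.
    """
    return amount_in_nats / math.log(base)


one_nat_in_bits = nats_to(1.0, 2)    # roughly 1.44 bits
one_nat_in_trits = nats_to(1.0, 3)   # roughly 0.91 trits
one_bit_in_nats = math.log(2)        # roughly 0.693 nats

print(one_nat_in_bits, one_nat_in_trits, one_bit_in_nats)
```

So a nat is a bit more than a bit, and a bit less than a trit.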

Okay: so, by taking the reciprocal of Boltzmann's constant we see that one joule per kelvin of entropy equals 7.24 × 10^22 nats of information.

That's all we need to look up. We can now just multiply and see that a gram of water (at room temperature and pressure) holds

3.88 × 7.24 × 10^22 = 2.81 × 10^23 nats

of information. In other words, this is how much information it takes to completely specify the state of one gram of water.

Or if you prefer bits, use the fact that a bit equals ln(2) or 0.693 nats. Dividing by this, we see a gram of water holds

4.05 × 10^23 bits

of information. Dividing by 8 gives about 5 × 10^22 bytes, or roughly 50 zettabytes. And amazingly, this is something we know quite precisely! I've rounded off the numbers, but we could actually work it out to more decimal places if we wanted.
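If you want to check the powers of ten yourself, here is the whole calculation as a short Python sketch. The function and variable names are mine, and I've assumed the exact SI value of Boltzmann's constant rather than the rounded one above; the entropy and the 18 grams per mole are the figures from the text.

```python
import math

# Boltzmann's constant: joules per kelvin of entropy per nat of information.
# This is the exact value fixed by the 2019 SI definitions (an assumption
# of this sketch; the text uses a rounded figure).
BOLTZMANN_J_PER_K = 1.380649e-23

MOLAR_ENTROPY_J_PER_K = 69.91   # one mole of water at 25 C, 1 atm (from the text)
GRAMS_PER_MOLE = 18.0           # rounded mass of a mole of water (from the text)


def information_in_one_gram():
    """Return the information content of one gram of water in several units."""
    entropy = MOLAR_ENTROPY_J_PER_K / GRAMS_PER_MOLE   # joules per kelvin
    nats = entropy / BOLTZMANN_J_PER_K                 # entropy -> nats
    bits = nats / math.log(2)                          # one bit is ln(2) nats
    n_bytes = bits / 8                                 # one byte is 8 bits
    return {
        "nats": nats,
        "bits": bits,
        "bytes": n_bytes,
        "petabytes": n_bytes / 1e15,
        "zettabytes": n_bytes / 1e21,
    }


result = information_in_one_gram()
for unit, value in result.items():
    print(f"{value:.3e} {unit}")
```

I've used powers of ten for petabytes and zettabytes, matching the "about 10^15" and "about 10^21" definitions in the list above.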

If you want to learn more about this, study **statistical mechanics**: that's where physics meets information theory.

A bunch of my figures came from here:

• Peter Lyman, Hal R. Varian, Kirsten Swearingen, *et al.*, How much information? 2003, http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

and the chart originally came from here:

http://mozy.com/blog/misc/how-much-is-a-petabyte/

though it was edited by folks at Gizmodo. All this stuff and more is on this page of mine:

http://math.ucr.edu/home/baez/information.html

#informationtheory #bigness


- You're right +Pol Nasam, that low entropy states are not interesting either. They're trivially predictable and provide negligible information.

What I mean by structure can be formally (enough) stated as those regularities that make compression (or work in physical systems) possible. Higher entropy states are unlearnable because there is no structure to capture/leverage.

----

+John Baez I find Kolmogorov complexity to be a more useful and flexible notion than entropy and one can easily get from one to the other.

I was going to mention logical depth too. I found it tricky to understand but naively, ignoring issues of stability (a good definition is stable to changes in program size), it mainly says that interesting things should take longer to compute when compared to their length, with time lower bounded by length.

Interesting things have both intermediate algorithmic probability (related to its Komplexity) as well as high computational complexity for all programs that can be reasonably found with good probability. So while a human has lower descriptive complexity than an onion, they take far more resources to compute and are hence more interesting than onions.

-------

Interestingly +Toby Bartels, regarding learnability, it seems most functions are either easily learned or not interesting.

> This perspective leads to conclusions which are at odds with common beliefs regarding clustering. This applies, in particular, to the computational hardness of clustering.

> The infeasibility of optimizing most of the popular objectives led many theoreticians, to the bleak view that clustering is hard. However, we show that in many circumstances a good clustering can be efficiently found, leading to the opposite conclusion.

> From the practitioner's viewpoint, "clustering is either easy or pointless" -- that is, whenever the input admits a good clustering, finding it is feasible. Our analysis provides some support to this view.

> This work is one of several recent attempts to develop a mathematical theory of clustering. For more on the relevant literature, see Section 4.

_Clustering is difficult only when it does not matter_

No matter how intelligent you are, you're still going to be stumped by mean NP-hard problems. Apr 14, 2015

- +Deen Abiola "[...]we show that in many circumstances a good clustering can be efficiently found, leading to the opposite conclusion."

Who is "we"? Are these quotations from a paper of yours? Or someone else's? Any link? Apr 15, 2015

- Nice ideas. I think entropy is energy lost within closed systems. Data is the information produced by entropy, telling physical aspects of the system. Maybe, entropy in open systems is the information about interaction of dissipative systems. Maybe, at macroscopic level there is no entropy. Therefore there is no information about the whole system... Jun 3, 2015
- +Rafael Melgarejo - entropy has different units than energy, so it can't be true that "entropy is energy lost..." - that would be like "momentum is energy that..." Jun 4, 2015
- ... And, does information have the same units as entropy? Jun 5, 2015
- +Rafael Melgarejo - no, but people convert from information to entropy simply by multiplying by a constant called Boltzmann's constant, so they're not very different (except that entropy refers to information you *don't* know). Boltzmann's constant has units of energy/temperature. Information is dimensionless (it's measured in bits). Entropy has units of energy/temperature. So, the way you convert between entropy and energy depends on the temperature. Jun 5, 2015
