**Big data**

A petabyte is a lot of information. But how many petabytes does it take to *completely describe one gram of water?*

Let's see:

A **bit** is the information in one binary decision: a no or yes, a 0 or 1.

• 5 bits: approximate information in one letter of the Roman alphabet.
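The 5-bit figure can be checked with a quick sketch. It assumes 26 equally likely letters and ignores case, spaces and punctuation; real text carries somewhat less information per letter, since letters are not equally likely.

```python
import math

# Assumption: 26 equally likely letters, no case, spaces or punctuation.
ALPHABET_SIZE = 26

# Information in one choice among N equally likely options is log2(N) bits.
bits_per_letter = math.log2(ALPHABET_SIZE)

# Whole bits needed to give every letter its own fixed-length binary code.
whole_bits = math.ceil(bits_per_letter)

print(f"{bits_per_letter:.2f} bits per letter, {whole_bits} bits for a fixed-length code")
```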

A **byte** is 8 bits.

A **kilobyte** is about a thousand bytes (actually 1024 of them).

• 2 kilobytes: a typewritten page.

• 100 kilobytes: a low-resolution photograph.

A **megabyte** is about a million bytes.

• 1 megabyte: a small novel or a 3.5 inch floppy disk.

• 2 megabytes: a high-resolution photograph.

• 5 megabytes: the complete works of Shakespeare.

• 500 megabytes: a CD-ROM.

A **gigabyte** is about a billion bytes.

• 1.25 gigabytes: the human genome, or a pickup truck full of books.

• 20 gigabytes: a good collection of the works of Beethoven.

• 100 gigabytes: a library floor of academic journals.

A **terabyte** is about a trillion bytes.

• 2 terabytes: an academic research library.

• 6 terabytes: all academic journals printed in 2002.

• 10 terabytes: the print collections of the U.S. Library of Congress.

• 40 terabytes: all books printed in 2002.

• 60 terabytes: all audio CDs released in 2002.

• 80 terabytes: capacity of all floppy disks produced in 2002.

• 140 terabytes: all newspapers printed in 2002.

• 170 terabytes: the searchable part of the World-Wide Web in 2002.

• 250 terabytes: capacity of all zip drives produced in 2002.

A **petabyte** is about 10^15 bytes.

• 1.5 petabytes: all office documents generated in 2002.

• 2 petabytes: all U.S. academic research libraries.

• 6 petabytes: all cinema release films in 2002.

• 90 petabytes: the "Deep Web" in 2002.

• 130 petabytes: capacity of all audio tapes produced in 2002.

• 400 petabytes: all photographs taken in 2002.

• 440 petabytes: all emails sent in 2002.

An **exabyte** is about 10^18 bytes.

• 1.3 exabytes: capacity of all videotapes produced in 2002.

• 2 exabytes: capacity of all hard disks produced in 2002.

• 5 exabytes: all the words ever spoken by human beings.

• 9 exabytes: all the genomes of every living person in 2015.

A **zettabyte** is about 10^21 bytes.

• 50 zettabytes: the information needed to completely describe the state of a gram of water at room temperature.

So, the answer is:

It takes about 50,000,000 petabytes to completely describe one gram of water, down to the positions and velocities of the individual subatomic particles...

... limited, of course, by the Heisenberg uncertainty principle! That's what makes the amount of information *finite*.

How can we calculate this? It sounds hard, but it's not if you look up a few numbers.

First of all, the entropy of water! At room temperature (25 degrees Celsius) and normal pressure (1 atmosphere), the entropy of a mole of water is 69.91 joules per kelvin.

To understand this, first you need to know that chemists like moles. By a 'mole', I don't mean that fuzzy creature that ruins your lawn: I mean a certain ridiculously large number of molecules or atoms, invented to deal with the fact that even a tiny little thing is made of lots of atoms. By definition, a **mole** is about the number of atoms in one gram of hydrogen.

A guy named Avogadro figured out that this number is about 6.022 × 10^23. People now call this **Avogadro's number**. So, a mole of water means 6.022 × 10^23 molecules of water. And since a water molecule is 18 times heavier than a hydrogen atom, this is 18 grams of water.

So, if we prefer grams to moles, the entropy of a gram of water is 69.91/18 = 3.88 joules per kelvin. By the way, I don't want to explain why entropy is measured in joules per kelvin: that's another fun story.
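Here is a minimal sketch of that division, using the rounded figures quoted above. The function name is just an illustration, and 18 grams per mole is the rounded value from the text rather than the more precise molar mass of water.

```python
# Figures quoted in the text.
MOLAR_ENTROPY_WATER = 69.91   # joules per kelvin per mole, at 25 °C and 1 atm
GRAMS_PER_MOLE_WATER = 18.0   # rounded, as in the text (assumption: not the more precise 18.015)


def entropy_per_gram(molar_entropy, grams_per_mole):
    """Convert an entropy per mole into an entropy per gram."""
    return molar_entropy / grams_per_mole


s_gram = entropy_per_gram(MOLAR_ENTROPY_WATER, GRAMS_PER_MOLE_WATER)
print(f"{s_gram:.2f} joules per kelvin per gram")
```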

But what does all this have to do with information? Well, Boltzmann, Shannon and others figured out how entropy and information are related, and the formula is pretty simple: one nat of information equals 1.3806 × 10^(-23) joules per kelvin of entropy. This number is called **Boltzmann's constant**.

What's a 'nat' of information? Well, bits of information are a good unit when you're using binary notation (0's and 1's), but trits would be a good unit if you were using base 3, and so on. For physics the most natural unit is a **nat**, where we use base e. So, 'nat' stands for 'natural'.

Don't get in a snit over the fact that we can't actually write numbers using base e — if you do, I'll just say you're nitpicking, or natpicking! The point is, information in the physical world is not binary — so base e turns out to be the best.
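Switching between bits, trits and nats is just a change of logarithm base. Here is a small sketch of that idea; the helper `convert_information` is a made-up name for illustration, not something from a library.

```python
import math


def convert_information(amount, from_base, to_base):
    """Convert an amount of information between units defined by a logarithm base.

    Base 2 gives bits, base 3 gives trits, base e gives nats.
    One unit in base b is worth ln(b) nats.
    """
    nats = amount * math.log(from_base)
    return nats / math.log(to_base)


one_bit_in_nats = convert_information(1, 2, math.e)
one_nat_in_bits = convert_information(1, math.e, 2)
one_trit_in_bits = convert_information(1, 3, 2)

print(f"1 bit  = {one_bit_in_nats:.3f} nats")
print(f"1 nat  = {one_nat_in_bits:.3f} bits")
print(f"1 trit = {one_trit_in_bits:.3f} bits")
```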

Okay: so, by taking the reciprocal of Boltzmann's constant we see that one joule per kelvin of entropy equals 7.24 × 10^22 nats of information.

That's all we need to look up. We can now just multiply and see that a gram of water (at room temperature and pressure) holds

3.88 × 7.24 × 10^22 = 2.81 × 10^23 nats

of information. In other words, this is how much information it takes to completely specify the state of one gram of water.

Or if you prefer bits, use the fact that a bit equals ln(2) or .693 nats. Dividing by this, we see a gram of water holds

4.05 × 10^23 bits

of information. And amazingly, this is something we know quite precisely! I've rounded off the numbers, but we could actually work it out to more decimal places if we wanted.
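If you'd like to check the chain of conversions yourself, here is a rough sketch that goes from entropy per gram to nats, bits, bytes and zettabytes. It assumes the exact SI value of Boltzmann's constant, 1.380649 × 10^(-23) joules per kelvin, rather than a rounded figure, and takes a zettabyte to be 10^21 bytes as in the list above.

```python
import math

# Assumption: exact SI value of Boltzmann's constant, in joules per kelvin per nat.
BOLTZMANN = 1.380649e-23

# Entropy of one gram of water from the text, in joules per kelvin.
ENTROPY_PER_GRAM = 3.88

# Entropy divided by Boltzmann's constant gives information in nats.
nats = ENTROPY_PER_GRAM / BOLTZMANN

# One bit is ln(2) nats, so divide by ln(2) to get bits.
bits = nats / math.log(2)

# Eight bits per byte, 10^21 bytes per zettabyte.
num_bytes = bits / 8
zettabytes = num_bytes / 1e21

print(f"{nats:.3g} nats, {bits:.3g} bits, {zettabytes:.1f} zettabytes")
```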

If you want to learn more about this, study **statistical mechanics**: that's where physics meets information theory.

A bunch of my figures came from here:

• Peter Lyman, Hal R. Varian, Kirsten Swearingen, *et al*, How much information? 2003, http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

and the chart originally came from here:

http://mozy.com/blog/misc/how-much-is-a-petabyte/

though it was edited by folks at Gizmodo. All this stuff and more is on this page of mine:

http://math.ucr.edu/home/baez/information.html

#informationtheory #bigness


- +Deen Abiola "[...]we show that in many circumstances a good clustering can be efficiently found, leading to the opposite conclusion." Who is "we"? Are these quotations from a paper of yours? That of others'? Any link? Apr 15, 2015
- Nice ideas. I think entropy is energy lost within closed systems. Data is the information produced by entropy, telling physical aspects of the system. Maybe, entropy in open systems is the information about interaction of dissipative systems. Maybe, at macroscopic level there is no entropy. Therefore there is no information about the whole system... Jun 3, 2015
- +Rafael Melgarejo - entropy has different units than energy, so it can't be true that "entropy is energy lost..." - that would be like "momentum is energy that..." Jun 4, 2015
- ... And, does information have the same units than entropy? Jun 5, 2015
- +Rafael Melgarejo - no, but people convert from information to entropy simply by multiplying by a constant called Boltzmann's constant, so they're not very different (except that entropy refers to information you *don't* know). Boltzmann's constant has units of energy/temperature. Information is dimensionless (it's measured in bits). Entropy has units of energy/temperature. So, the way you convert between entropy and energy depends on the temperature. Jun 5, 2015
- +Tristan Steven - if there are 14,106 languages in total, it would take only log_2(14,106) < 14 bits to say which language was being used before listing all written works in that specific language, a fairly small overhead. Most of the bits would be spent encoding the texts of each specific language in binary. 16w
