How Much Data?

November 29, 2014

Webopedia cites IDC research showing that we (presumably meaning humanity, all of civilization) produced 2.8 zettabytes of data in 2012. (That’s 2.8 trillion gigabytes, for those who couldn’t remember where “zetta” falls on the scale of hugeness.) In what may be a corollary to Moore’s Law, IDC also says that the total amount of data in the world doubles every 18 months and that we will therefore be at 40 zettabytes by 2020. Meanwhile, keeping it more businessy, Gartner projects that the total amount of enterprise data worldwide will increase 650% in the next five years.
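For anyone who wants to check the compounding, here is a rough back-of-the-envelope sketch in Python. It assumes simple exponential growth from the 2.8 zettabyte figure for 2012; the function names are my own, made up for illustration, and are not taken from IDC or Webopedia.

```python
import math

def project_volume(start_zb, years, doubling_years):
    """Project a data volume forward, assuming it doubles every `doubling_years`."""
    return start_zb * 2 ** (years / doubling_years)

def implied_doubling_years(start_zb, end_zb, years):
    """Work backwards: what doubling period takes start_zb to end_zb in `years`?"""
    return years * math.log(2) / math.log(end_zb / start_zb)

# Figures quoted above: 2.8 ZB in 2012, 40 ZB projected for 2020.
strict_18_months = project_volume(2.8, years=8, doubling_years=1.5)
implied_period = implied_doubling_years(2.8, 40, years=8)

print(f"Strict 18-month doubling, 2012 to 2020: {strict_18_months:.0f} ZB")
print(f"Doubling period implied by 40 ZB in 2020: {implied_period:.1f} years")
```

Under these assumptions, a strict 18-month doubling would land at roughly 113 zettabytes by 2020, while a 40 zettabyte endpoint implies a doubling period of a little over two years. So the two quoted figures probably rest on slightly different assumptions. Either way, it is a lot.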

Another fun way to look at data growth is to consider all the infrastructure required to support it. Steve Ballmer says that Microsoft — not exactly the first name you think of when you think of big data — has a million servers out there.

A million. Seven figures. The oldest stats I can find without, you know, really looking show that about 40 years ago, the total number of computers sold each year was 50,000. I doubt there were even a million computers in the world at that time. Now that’s how many computers one company owns.

Meanwhile, in order to provide an answer to a burning question about punch cards, XKCD has put together an estimate showing Google probably has somewhere between 1.8 and 2.4 million servers. And even they might not be the biggest. NSA might have more.

Which does raise an interesting question: why would it take the NSA more servers to catalog all of my personal data than it does Google? Must be government inefficiency rearing its ugly head.

How much data? The short answer is a LOT. I was just writing about the race “between the explosive growth of data that the Internet of Things and other big data drivers are bringing about, and the substantial reduction that columnar databases, in-memory processing, and other technological developments can bring about.”

As it stands now, I would say that big data has a growing, but perhaps not yet insurmountable, lead in that race. Data volumes are, in a sense, relative. I can remember when a megabyte was a lot of data. Today, not so much. Our capacity to store and access data effectively shrinks it.

And there is something even more important at work:

The data flow so fast that the total accumulation of the past two years…dwarfs the prior record of human civilization. “There is a big data revolution,” says Weatherhead University Professor Gary King. But it is not the quantity of data that is revolutionary. “The big data revolution is that now we can do something with the data.”

The revolution lies in improved statistical and computational methods, not in the exponential growth of storage or even computational capacity, King explains. The doubling of computing power every 18 months (Moore’s Law) “is nothing compared to a big algorithm”—a set of rules that can be used to solve a problem a thousand times faster than conventional computational methods could. One colleague, faced with a mountain of data, figured out that he would need a $2-million computer to analyze it. Instead, King and his graduate students came up with an algorithm within two hours that would do the same thing in 20 minutes—on a laptop: a simple example, but illustrative.

Now that is doing more with less!
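To put King’s comparison in rough numbers, here is a small sketch of how long hardware alone would take to deliver a thousandfold speedup. It assumes that computing power doubles cleanly every 18 months and that the gain is a flat factor of 1,000. The function name is hypothetical, and real hardware and real algorithms are messier than this.

```python
import math

def years_of_doubling_to_match(speedup, doubling_years=1.5):
    """Years of steady doubling needed before hardware alone matches `speedup`."""
    return math.log2(speedup) * doubling_years

years = years_of_doubling_to_match(1000)
print(f"A 1,000x speedup is worth about {years:.0f} years of 18-month doublings")
```

By this estimate, a better algorithm that is a thousand times faster is worth roughly 15 years of waiting for faster machines.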

So the question we should be asking is maybe not so much “How much data is there?” but rather “How much data can we use effectively?” or even better “How much more value are we deriving from data — any amount of data — than we did before?” The growth curves that answer those two questions are the real story of big data.

  • stevebang

    Goethe said this in 1810, but it applies even more today:

    “The modern age has a false sense of superiority, because of the great mass of data at its disposal. But the valid criterion of distinction is rather the extent to which man knows how to form and master the material at his command.”

  • Stephen Kagan

    The Sumerians had cuneiform tablets that we unearth in archeological digs. Will there be a future where our descendants excavate down to our primitive strata of data objects?