Monthly Archives: November 2014

How Much Data?

Webopedia cites IDC research showing that we — presumably meaning humanity, all of civilization — produced 2.8 zettabytes of data in 2012. (That’s 2.8 trillion gigabytes, for those who couldn’t remember where “zetta” falls on the scale of hugeness.) In what may be a corollary to Moore’s Law, IDC also says that the total amount of data in the world doubles every two years and that we will therefore be at 40 zettabytes by 2020. Meanwhile, keeping it more businessy, Gartner projects that the total amount of enterprise data worldwide will increase 650% in the next five years.
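If you want to check that arithmetic yourself, it is just compounding: start from IDC’s 2.8 zettabytes in 2012, double every two years, and you land right around their 2020 projection. A quick Python sketch (the starting point and doubling period are IDC’s; the rest is arithmetic):

```python
# Sanity check of IDC's projection: 2.8 ZB in 2012,
# doubling roughly every two years, through 2020.
START_ZB = 2.8        # zettabytes in 2012 (IDC's figure)
DOUBLING_YEARS = 2    # IDC's stated doubling period

for year in range(2012, 2021, 2):
    projected = START_ZB * 2 ** ((year - 2012) / DOUBLING_YEARS)
    print(f"{year}: ~{projected:.1f} ZB")

# 2020: ~44.8 ZB, which IDC rounds to "40 zettabytes"
```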

Another fun way to look at data growth is to consider all the infrastructure required to support it. Steve Ballmer says that Microsoft — not exactly the first name you think of when you think of big data — has a million servers out there.

A million. Seven figures. The oldest stats I can find (without, you know, really looking) show that about 40 years ago, the total number of computers sold each year was 50,000. I doubt there were even a million computers in the world at that time. Now that’s how many computers one company owns.

Meanwhile, in order to answer a burning question about punch cards, xkcd has put together an estimate showing that Google probably has somewhere between 1.8 and 2.4 million servers. And even they might not be the biggest: the NSA might have more.

Which does raise an interesting question: why would it take the NSA more servers to catalog all of my personal data than it does Google? Must be government inefficiency rearing its ugly head.

How much data? The short answer is a LOT. I was just writing about the race “between the explosive growth of data that the Internet of Things and other big data drivers are bringing about, and the substantial reduction that columnar databases, in-memory processing, and other technological developments can bring about.”

As it stands now, I would say that big data has a growing, but perhaps not yet insurmountable, lead in that race. Data volumes are, in a sense, relative. I can remember when a megabyte was a lot of data. Today, not so much. Our capacity to store and access data effectively shrinks it.

And there is something even more important at work:

The data flow so fast that the total accumulation of the past two years…dwarfs the prior record of human civilization. “There is a big data revolution,” says Weatherhead University Professor Gary King. But it is not the quantity of data that is revolutionary. “The big data revolution is that now we can do something with the data.”

The revolution lies in improved statistical and computational methods, not in the exponential growth of storage or even computational capacity, King explains. The doubling of computing power every 18 months (Moore’s Law) “is nothing compared to a big algorithm”—a set of rules that can be used to solve a problem a thousand times faster than conventional computational methods could. One colleague, faced with a mountain of data, figured out that he would need a $2-million computer to analyze it. Instead, King and his graduate students came up with an algorithm within two hours that would do the same thing in 20 minutes—on a laptop: a simple example, but illustrative.

Now that is doing more with less!
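King doesn’t say what problem his colleague was facing, so here is a toy Python illustration of the same principle. The duplicate-counting task and the function names are mine, invented for the example, not his:

```python
import time
from collections import Counter

# Toy illustration: count duplicate pairs in a list, two ways.
data = list(range(5_000)) * 2   # 10,000 items; every value appears twice

def count_pairs_naive(items):
    """O(n^2): compare every pair -- the 'buy a $2-million computer' route."""
    count = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                count += 1
    return count

def count_pairs_hashed(items):
    """O(n): one pass with a hash table -- the 'laptop in 20 minutes' route."""
    return sum(c * (c - 1) // 2 for c in Counter(items).values())

for fn in (count_pairs_hashed, count_pairs_naive):
    start = time.perf_counter()
    result = fn(data)
    print(f"{fn.__name__}: {result} pairs in {time.perf_counter() - start:.3f}s")

# The hashed version is typically thousands of times faster, and the gap
# widens quadratically as the data grows. No new hardware required.
```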

So the question we should be asking is maybe not so much “How much data is there?” but rather “How much data can we use effectively?” or even better “How much more value are we deriving from data — any amount of data — than we did before?” The growth curves that answer those two questions are the real story of big data.

An Evolutionary Approach

While my recent observation that Data Is Eating Us may have come off as tongue-in-cheek, the reality behind it is no joke. Most people aren’t (yet) transforming their basic bodily functions in order to have more time to analyze data, but there is no question that the fundamental dynamic between human beings and data is changing rapidly. Writing at Forbes, Teradata’s Oliver Ratzesberger explains why:

Most computational neuroscientists estimate that the human brain’s storage capacity is somewhere between 10 and 100 terabytes. Compare that to a worldwide data explosion – already at more than 1.8 trillion gigabytes and doubling every two years – and you begin to understand the analytics “pain points” our industry is grappling with.

For one thing, we spend the majority of our time just sifting through data instead of making decisions. We’re constantly on our heels in reaction mode, putting out fires instead of thinking about the future. And we can’t seem to make decisions fast enough, given that our brains don’t scale the way data can. [Emphasis added.]

Exactly. It is that difference not only in scale but in scalability that has kicked off the entire big data movement / phenomenon / whatever-you-want-to-call-it. After all, what do we mean by “big” data? We mean data that is bigger than…

  1. …we expected.
  2. …we were ready for.
  3. …we know what to do with.

The three (or four, or however many) V’s of big data are all about this core difference. Data volumes expand beyond our storage and handling capacity; data velocity outpaces our ability to respond to it, much less deal with it proactively; data variety confounds not only our existing systems, but our core business processes and the concepts they are built on.
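To make the scale mismatch concrete, here is a quick back-of-the-envelope comparison using Ratzesberger’s own figures (the numbers are his; the arithmetic, and the admittedly crude brains-as-units framing, are mine):

```python
# Ratzesberger's figures: the brain holds maybe 10-100 TB; the world's
# data is ~1.8 trillion GB (1.8 billion TB) and doubles every two years.
BRAIN_TB = 100          # generous upper estimate for one brain, in TB
WORLD_DATA_TB = 1.8e9   # 1.8 trillion GB, expressed in TB

ratio = WORLD_DATA_TB / BRAIN_TB
print(f"Today: world data ~= {ratio:,.0f} brains' worth")  # ~18,000,000

for years in (2, 10, 20):
    print(f"In {years:2d} years: ~{ratio * 2 ** (years / 2):,.0f} brains' worth")

# The denominator never moves. That is the scalability gap.
```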

Even if data isn’t eating us, it is outgrowing us. In response, we try to keep up and, if possible, get ahead. A dazzling array of new approaches, new technologies, and new players in the field offer hope, but will they be enough? How do we counter that fundamental difference in scalability?

Oliver’s answer, developed with Dr. Mohan Sawhney of the Kellogg School of Management, is a new approach called the Sentient Enterprise:

The Sentient Enterprise is an enterprise that can listen to data, conduct analysis and make autonomous decisions at massive scale in real-time. The Sentient Enterprise can listen to data to sense micro-trends. It can act as one organism without being impeded by information silos. It can make autonomous decisions with little or no human intervention. It is always evolving, with emergent intelligence that becomes progressively more sophisticated.

No, this is not “I, for one, welcome our new robot overlords.” At least not exactly. It is more along the lines of “If you can’t beat ‘em, join ‘em.”

It’s an evolutionary approach. If the Sentient Enterprise is an organism, it represents a new species formed by the symbiosis of two separate species. Yes, that happens sometimes. But this is unlike, say, having two closely related species of fly produce an exciting new species of fly, or the proposed merger of grizzly bears and polar bears that you may have read about. The emergence of the Sentient Enterprise represents a much more fundamental shift. In evolutionary biology, there is a theory called symbiogenesis, which holds that early single-celled organisms merged into more complex cellular structures that eventually allowed for the development of the plants and animals we have today. (Animals incorporated mitochondria, while plants merged with chloroplasts.)

Symbiogenesis is one of the biggest milestones in the history of life. Had it not occurred, all life on earth would pretty much be variants of bacteria.

In the case of the Sentient Enterprise, the two organisms that are merging are humans, with their non-scalable brains, and the whole infrastructure for managing the organization’s data, including the data itself. Of course, we’re not literally merging with those systems the way we literally have mitochondria embedded in every cell of our bodies (at least not yet), but I think it is safe to say that when Oliver and Mohan describe the Sentient Enterprise as an organism (or, for that matter, when they say that it is sentient), they are engaging in more than just analogy.

As I have noted recently, it is likewise more than just an analogy to say that data is transforming our world and that its reach beyond the realm of the abstract into the physical world is becoming increasingly significant. The Sentient Enterprise could well become one of the focal points for this ongoing (indeed, accelerating) transformation.

For more background, here is Oliver telling the whole Sentient Enterprise story:

The (Shrinking) Growing Data Footprint

At the recent SAP TechEd && d-code in Berlin (and, no — for those unfamiliar, there are no typos in my presentation of the event’s name), Bernd Leukert, member of the Executive Board of SAP SE for Products & Innovation, led a keynote session touching on several of the themes I have been writing about here recently. Using as a guide Nicholas Negroponte’s vision as outlined in his book Being Digital (1995), Leukert makes the case that we are, indeed, in transition “from a world made out of stuff to a world made out of data…and stuff” — to quote my recent re-articulation of Negroponte’s basic idea.

And he adds an interesting wrinkle.

Where I have been making the case that, part and parcel with the big data phenomenon, the data footprint of real-world stuff is growing exponentially (even as “stuff” as we know it becomes smaller and less substantial in almost every other regard), Leukert makes the case that new in-memory database technologies are actually going to shrink the data footprints of businesses by eliminating data indices and aggregates.

On the one hand, this is hardly an unfamiliar argument. Before in-memory was a thing, the columnar database vendors made very similar claims for data warehouses running on, say, Vertica or Sybase IQ. With a columnar database, the argument went, you could make any query into the data and get an answer back fast without having to create all those copies of the data, which ultimately is what summaries, indices, aggregates, cubes, and even data marts are. So your data warehouse could become a lot smaller and a lot faster. Win-win!

Now Leukert expands that argument to SAP HANA environments, showing how reducing indices and aggregates across both the operational and analytical data within an organization can significantly shrink the overall enterprise data footprint. He gives the example of a typical financial booking, which updates 15 separate database records; in a simplified in-memory enterprise environment, that number can be reduced to four. He goes on to claim that SAP itself has already managed a 14X reduction in its data footprint, with a 30X reduction expected overall.
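Leukert doesn’t walk through the schema record by record, so here is a schematic Python sketch of the general mechanism. The table layout and the per-posting write counts are mine, invented for illustration; this is emphatically not SAP’s actual data model:

```python
from collections import defaultdict

# --- Traditional model: each posting also updates redundant copies ---
ledger = []                             # the bookings themselves
totals_by_account = defaultdict(float)  # precomputed aggregate table
totals_by_period = defaultdict(float)   # another precomputed aggregate

def post_traditional(account, period, amount):
    ledger.append((account, period, amount))  # write 1: the booking
    totals_by_account[account] += amount      # write 2: aggregate update
    totals_by_period[period] += amount        # write 3: aggregate update
    # ...a real system adds index rows, cubes, and so on

# --- In-memory columnar model: store the booking once, aggregate on read ---
accounts, periods, amounts = [], [], []       # one list per column

def post_columnar(account, period, amount):
    accounts.append(account)                  # the only writes are the
    periods.append(period)                    # booking's own fields
    amounts.append(amount)

def total_for_account(account):
    # A fast scan of two columns replaces the precomputed totals table.
    return sum(amt for acct, amt in zip(accounts, amounts) if acct == account)

post_columnar("4711", "2014-11", 100.0)
post_columnar("4711", "2014-11", 50.0)
print(total_for_account("4711"))              # 150.0
```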

Those are massive reductions, and should map to massive savings. And so an interesting race begins, between the explosive growth of data that the Internet of Things and other big data drivers are bringing about, and the substantial reduction that columnar databases, in-memory processing, and other technological developments can bring about. Which side will win? Maybe we really can do more with less. Or maybe these technologies simply help to curb the otherwise uncontrollable growth of big data.

Stay tuned.

Oh, here is Leukert (and associates’) entire talk. Most of the stuff about Negroponte is at the beginning, but he does come back to it at least once in the middle and then again at the end. At nearly an hour and 45 minutes, this is not as trying on the patience as many keynotes I have endured. Plus if you watch the whole thing, you’ll learn about the Internet of Toilets.

And, no, I’m not making that up!

Data Is Eating Us

Is analyzing big data more fun than eating? Well, it might just be. For some, at least.

Anyway, that is one of the premises of Platfora’s recent Soylent giveaway promotion. For those who need catching up: Platfora is a Hadoop-native big data analytics platform. Soylent is an instant meal replacement, designed to provide 100% of the body’s nutritional requirements while doing away with all that distracting and time-consuming “eating” that humans are compelled to keep doing. Where these two meet is in the lives of busy data scientists and hard-core analysts. As the Platfora blog explains it:

Hunger demands that you go right then and heat up that frozen burrito immediately. No get out of jail free card for you, my friend.

What if I told you there was another way? A magical way to throw off the Shackles of Mealtime and the depression of time-sucking, sad cafeteria lunches. A way to be free to revel in the world of data limitlessly, without the constraints of a growling stomach and hungry mind.

And that “way,” of course, is the consumption of Soylent rather than the burrito. To quote the promotional video from the Soylent home page:

Unlike most other foods which prioritize taste and texture, Soylent was engineered to maximize nutrition, to nourish the body in the most efficient way possible.

No shopping, no cooking, no figuring out what goes with what or worrying about whether you’re keeping things in balance. In the video, we meet the creator of Soylent, an engineer who has taken on human nutrition as an engineering problem, one that can be broken down into its constituent parts. In this case, the “parts” that make up nutrition are chemicals. So the solution to human nutrition is ultimately a formula.

It’s another brilliant example of datafication in action. Previously we saw how a jet engine could become smaller, cleaner, quieter, and more powerful by changing the relationship between its physical components and its data component. And we looked at the almost magical process that can transform a room full of devices into a single device that fits in the palm of your hand: a smartphone. Now apply that same magic to one of the fundamental physical processes of human survival, and voilà! Soylent.

But the Platfora promotion takes it even further than that. Why datify the process of eating? One obvious reason: so you can spend more time working with data.

In the movie from which Soylent takes its name, the surprise ending (spoilers ahead) is that people are eating other people. Yikes, that’s terrifying. But that isn’t what’s happening here. In the world of Soylent, people are eating data — or at least food that leverages the maximum value of its data component.

At the same time, it is becoming increasingly apparent that data is eating us. (Some might say that software is eating us, but I say same difference.) Or if not eating us, it is at least getting the upper hand in the relationship. Here we have data maximizing the efficiency of a core human bodily function so that we might better attend to data and its needs.

Sure, a lot of people will tell you that they have no interest in using Soylent. And even among Soylent users, freeing up time to allow for more data analysis is only one of many motivations. Yes, the data is working for us. But increasingly, it seems to have us working for it. Who exactly is running this show?

Stay tuned.