The Speculist » Big Data

Where the Possibilities Are

Phil Bowermaster — Wed, 28 Jan 2015 22:36:37 +0000

Where does the value of big data truly present itself, in the data itself or in the algorithms we use to make sense of it? Bill Franks of Teradata comes down sharply on the side of the data:

…I’m convinced that new information will beat new algorithms and new metrics based on existing information almost every time. Indeed, new information can be so powerful that, once it is found, analytics professionals should stop worrying about improving existing models with existing data and focus instead on incorporating and testing that new information.

By “new information,” he means information that didn’t exist before or that we now have at a level of depth never before possible. Sensor data in Internet of Things environments can represent either kind. For example, we may have always used temperature data in performing some calculation, but back in the day we used a daily average. Now we have sensors providing temperature data every few minutes (or seconds). That’s data to a greater depth. For data that we didn’t have before, Bill cites sensors on cars that track wear and tear as the vehicle is driven. Previously, vehicle repair occurred in a primarily reactive way. Now we can begin to anticipate repairs before they are needed.
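To make the temperature example concrete, here is a minimal sketch of how a daily average can hide something that finer-grained readings reveal. The readings and the alert threshold are invented for illustration; nothing here comes from Bill Franks’s article.

```python
from statistics import mean

# Hypothetical temperature readings for one day, in degrees Celsius.
# The values are made up for illustration only.
readings = [18.0, 17.5, 19.0, 41.0, 20.0, 18.5]

# Hypothetical alert threshold (an assumption, not from the post).
THRESHOLD = 35.0

# The "back in the day" view: one number per day.
daily_average = mean(readings)

# The sensor-era view: every individual reading is available.
peak = max(readings)

average_trips_alert = daily_average > THRESHOLD
readings_trip_alert = any(r > THRESHOLD for r in readings)

print(f"daily average: {daily_average:.1f} C, alert: {average_trips_alert}")
print(f"peak reading:  {peak:.1f} C, alert: {readings_trip_alert}")
```

The single spike disappears into the average but is obvious in the individual readings. No cleverer algorithm run on the daily average alone could recover it, which is the sense in which new information beats new models built on old information.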

Somehow this reminds me of a talk that Eliezer Yudkowsky gave at the Singularity Summit back in 2007. He said:

In the intelligence explosion the key threshold is criticality of recursive self-improvement. It’s not enough to have an AI that improves itself a little. It has to be able to improve itself enough to significantly increase its ability to make further self-improvements, which sounds to me like a software issue, not a hardware issue. So there is a question of, Can you predict that threshold using Moore’s Law at all?

Geordie Rose of D-Wave Systems recently was kind enough to provide us with a startling illustration of software progress versus hardware progress. Suppose you want to factor a 75-digit number. Would you rather have a 2007 supercomputer, IBM’s Blue Gene/L, running an algorithm from 1977, or a 1977 computer, an Apple II, running a 2007 algorithm? And Geordie Rose calculated that Blue Gene/L with 1977’s algorithm would take ten years, and an Apple II with 2007’s algorithm would take three years.

There is a progression here, albeit a counter-intuitive one. We might be inclined to think that hardware adds more value than “mere” software and that software is inherently more valuable than “mere” (or, to use the term we like to throw around a lot, “raw”) data. The opposite turns out to be the truth. The data itself is where the value is. Hardware and software only help us to focus on the potentialities, the possibilities, that it already contains.

How Much Data?

Phil Bowermaster — Sat, 29 Nov 2014 22:32:50 +0000

Webopedia cites IDC research showing that we (presumably meaning humanity, all of civilization) produced 2.8 zettabytes of data in 2012. (That’s 2.8 trillion gigabytes, for those who couldn’t remember where “zetta” falls on the scale of hugeness.) In what may be a corollary to Moore’s Law, IDC also says that the total amount of data in the world doubles every 18 months and that we will therefore be at 40 zettabytes by 2020. Meanwhile, keeping it more businessy, Gartner projects that the total amount of enterprise data worldwide will increase 650% in the next five years.
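As a quick sanity check on those growth figures, here is a small sketch of the doubling arithmetic. It takes the quoted numbers at face value (2.8 ZB in 2012, 40 ZB in 2020) and assumes smooth exponential growth, which is a simplification; the function names are just for illustration.

```python
import math

# Figures quoted above, taken at face value.
START_YEAR, START_ZB = 2012, 2.8
TARGET_YEAR, TARGET_ZB = 2020, 40.0


def projected_zb(year, doubling_years):
    """Project total data (in ZB) assuming it doubles every `doubling_years`."""
    elapsed = year - START_YEAR
    return START_ZB * 2 ** (elapsed / doubling_years)


def implied_doubling_years():
    """Doubling period that would take START_ZB to TARGET_ZB by TARGET_YEAR."""
    doublings = math.log2(TARGET_ZB / START_ZB)
    return (TARGET_YEAR - START_YEAR) / doublings


at_18_months = projected_zb(TARGET_YEAR, 1.5)
implied = implied_doubling_years()

print(f"2020 total with an 18-month doubling: {at_18_months:.0f} ZB")
print(f"doubling period implied by 40 ZB in 2020: {implied:.2f} years")
```

Run this way, an 18-month doubling overshoots 40 ZB by a wide margin (it lands above 100 ZB), while 40 ZB by 2020 corresponds to a doubling period of roughly two years. So the two quoted figures do not quite agree with each other, though either way the curve is steep.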

Another fun way to look at data growth is to consider all the infrastructure required to support it. Steve Ballmer says that Microsoft — not exactly the first name you think of when you think of big data — has a million servers out there.

A million. Seven figures. The oldest stats I can find without, you know, really looking show that about 40 years ago, the total number of computers sold each year was 50,000. I doubt there were even a million computers in the world at that time. Now that’s how many computers one company owns.

Meanwhile, in order to provide an answer to a burning question about punch cards, XKCD has put together an estimate showing Google probably has somewhere between 1.8 and 2.4 million servers. And even they might not be the biggest. NSA might have more.

Which does raise an interesting question: why would it take the NSA more servers to catalog all of my personal data than it does Google? Must be government inefficiency rearing its ugly head.

How much data? The short answer is a LOT. I was just writing about the race “between the explosive growth of data that the Internet of Things and other big data drivers are bringing about, and the substantial reduction that columnar databases, in-memory processing, and other technological developments can bring about.”

As it stands now, I would say that big data has a growing, but perhaps not yet insurmountable, lead in that race. Data volumes are, in a sense, relative. I can remember when a megabyte was a lot of data. Today, not so much. Our capacity to store and access data effectively shrinks it.

And there is something even more important at work:

The data flow so fast that the total accumulation of the past two years…dwarfs the prior record of human civilization. “There is a big data revolution,” says Weatherhead University Professor Gary King. But it is not the quantity of data that is revolutionary. “The big data revolution is that now we can do something with the data.”

The revolution lies in improved statistical and computational methods, not in the exponential growth of storage or even computational capacity, King explains. The doubling of computing power every 18 months (Moore’s Law) “is nothing compared to a big algorithm”—a set of rules that can be used to solve a problem a thousand times faster than conventional computational methods could. One colleague, faced with a mountain of data, figured out that he would need a $2-million computer to analyze it. Instead, King and his graduate students came up with an algorithm within two hours that would do the same thing in 20 minutes—on a laptop: a simple example, but illustrative.

Now that is doing more with less!

So the question we should be asking is maybe not so much “How much data is there?” but rather “How much data can we use effectively?” or even better “How much more value are we deriving from data — any amount of data — than we did before?” The growth curves that answer those two questions are the real story of big data.

An Evolutionary Approach

Phil Bowermaster — Tue, 25 Nov 2014 17:35:37 +0000

While my recent observation that Data Is Eating Us may have come off as tongue-in-cheek, the reality behind it is no joke. Most people aren’t (yet) transforming their basic bodily functions in order to have more time to analyze data, but there is no question that the fundamental dynamic between human beings and data is changing rapidly. Writing at Forbes, Teradata’s Oliver Ratzesberger explains why:

Most computational neuroscientists estimate that the human brain’s storage capacity is somewhere between 10 and 100 terabytes. Compare that to a worldwide data explosion – already at more than 1.8 trillion gigabytes and doubling every two years – and you begin to understand the analytics “pain points” our industry is grappling with.

For one thing, we spend the majority of our time just sifting through data instead of making decisions. We’re constantly on our heels in reaction mode, putting out fires instead of thinking about the future. And we can’t seem to make decisions fast enough, given that our brains don’t scale the way data can. [Emphasis added.]

Exactly. It is that difference not only in scale but in scalability that has kicked off the entire big data movement / phenomenon / whatever-you-want-to-call-it. After all, what do we mean by “big” data? We mean data that is bigger than…

…we expected.
…we were ready for.
…we know what to do with.

The three (or four, or however many) V’s of big data are all about this core difference. Data volumes expand beyond our storage and handling capacity; data velocity outpaces our ability to respond to it, much less deal with it proactively; data variety confounds not only our existing systems, but our core business processes and the concepts they are built on.

Even if data isn’t eating us, it is outgrowing us. In response, we try to keep up and, if possible, get ahead. A dazzling array of new approaches, new technologies, and new players in the field offer hope, but will they be enough? How do we counter that fundamental difference in scalability?

Oliver’s answer, developed with Dr. Mohan Sawhney of the Kellogg School of Management, is a new approach called the Sentient Enterprise:

The Sentient Enterprise is an enterprise that can listen to data, conduct analysis and make autonomous decisions at massive scale in real-time. The Sentient Enterprise can listen to data to sense micro-trends. It can act as one organism without being impeded by information silos. It can make autonomous decisions with little or no human intervention. It is always evolving, with emergent intelligence that becomes progressively more sophisticated.

No, this is not “I, for one, welcome our new robot overlords.” At least not exactly. It is more along the lines of “If you can’t beat ’em, join ’em.”

It’s an evolutionary approach. If the Sentient Enterprise is an organism, it represents a new species formed by the symbiosis of two separate species. Yes, that happens sometimes. But this is unlike, say, having two closely related species of fly producing an exciting new species of fly, or the proposed merger of grizzly bears and polar bears that you may have read about. The emergence of the Sentient Enterprise represents a much more fundamental shift. In evolutionary biology, there is a theory called symbiogenesis, which states that early single-celled organisms merged into more complex cellular structures that eventually allowed for the development of the plants and animals we have today. (The ancestors of both animals and plants incorporated mitochondria; plants later took on chloroplasts as well.)

Symbiogenesis is one of the biggest milestones in the history of life. Had it not occurred, all life on earth would pretty much be variants on bacteria.

In the case of the sentient enterprise, the two organisms that are merging are the humans with their non-scalable brains and the whole infrastructure for managing the organization’s data, which includes the data itself. Of course we’re not literally merging with those systems in the same way that we literally have mitochondria embedded in every cell in our bodies (at least not yet), but I think that it is safe to say that when Oliver and Mohan describe the Sentient Enterprise as an organism (or for that matter, when they say that it is sentient), they are engaging in more than just analogy.

As I have noted recently, it is likewise more than just an analogy to say that data is transforming our world and that its reach beyond the realm of the abstract into the physical world is becoming increasingly significant. The Sentient Enterprise could well become one of the focal points for this ongoing (indeed, accelerating) transformation.

—

For more background, here is Oliver telling the whole Sentient Enterprise story:

The (Shrinking) Growing Data Footprint

Phil Bowermaster — Sat, 22 Nov 2014 19:14:54 +0000

At the recent SAP Teched && decode in Berlin (and, no — for those unfamiliar, there are no typos in my presentation of the event’s name) Bernd Leukert, a member of the executive board of SAP SE Products and Innovation, led a keynote session touching on several of the themes I have been writing about here recently. Using as a guide Nicholas Negroponte’s vision as outlined in his book Being Digital (1996), Leukert makes the case that we are, indeed, in transition “from a world made out of stuff to a world made out of data…and stuff” — to quote my recent re-articulation of Negroponte’s basic idea.

And he adds an interesting wrinkle.

Where I have been making the case that, part and parcel with the big data phenomenon, the data footprint of real-world stuff is growing exponentially (even as “stuff” as we know it becomes smaller and less substantial in almost every other regard) Leukert makes the case that new in-memory database technologies are actually going to shrink the data footprints of businesses, by eliminating data indices and aggregates.

On the one hand, this is hardly an unfamiliar argument. Before in-memory was a thing, the columnar database vendors made very similar claims for data warehouses running on, say, Vertica or Sybase IQ. With a columnar database, the argument went, you could run any query against the data and get an answer back fast without having to create all these copies of the data, which ultimately is what summaries, indexes, aggregates, cubes and even data marts are. So your data warehouse could become a lot smaller and a lot faster. Win-win!
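For readers who haven’t lived through the columnar pitch, here is a toy sketch of the idea in Python. It is only an illustration of the layout difference, not how Vertica, Sybase IQ, or any real engine is implemented (real products add compression, vectorized execution, and much more). The sample records and the helper names `to_columns` and `sum_by` are invented for the example.

```python
# Row-oriented layout: each record is stored together.
# Sample data is made up for illustration.
rows = [
    {"region": "east", "product": "a", "amount": 120.0},
    {"region": "west", "product": "b", "amount": 80.0},
    {"region": "east", "product": "b", "amount": 30.5},
    {"region": "north", "product": "a", "amount": 45.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_by(columns, key_col, value_col):
    """Aggregate on the fly, reading only the two columns the query needs."""
    totals = {}
    for key, value in zip(columns[key_col], columns[value_col]):
        totals[key] = totals.get(key, 0.0) + value
    return totals


columns = to_columns(rows)

# No stored summary table: the aggregate is computed at query time,
# and the "product" column is never touched.
totals_by_region = sum_by(columns, "region", "amount")
print(totals_by_region)
```

The point of the sketch is the last step: if scanning just the needed columns is fast enough, the precomputed summary (one more copy of the data) doesn’t have to exist at all.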

Now Leukert expands that argument, referring to SAP HANA environments, showing how the reduction of indices and aggregates from both the operational and analytical data within an organization can lead to a significant reduction of the overall enterprise data footprint. He gives the example of a typical financial booking which updates 15 separate database records, showing that in a simplified in-memory enterprise environment that number can be reduced to four database records. He goes on to claim that SAP has itself managed a 14X reduction in data footprint, with a 30X reduction expected overall.

Those are massive reductions, and should map to massive savings. And so an interesting race begins, between the explosive growth of data that the Internet of Things and other big data drivers are bringing about, and the substantial reduction that columnar databases, in-memory processing, and other technological developments can bring about. Which side will win? Maybe we really can do more with less. Or maybe these technologies simply help to curb the otherwise uncontrollable growth of big data.

Stay tuned.

—

Oh, here is Leukert (and associates’) entire talk. Most of the stuff about Negroponte is at the beginning, but he does come back to it at least once in the middle and then again at the end. At nearly an hour and 45 minutes, this is not as trying on the patience as many keynotes I have endured. Plus if you watch the whole thing, you’ll learn about the Internet of Toilets.

And, no, I’m not making that up!

Data Is Eating Us

Phil Bowermaster — Sat, 15 Nov 2014 14:10:22 +0000

Is analyzing big data more fun than eating? Well, it might just be. For some, at least.

Anyway, that is one of the premises of Platfora’s recent Soylent giveaway promotion. For those who need catching up: Platfora is a Hadoop-native big data analytics platform. Soylent is an instant meal replacement, designed to provide 100% of the body’s nutritional requirements while doing away with all that distracting and time-consuming “eating” that humans are compelled to keep doing. Where these two meet is in the lives of busy data scientists and hard-core analysts. As the Platfora blog explains it:

Hunger demands that you go right then and heat up that frozen burrito immediately. No get out of jail free card for you, my friend.

What if I told you there was another way? A magical way to throw off the Shackles of Mealtime and the depression of time-sucking, sad cafeteria lunches. A way to be free to revel in the world of data limitlessly, without the constraints of a growling stomach and hungry mind.

And that “way,” of course, is the consumption of Soylent rather than the burrito. To quote the promotional video from the Soylent home page:

Unlike most other foods which prioritize taste and texture, Soylent was engineered to maximize nutrition, to nourish the body in the most efficient way possible.

No shopping, no cooking, no figuring out what goes with what or worrying about whether you’re keeping things in balance. In the video, we meet the creator of Soylent, an engineer who has taken on human nutrition as an engineering problem, one that can be broken down to its constituent parts. In this case, the “parts” that make up nutrition are chemicals. So the solution to human nutrition is ultimately a formula.

It’s another brilliant example of datafication in action. Previously we saw how a jet engine could become smaller, cleaner, quieter, and more powerful through changing the relationship between its physical components and its data component. And we looked at the almost magical process that can transform a room full of devices into a single device that fits in the palm of your hand: a smart phone. Now apply that same magic to one of the fundamental physical processes of human survival, and voila! Soylent.

But the Platfora promotion takes it even further than that. Why datify the process of eating? One obvious reason: so you can spend more time working with data.

In the movie from which Soylent takes its name, the surprise ending (spoilers ahead) is that people are eating other people. Yikes, that’s terrifying. But that isn’t what’s happening here. In the world of Soylent, people are eating data — or at least food that leverages the maximum value of its data component.

At the same time, it is becoming increasingly apparent that data is eating us. (Some might say that software is eating us, but I say same difference.) Or if not eating us, it is at least getting the upper hand in the relationship. Here we have data maximizing the efficiency of a core human bodily function so that we might better attend to data and its needs.

Sure, a lot of people will tell you that they have no interest in using Soylent. And even among Soylent users, freeing up time to allow for more data analysis is only one of many motivations. Yes, the data is working for us. But increasingly, it seems to have us working for it. Who exactly is running this show?

Stay tuned.

Datafication in Three Easy Steps

Phil Bowermaster — Sat, 20 Sep 2014 04:48:25 +0000

The relentless wave of change that is transforming our world from being one made primarily out of stuff to one made primarily out of data has a name. It’s called datafication.

Over the past few decades, we have witnessed the datafication of business, of society, and of everyday life. There appear to be three major phases of datafication. In the first phase, an activity or process becomes increasingly reliant on data. In the second, data begins to transform the activity or process by taking a central role in its execution. In the third phase, the activity is moved entirely into the data substrate.

Take the movie business. Putting artistic considerations aside, the success of any film has always been a measurement of how much revenue it generates. Originally, this was a pretty straightforward matter of counting box office receipts. (Today, what with many and varied distribution channels and considerations such as licensing and merchandising that often come into play, the math for calculating success is considerably more complex.) The film industry entered the first phase of datafication relatively early on, as studios began trying to develop formulas for repeat box office success.

The data points were, at first, relatively few and far between: geographic differences in box office; one star’s draw vs. another; westerns vs. romances vs. war movies vs. musicals; summer releases vs. Christmas releases. Over time, the analysis evolved in terms of sophistication until the industry reached the second phase of datafication. This is how we came to live in an age of scripts written for a target adolescent male audience and re-edits and even rewrites following test screenings. The data began to drive the process.

But data wasn’t done with the movies yet. The film industry is moving rapidly into the third phase of datafication. Once upon a time, filmmakers made films. Long strips of celluloid with images on them. We’ve all heard of efforts to preserve decaying movies from the early part of the last century. Film was a chemical and mechanical process resulting in a physical artifact. But not today. The product of the film-making process is now essentially a data artifact. Movies are consumed over digital networks on TVs, laptops, and smartphones. And, in fact, they can now be made entirely on smartphones. Short messages, tweets, motion pictures…it’s all the same. It’s all data.

The big data revolution is ultimately about this kind of transformation in all sectors of all industries. The movie and music businesses are obvious examples of industries that have made it at least part of the way to phase three. But then so is the telecommunications industry. Shipping and logistics have become as much about data as they are about moving stuff around. Even manufacturing is moving in that direction, and will continue to do so as digital fabrication and 3D printing become increasingly mainstream.

Right now the world as a whole is really just beginning to move from phase one to phase two. Data is beginning to influence and direct the world in ways never before considered. And we are still in the very early days.

Bigger than We Realize

Phil Bowermaster — Sat, 06 Sep 2014 17:57:34 +0000

I think maybe big data is being under-hyped.

That’s right. Under.

And, yes, I know how ridiculous that sounds. And I know how suspect it sounds coming from a guy who spent all those years in product marketing, specifically marketing a product with strong big data tendencies (although we didn’t use that word to position it — or at least I never did.)

Come on now: isn’t big data being hyped enough already? It’s not like a few years ago, when so many were uncertain as to what the term meant. People get it now. They know what big data is. In fact, at least one major survey shows that big data has pretty much become mainstream. Everybody is doing it. And, interestingly, even as people have come to know what it is, to accept that it exists, and to engage in big data projects…they still don’t much care for the term.

If everybody is doing it now, what need could there be for further hype? If anything, maybe we can and should be talking about it a bit less now that it has gone mainstream. Besides, if people don’t care for the term now, more hype cycles aren’t going to do much to help, are they?

Probably not. But I’m not suggesting that big data needs more hype because I want more people to use it or to like it. I just want them to be more aware of it. Going by this source, I’m probably using the second definition of the word hype. I think we need to create greater interest even if we have to use “flamboyant or dramatic methods.”

Why?

Because big data is only getting bigger. And it is only becoming more deeply embedded in our everyday experiences. Moreover, it is changing the world itself. I realize that might sound a little overly dramatic and / or flamboyant, but let’s take a look at this. Consider this example provided a while back by Irfan Khan, head of the Global Database and Technology organization for SAP:

General Electric (GE) has recently announced substantial changes to the design of the CFM Leap aircraft engine, which powers the Airbus A320neo, Boeing 737 Max and COMAC C919 aircraft. The new generation Leap is “designed to provide significant reductions in fuel burn, noise, and NOx emissions compared to the current… engine.” It is designed to generate 32K pounds of thrust, achieve a 99.87% reliability rate, and introduce a $3 million operating saving annually.

Where will these savings come from? New sensors intricately track how the engine is operating. The use of data fundamentally transforms how the engine operates and makes it more efficient. But that efficiency requires a lot of data. The new version of the Leap aircraft engine generates 1 TB per day from those sensors alone. Add in avionics, traffic data, weather data… a massive amount of information is generated just from taking a flight. In previous versions, the Leap engine has completed more than 18 million commercial hours of operation, with some 22,000 of the engines manufactured. So we’re talking about a lot of data.

In every way but one, this engine now operates with a smaller footprint: it requires less fuel, it makes less noise, it generates fewer noxious emissions, it costs less to operate. Only in one area, data, is its footprint expanding. [Emphasis added.]

An aircraft engine becomes smaller, cleaner, and more efficient. These changes in the physical properties of the engine have been achieved by generating, manipulating, and responding to data.

Of course, if this one jet engine were the only example of such a shift, it wouldn’t be terribly persuasive. But the examples are everywhere. A few years ago there was a lot of discussion about how the new smartphones were replacing so many other devices. You no longer needed a digital camera, a music player, a GPS system. The phone did it all. Where did all those separate devices go? Their physical footprints were dramatically reduced; their data footprints took up the slack.

Meanwhile, consider how many of your interactions with others, how much of what you do and think and communicate, how much of your self is now closely associated with that same device. What is a human life made of? Lots of things, obviously. But increasingly, one of the biggest component parts of our lives is the data component. Like that jet engine, we’re growing bigger and bigger data footprints.

We seem to be transitioning from a world made out of stuff to a world made out of data…and stuff.

Mostly Data?

A while back I wrote a piece about de-industrialization, describing how capability that once belonged to large institutions is passing into the hands of everyday people. The most prominent examples of this phenomenon have taken place in the film and recording industries. Those same smartphones that swallowed all the other devices are now being used to make movies — the kind it used to take a whole studio to make (including one that got an Oscar nomination.)

Stephen Gordon’s In the Future, Everything Will Be a Coffee Shop is premised on this same shift from a world made primarily of stuff to a world made primarily(?) of data. R. Buckminster Fuller described a process that he called ephemeralization, whereby “you do more and more with less and less until eventually you do everything with nothing.” We are definitely doing more with less these days, although doing everything with nothing remains some distance ahead.

But as we use less stuff — less energy, matter, time and space — to do the things we do, we are using more and more data. Big data really is changing the world around us. We need to be aware of this process, and try to understand it. So let’s have some more big data hype.

It’s Pronounced “COL-um-nar”

Phil Bowermaster — Wed, 11 Jun 2014 19:21:11 +0000

Oracle is making a big, big splash with the release of Oracle 12C, touting it as “the future of the database.” If you’re as interested in the future as I am, you are probably wondering what, exactly, the future of the database will look like. In the video below, Larry Ellison lays it all out for us, but allow me to summarize. The future of the database is:

Columnar
In-memory

Okay, great.

Um, well…

Okay, this is awkward. But aren’t those ideas pretty much already implemented in databases that exist today? Actually, as I have probably mentioned on occasion, I used to work for a company that made one of those columnar deals. And it is hardly the only one.

(A quick side note on terminology. To tell you the truth, I always preferred the term “column-based,” but “columnar” seems to be the term that stuck. That’s fine. But it is not pronounced “co-LOOM-nar,” okay? Who says that? I think of the fun the old IQ team would have had with the fact that Oracle, literally, can’t even pronounce “columnar!”)

As for in-memory… These folks claim to have been doing it for a while. As have these folks. And there’s this. And this.

Okay, so it isn’t exactly the future of the database, but it is the future of the Oracle database. It’s easy to see how someone could get confused about that distinction.

In any case, snark aside, they have made some pretty impressive changes. Read all about it here.

Or better yet, hear it directly from Larry: