Is Big Data a Bubble?

Much has been written about big data (insert air quotes) over the last 12 months, and articles are now regularly showing up in mainstream publications (also see: Six Provocations for Big Data, IBM’s Big Data landing page, and a couple of NYT articles from the past few months here and here). On a panel at The TV of Tomorrow conference, held earlier this month in San Francisco, Jeremy Toeman suggested big data was a bubble. He made this comment with reference to Twitter and other similar data. I’ll call these public data, which suggests there are also private data; I’ll say more about those below.

I countered Jeremy’s comment on Twitter (ironically), saying that data are more than just Twitter, at which point Jeremy suggested data [were] still overhyped.

My response on Twitter suggested that the term “overhyped” has a somewhat technical meaning: it implies over-investment. I’ll elaborate on this shortly.

The distinction between public data and private data is an important one. Public data are available to anyone, so anyone with sufficient computing resources can capture them. And there are definitely examples of companies (and even the President) mining public data like Twitter to help make more informed decisions. Private data, on the other hand, are captured by privately held devices. These data include things like personal health metrics, location-based data, etc. One can make private data publicly available (in turn making them public data), but the reverse is not true – you can’t turn public data into private data. Another flavor of publicly available data is data available to a large group of individuals. We could call these quasi-public data. Sites like We Know What You are Doing and Please Rob Me have taken advantage of data in the limbo between private and public.

My comment to Jeremy pointed out that the most valuable data – private data – are often captured in the process of performing a service, which means data capture isn’t the primary purpose. In other words, data captured solely for the sake of capturing data might not be very valuable, but data captured while providing another service can be very valuable. These data get at user habits, attitudes, and preferences indirectly. When data can reveal our underlying preferences, then over time computing and algorithms can begin to make decisions on our behalf. Many worry about the wrong decisions being made, but I would counter that we are only going to outsource unimportant decisions to algorithms until the algorithms improve to the point where the potential costs of a bad decision no longer outweigh the potential benefits of a good decision.

I also suggested in my reply that hype is typically accompanied by over-investment, and that does not appear to be the case with data-oriented services currently. This is especially true of private data. The potential is certainly there, and the term “big data,” like “cloud computing,” is definitely a darling of the news media, but I don’t see a traditional bubble in data just yet.

There has definitely been significant investment in public data projects, but more investment is still needed (and coming) in private data.