The Success & Resounding Failure of "Big Data"
Big Data hype faded, but its lessons matter—most projects fail without clear goals. The future lies in turning info into actionable knowledge.
A Retrospective View from the Top of the Hype Cycle
Image by Samuele Porta
While doing some research, I stumbled across a modified copy of a presentation I gave at a Restaurant Marketing Executives conference right when “Big Data” was at the peak of Gartner’s Hype Cycle. It’s hard to believe that was almost a decade ago, and it’s been years since Gartner stopped tracking “Big Data” altogether. Buzzwords are a funny, fickle thing, especially when they suddenly overrun an industry previously inhabited by geeks and introverts. Within a matter of a few years we got a title upgrade from ‘data geeks’ to ‘data scientists’, and then, in an unsettling headline, HBR declared us the sexiest profession of the 21st century. By 2014 “Big Data” was at the top of the hype cycle and investment dollars were pouring into any business that could pivot its vernacular fast enough. It was a surreal turn of events: suddenly there were people calling themselves “Big Data” experts who had never even laughed at a math joke, and I was on stage speaking in front of a crowd of 300 marketing executives, many of whom had slept through my data presentations in their own board rooms long before data scientists were sexy. Like all buzzwords, it was just a matter of time until the phrase “Big Data” began its tumble into obscurity, but the limited life of the hype should not be confused with the economic and social impact of the movement behind it. The data deluge is real, and we’re still in the early innings in terms of realizing seismic economic and social impact, but that impact has been different from what many of us anticipated.
The Success of ‘Big Data’
Estimates of the total economic value of the data economy vary wildly, from hundreds of billions to trillions of dollars in 2020, so while the size of the phenomenon may be in dispute, the fact that it is huge no longer is. Though it avoided the word “Big,” The Economist predicted in 2017 that the ‘data economy’ would be the driving force of this century the way fossil fuels drove the 20th. Most estimates that place a value on today’s ‘data economy’ count data movement, processing, and storage companies, plus the advertising and commerce revenue of the likes of Amazon, Google and Facebook.
The Failure of ‘Big Data’
Despite massive growth in the data economy, and wide acceptance that businesses that leverage data are more successful than those that don’t, ‘big data’ and data science implementations have had a shockingly low success rate. While it’s hard to get a precise number, the estimates I’ve found put the failure rate at 60–85%.
A 60–85% failure rate is an unnecessarily dismal record, and while there are many reasons why data projects fail, the culprit is often that the project is spearheaded by data infrastructure experts who don’t fully understand the bigger business goals the data project serves. Their credentials may be impeccable for building an e-commerce site or a basic reporting engine, but they have little appreciation of how profoundly different the technical and process requirements of a data science initiative are.
Successful data science projects start with a clear vision of a specific business problem that needs to be solved and are driven forward by teams aligned on that vision from top to bottom. Unfortunately, that’s not how most data projects in the ‘Big Data’ era started. Instead they’d start with the construction of an all-purpose data platform or ‘data lake’ designed to store all data from all sources, without getting into the minutiae of the specific data uses. The first stop was a seemingly inexpensive HDFS file system (aka Hadoop), which seemed like cheap file storage, unless you wanted to query it; then you’d have to assemble a team of engineers, and since no one had experience with the new technology you’d have to poach them from Facebook and pay each of them more than your CFO. Once you had your data lake up and running 6–12 months later, you still needed to build a reporting layer on top of it. If you happened to be responsible for creating insights with the data through custom analyses, odds are you were a distant afterthought and there wasn’t really an access point for you in this technology stack. So you jerry-rig an export through the reporting layer into a sandbox server, where you have to plead with IT to let you install Python Jupyter notebooks or an R server. Adding insult to injury, you discover that nowhere in this super-powered modern data architecture has ongoing data hygiene been contemplated, so the ‘simple’ data export you rigged returns the same errors with every pull and there is no protocol for correcting the upstream data. To overcome the shortcomings of the massive data lake that actually makes the relevant data tortuously inaccessible, your simple export becomes a separate work stream and database with an independent ETL process written in Python or R, and your “sandbox” rapidly starts to rival the master data lake in size. Since your work is still an unofficial part of the data stack, your ML code remains in your sandbox (or maybe your laptop). God help the poor sucker who tries to pick up where you left off when you leave!
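To make that shadow pipeline concrete, here is a minimal sketch of the kind of jerry-rigged sandbox ETL described above: pull a flat export from the reporting layer, re-apply the same hygiene fixes on every run because nothing is ever corrected upstream, and load the result into a private sandbox database. Everything here is hypothetical and illustrative — the file name, the column names, and the SQLite sandbox are assumptions, not any particular team’s stack.

```python
import sqlite3
import pandas as pd

# Hypothetical names, for illustration only.
EXPORT_PATH = "reporting_layer_export.csv"  # flat export rigged through the reporting layer
SANDBOX_DB = "sandbox.db"                   # local database that quietly becomes a shadow data lake


def clean_export(path: str) -> pd.DataFrame:
    """Re-apply the same hygiene fixes on every pull, since upstream data is never corrected."""
    df = pd.read_csv(path)
    df = df.drop_duplicates()
    # 'order_date' and 'revenue' are placeholder columns standing in for
    # whatever fields your analysis actually needs.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce").fillna(0.0)
    return df.dropna(subset=["order_date"])


def load_to_sandbox(df: pd.DataFrame, db_path: str = SANDBOX_DB) -> None:
    """Overwrite last pull's copy in the sandbox; nothing flows back upstream."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql("orders_clean", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    load_to_sandbox(clean_export(EXPORT_PATH))
```

The point of the sketch is not the code but the pattern: the cleaning logic, the schema knowledge, and eventually the ML work all live outside the official stack, which is exactly why no one can pick up where you left off.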
As organizations demand ever more from data and the expectation of a world powered by AI/ML comes into focus, the shortcomings of the data systems and processes built during the first phase of this data revolution are the dirty secret that no one wants to admit about their overpriced data lake investment. But it’s the truth, it’s widespread, and it’s glaring.
Knowledge Proliferation
Transforming the massive deluge of bits and bytes of data into meaningful information is a monumental task, one that is transforming entire industries and reshaping the global economy before our eyes. Yet it pales in comparison to the next level: transforming that information into knowledge. I’m not just making a semantic argument. Knowledge implies a deeper understanding of the patterns within information and the ability to draw inferences. Knowledge implies facts and understanding; if you believe something that is false, you don’t actually have knowledge, though you do have information. This is a crucial distinction in these times, because information moves much faster than our ability to separate fact from fiction, correct from incorrect, and bad actors are often rewarded for being first regardless of accuracy or truth. At the organizational and project level, when you start by defining project goals in terms of the knowledge and facts you are seeking, and align the systems and team around those knowledge goals rather than being driven first by the data you’re putting into them, you’ll get to an answer faster, more cheaply, and more durably.

Image by Samuele Porta
At the societal level, a massive explosion of data and information like the one we are experiencing should lead to an equal or greater explosion in science, technology and culture: an enlightenment the likes of which the world hasn’t seen since the invention of the mechanical printing press. AI will be at the center of this transformation, not as decision-making robots but as purveyors of facts, knowledge and, if done right, dare I say, unbiased truth. In our current data quagmire, however, real damage is being done to the fabric of our society by our inability to transform information into knowledge before it can be disseminated as misinformation and propaganda. From politics to anti-science movements, that misinformation amplifies voices that not only lack substance but do real harm. The success of “Big Data,” or whatever we care to call it today, has never been more important.
Read the original post and subscribe for updates here.