The Most Boring Guy at the AI Party
Bad data kills AI. Skip the hype—strong results come from boring, meticulous data work, not flashy shortcuts.
Dirty Data Ain't Sexy
A funny thing happened just after I kicked off my last startup. The global economy melted down, courtesy of some Wall Street ‘quants’ and ‘financial engineers’ who designed exotic financial products based on mortgage loans. The great part about mortgage loans, according to portfolio theory, is that if you put a whole lot of them into one investment instrument there’s almost no risk, because people don’t want to lose their homes, so mortgages are the last thing they default on. To improve the yield, they mixed in high-risk mortgages, theorizing that it would increase returns without adding risk. This was the product of some mathematical gymnastics, courtesy of the sophisticated but opaque models built by Wall Street’s talented legions of quants.
These instruments became extremely popular when a series of derivative products was built on top of them (essentially bets that the security would default). The banks were so confident the instruments were safe that they sold them hand over fist, and bets were created on the bets, and bets on top of those bets. They became so popular that the banks literally couldn’t write mortgage loans fast enough to satisfy demand. So for a little while in the 2000s it became insanely easy to get a mortgage: no money down, interest only, meaningless appraisals. The rest is history.
When 70% of Wall Street got canned overnight, it was suddenly much more affordable to hire quants. So I was able to hire some great talent to develop the original algorithms for my fledgling analytics startup. It wasn’t long before they all got called back to their jobs on Wall Street, often with a big promotion, because they were the only people who had any idea how to unwind the mess that had been created by the original algorithms and instruments.
Amplify the Nonsense
One of my professors from grad school frequently says “math is easy, data is hard,” which usually elicits some cock-eyed looks, but she’s right. For those of us working on tangible solutions to real-world problems with advanced statistics, data science and machine learning, the bulk of the work, and the key to getting the right answer, is always in the data prep. Visualizing, whiteboarding and scripting quantitative solutions is usually quick compared to the data work, and frankly it’s the fun part. The basics of even the most complex approaches are pre-coded in some library or ready to go through a commercial platform. But the data… that’s another story altogether.
Scarcity of data used to be the big problem; now it’s wrangling the data overload, which raises the danger of amateurs wreaking havoc. Plentiful data and open libraries come loaded with potential bias, skew, errors and often pure nonsense, and there have long been loud voices insisting that “the models will sort it out.” That chorus has grown even louder as legions of newly minted AI experts declare that ChatGPT or Llama or Grok will fix it all… sigh…
What’s difficult to get people to understand is that AI applied to bad data does nothing but amplify the bias, the skew and the nonsense. The difficulty isn’t that it’s beyond their intellect; it’s that it’s such a boring, unsexy reality, and it puts a damper on a much better AI narrative. They don’t want to let go of the fantasy where AI does all the crappy things for us and everyone can make a living having robots write articles and make custom art like a psychedelic Vermeer. And let’s not forget about valuation: just plug into an LLM account and slap on an ‘AI’ sticker! This is not a moment when anyone is particularly interested in hearing from the guy questioning the data sources, biases and methods beneath the “AI Inside” labels.
It makes one about as much fun as someone in 2007 who posed the question: “What exactly makes a portfolio full of subprime mortgages in Florida low risk? How many bad mortgages does your model say it takes to make a good security?” Nobody likes the jerk trying to spoil the money-party, and even if he’s right that just makes him a smug jerk.
There is No AI without Data
To be clear, I’m not predicting the imminent collapse of all things built on AI, and I’m not an AI detractor. Quite the opposite: its success is my livelihood, and I’m optimistic about how it will change our lives for years to come. I do not spend time worrying about machines gaining consciousness and conspiring to control us like a magic omniscient brain in the sky. I do, however, worry about hype, ignorance, and ambivalence, because it doesn’t take ‘bad actors’ to produce bad outcomes. Artificial intelligence is structured in math, put in motion by code and fueled by data. As with any machine, if the fuel (data) is tainted, it won’t work the way it was intended, and higher-quality fuel (data) gets better performance.
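For the quantitatively inclined, the fuel analogy can be sketched in a few lines of Python. This is a minimal illustration, not anything from a real project: the data are made up (a perfectly clean relationship, y = 2x), the helper `ols_slope` is a hypothetical name, and the "taint" is an assumed unit error where three of twenty records are entered 100 times too large.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Made-up clean data: the true relationship is exactly y = 2x.
xs = list(range(1, 21))
clean_ys = [2.0 * x for x in xs]

# Assumed taint: the last three records carry a unit error (100x too large).
tainted_ys = list(clean_ys)
for i in (17, 18, 19):
    tainted_ys[i] *= 100

clean_slope = ols_slope(xs, clean_ys)
tainted_slope = ols_slope(xs, tainted_ys)

print(f"slope fitted on clean data:   {clean_slope:.2f}")
print(f"slope fitted on tainted data: {tainted_slope:.2f}")
```

Same math, same code, three bad records out of twenty, and the fitted slope lands nowhere near 2. The model did not sort it out.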
Bet on Boring
Throughout most of my career applying advanced mathematics to large data sets, few people outside the data and analysis group cared about the methods we employed, and fewer still could name an ML procedure. For practitioners, ML methods dominate the toolkit because of the results they produce and the efficiency they create. You don’t hear the people in the trenches talking about ML and AI; you hear them talking about results, about code, and about the specific processes applied and how to improve them. Machine learning methods are so ubiquitous in organizations with analytical and data bona fides that calling them out as something of note is almost embarrassing.
When hype cycles kick into full swing, new investment and new players enter the market, and everyone focuses on the ‘sexiest’ application that can get to market fastest with the most buzz-worthy elevator pitch. In retrospect, ‘sexy’ fizzles. What endures are the applications that painstakingly got the data right for a specific audience and made it accurate and relevant. It’s why FinTech can make applications that do practical things with accuracy (slowly, but eventually), while MarTech always comes up short on the promise. Marketers buy sizzle and accountants buy practicality… boring.
For some of us, data is not boring. In fact, it’s exciting, and so are the moments of discovery when a new method works better than expected or reveals insights you didn’t expect. The AI hype cycle is going to continue for quite a while, and because the field is so broad it will have many manifestations over the coming years. Eventually all businesses will have ‘AI Inside’ whether they realize it or not, so how will you know whether engaging with them is an asset or a liability? Get boring and ask about the data inputs: why certain data were sourced, how the quality is maintained, and why ML/AI is used at all. Even if you’re a data novice, the answers may be telling.