I spent more than a decade forecasting futures as the manager of a hedge fund. We had tick-by-tick data going back decades, but there was a huge random component to this data that made automated prediction beyond a certain accuracy impossible. All the motives people have for buying and selling at a particular moment, combined with the sheer number of people trading, meant that no matter what we did, we’d never perfectly pluck signals from the noise.
In data science, we call these intractable problems: past a certain point, no amount of analytics or big data will make further progress against them.
The good news is that many problems that at first seem intractable can be addressed by tweaking your approach or your inputs.
Knowing when problems that seem intractable can be solved with some affordable changes will position a business—and a project sponsor—for ongoing success. Conversely, being able to recognize problems that are defined at an unrealistic scale will prevent squandering time and money that you could profitably apply to a more focused question.
Here are four troubleshooting methods that could improve your results. By iteratively applying one or more of them, you can stop banging your head against a wall and increase the chances of finding value in your analytics work.
1. Ask a More Focused Question
Often, the best way forward is to try to solve for a subpart of your original question and extrapolate lessons. Trying to determine the likelihood that any given social-media user will be interested in a car model you’re designing is likely intractable. Even with lots of good data, you might have too many variables to arrive at a model with real predictive value.
But you might be able to predict an increase or decrease in sales to a specific demographic. From there, you could determine whether a change such as a boxier design would boost sales to soccer moms more than it would hurt sales to single twentysomethings. That’s a more manageable problem scope that still delivers real value to your business.
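The trade-off can be reduced to simple arithmetic once you have per-segment estimates. A minimal sketch, using entirely invented segment sizes and predicted lift figures (none of these numbers come from real market data):

```python
# Hypothetical figures for illustration only: estimated effect of a boxier
# design on unit sales, broken out by customer segment.
segment_size = {"soccer moms": 120_000, "single twentysomethings": 80_000}
predicted_lift = {"soccer moms": 0.04, "single twentysomethings": -0.03}

# Net change in units sold across segments: positive means the design
# change gains more sales in one segment than it loses in the other.
net = sum(segment_size[s] * predicted_lift[s] for s in segment_size)
print(f"net change in units sold: {net:+.0f}")  # → +2400
```

Even this toy version makes the point: you don’t need a model that predicts any individual buyer, only segment-level estimates good enough to compare.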
The same approach can help you isolate variables that are throwing off your algorithm. Instead of trying to predict hospital readmission rates for all patients, for example, you might divide a patient set into two groups—perhaps one of patients with multiple significant conditions and the other of patients with only a single condition, such as heart failure.
If prediction quality in one group improves or declines meaningfully, that tells you your algorithm works for a data set that is not just smaller, but specifically free of a particular confounding variable present in the larger pool.
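The cohort-split diagnostic above can be sketched in a few lines. Everything here is simulated for illustration: the patient generator, the age-threshold "predictor," and the assumption that comorbidities add noise are all invented, not a real clinical model.

```python
# Sketch: compare prediction quality across two patient cohorts to expose
# a confounding variable (multiple conditions). All data is simulated.
import random

random.seed(0)

def make_patient(multi_condition):
    """Simulate one patient: readmission risk rises with age, and
    multiple conditions (the confounder) add extra noise."""
    age = random.randint(40, 90)
    base = (age - 40) / 50
    spread = 0.4 if multi_condition else 0.1
    noise = random.uniform(-spread, spread)
    return {"age": age, "readmitted": base + noise > 0.5,
            "multi": multi_condition}

patients = ([make_patient(True) for _ in range(500)] +
            [make_patient(False) for _ in range(500)])

def accuracy(cohort):
    """Score a naive age-threshold predictor on one cohort."""
    correct = sum((p["age"] > 65) == p["readmitted"] for p in cohort)
    return correct / len(cohort)

single = [p for p in patients if not p["multi"]]
multi = [p for p in patients if p["multi"]]

print(f"single-condition accuracy: {accuracy(single):.2f}")
print(f"multi-condition accuracy:  {accuracy(multi):.2f}")
```

Because the multi-condition cohort carries more noise, the same simple predictor scores noticeably worse there, which is exactly the signal that a confounder is present in the larger pool.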