Just go out and collect enormous amounts of data, put it through the right pre-processing, and turn the crank on various statistical learning algorithms – and all your problems will be solved. That is the argument made in this opinion piece: Alon Halevy, Peter Norvig, and Fernando Pereira, "The Unreasonable Effectiveness of Data," IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, Mar./Apr. 2009.
As an AI and machine learning researcher myself, I should be happy with this argument. But their argument seems a bit too glib, and something about it makes me rather uneasy. One source of unease is that the story has been far from this simple in most domains I really care about (ranging from humanoid robotics to computational finance). One common theme underlying these applications – surely not the only one – is the need for autonomous decision making in a closed-loop, 'large-world' setting, as I've argued in previous posts. I often hear people say that one must just be patient: as component algorithms (for everything from object recognition to on-line regression) get better, these problems will go away. Personally, I don't buy the simplistic version of that argument either – I think we're missing much more.
Now, the authors are well-established AI researchers (many of us started in AI with the Russell and Norvig book!) and have no dearth of exposure to problem domains. However, their current interest – judging by the problems discussed in the article, and knowing who their employer is – seems restricted to the class of problems that characterizes the business of google/facebook/netflix/… Here, data is indeed plentiful (how many web pages and random facebook comments are there?!). Moreover, domain experts do not seem to know all that much (e.g., what is the best descriptor of social cohesion, or of movie preferences?), so machine learning can have a field day. It is worth noting, though, that in most of these commercial successes one does not allow the 'agent' to stray too far on its own: success is determined as much by careful feature design and data pre-processing as by algorithmic sophistication. Indeed, when I started reading about the recent Netflix Prize contest, I was struck by how little core innovation was required compared with the elbow grease put in by a very talented set of statistical experts. So the end result was far from the 'autonomous agent' of my dreams.
One big difference between the decision-making problems of interest to me and, say, statistical machine translation over a large corpus is that it is quite hard to get data that enumerates all contingencies (and the changes in dynamics associated with those contingencies). How exactly should I go about finding the data needed to enumerate and statistically model all possible shocks in a complex market? And then, how can I come up with decision-making strategies that work with this kind of data, incomplete as it necessarily and unavoidably is? In my world, things change all the time – that is the very essence of the problem! These are the issues that make generic frameworks like reinforcement learning intractable…
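To make the non-stationarity point concrete, here is a toy sketch of my own (not from the article): a two-armed bandit whose reward means swap halfway through the run. Everything here – the arm means, the step at which the world shifts, the learners' parameters – is invented purely for illustration. A learner that pools all of its history as if it came from one fixed distribution adapts far more slowly after the shift than one that discounts old evidence; and neither had any way to anticipate the shift from the data that preceded it.

```python
import random

random.seed(0)

# Toy two-armed bandit whose world changes mid-stream: at step 500 the
# arms' reward means swap. Both learners are epsilon-greedy; one estimates
# values by plain sample averages (all history weighted equally), the
# other by a constant step size (old evidence is discounted).

T = 1000
eps, alpha = 0.1, 0.1

means = [1.0, 0.0]                   # arm 0 starts out better
q_avg, n_avg = [0.0, 0.0], [0, 0]    # sample-average learner
q_rec = [0.0, 0.0]                   # recency-weighted learner
regret_avg = regret_rec = 0.0

for t in range(T):
    if t == T // 2:
        means.reverse()              # the 'shock': the dynamics change

    # Sample-average learner: weights all history equally.
    a = random.randrange(2) if random.random() < eps else q_avg.index(max(q_avg))
    r = means[a] + random.gauss(0, 1)
    n_avg[a] += 1
    q_avg[a] += (r - q_avg[a]) / n_avg[a]
    regret_avg += max(means) - means[a]

    # Recency-weighted learner: forgets old evidence at rate alpha.
    a = random.randrange(2) if random.random() < eps else q_rec.index(max(q_rec))
    r = means[a] + random.gauss(0, 1)
    q_rec[a] += alpha * (r - q_rec[a])
    regret_rec += max(means) - means[a]

print(f"total regret, sample-average learner  : {regret_avg:.0f}")
print(f"total regret, recency-weighted learner: {regret_rec:.0f}")
```

The recency-weighted learner does markedly better here, but only because it was handed the right forgetting rate for a shift it did not predict – which is precisely the kind of contingency one cannot enumerate from a static corpus.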
Incidentally, the authors knock economists for their fiction of a concise theory. But that same domain is an excellent case for my point: what, then, is a good alternative for decision making? The reason people (e.g., quants) still depend on analytical theories, despite their well-known inconsistencies, is that such theories provide a basis for analysis, and that is crucial. Even if I use non-parametric models of price processes, I still need to come back and ask questions about risk – and risk analysis makes quite different demands and has quite different implications for model structure!
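A contrived numerical illustration of that last point, with numbers I have made up: fit the same fat-tailed return series two ways and ask it the risk question. A Gaussian fit reproduces the mean and variance, so average-case predictions look perfectly reasonable, yet it understates the 99% value-at-risk that the empirical tail reveals. The point is not that either model is right, but that the risk question stresses model structure in a way average-case prediction does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily returns with fat tails (Student-t, 3 degrees of
# freedom, scaled) -- invented purely for illustration.
returns = 0.01 * rng.standard_t(df=3, size=100_000)

# Gaussian fit: matches the mean and variance, so point forecasts and
# average-case error look perfectly reasonable...
mu, sigma = returns.mean(), returns.std()
z01 = -2.326                          # 1% quantile of the standard normal
var99_gauss = -(mu + sigma * z01)     # 99% value-at-risk under the fit

# ...but the empirical (model-free) tail tells a different story.
var99_emp = -np.quantile(returns, 0.01)

print(f"99% VaR, Gaussian fit: {var99_gauss:.4f}")
print(f"99% VaR, empirical   : {var99_emp:.4f}")   # noticeably larger
```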
So what is one to do? Don't get me wrong – I do believe that data-driven methods have their benefits; data is indeed unreasonably effective, in certain settings. However, we need to enable our agents to get their own data – after reasoning about what they want to do with it – and to couple that with constantly changing representations, depending on which problem formulation allows the decision problem of the moment to be solved tractably (i.e., within bounds on rationality and time). If we are one day to have human-competitive autonomous agents, perhaps we should not restrict ourselves to the isolated statistics-in-a-database paradigm that seems so convenient in the few famous commercial successes of the day.
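As a caricature of what 'getting your own data' buys an agent, consider this sketch (entirely hypothetical: a noiseless 1-D threshold concept, with the hidden threshold chosen arbitrarily). A passive learner consumes randomly sampled labelled points; an active one reasons about what it still does not know and queries exactly there. The active learner's uncertainty shrinks exponentially in the number of queries rather than roughly linearly.

```python
import random

random.seed(1)

# The world: a noiseless threshold concept f(x) = 1 iff x >= theta on [0, 1].
theta = 0.7317                       # hidden threshold, picked arbitrarily

def label(x):
    return int(x >= theta)

def passive(n):
    """Consume n randomly drawn labelled points."""
    lo, hi = 0.0, 1.0                # tightest interval known to contain theta
    for _ in range(n):
        x = random.random()
        if label(x):
            hi = min(hi, x)          # theta must be at or below x
        else:
            lo = max(lo, x)          # theta must be above x
    return hi - lo                   # remaining uncertainty about theta

def active(n):
    """Choose each query where the current model is least certain."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        x = (lo + hi) / 2            # query the midpoint of the unknown region
        if label(x):
            hi = x
        else:
            lo = x
    return hi - lo

print(f"uncertainty after 20 passive queries: {passive(20):.1e}")  # ~1e-1
print(f"uncertainty after 20 active queries : {active(20):.1e}")   # ~1e-6
```

The toy is deliberately simple, but it captures the shift I am arguing for: the query strategy is itself a decision, made in the loop, rather than a pre-processing step over a database someone else collected.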