More and more data – is that all we need for robust autonomy?

Just go out and collect enormous amounts of data, put it through the right pre-processing and turn the crank on various statistical learning algorithms – and all your problems will be solved. That is the argument being made in this opinion piece: Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, pp. 8 – 12, Mar – Apr 2009.

As an AI and machine learning researcher myself, I should be happy with this argument. But their argument seems a bit too glib, and something about it makes me rather uneasy. One source of unease is that the story has been far from this simple in most domains that I really care about (ranging from humanoid robotics to computational finance). One (but surely not the only) common theme underlying these applications is the need for autonomous decision making in a closed-loop and ‘large’ world setting, as I’ve argued in previous posts. I often hear people say that one must just be patient – as component algorithms (for everything from object recognition to on-line regression) get better, these problems will go away. Personally, I don’t buy the simplistic version of that argument either – I think we’re missing much more.

Now, the authors are well-established AI researchers (many of us started in AI with the Russell and Norvig book!) and have no dearth of exposure to problem domains. However, their current interest – as demonstrated by the problems discussed in the article and knowing who their employer is – seems restricted to a class of problems that characterize the business of Google/Facebook/Netflix/… Here, data is indeed plentiful (how many web pages and random Facebook comments are there?!). Moreover, domain experts do not seem to know all that much (e.g., what is the best descriptor of social cohesion or movie preferences?) – so machine learning can have a field day. It is perhaps worth noting that in most of these commercial successes, one doesn’t allow the ‘agent’ to stray too far on its own – success is determined as much by careful feature design and data pre-processing as by algorithmic sophistication. Indeed, when I started reading about the recent Netflix contest, I was struck by how little core innovation was required compared with the elbow grease on the part of a very talented set of statistical experts. So, the end result was far from the ‘autonomous agent’ of my dreams.

One big difference between the decision making problems of interest to me and, say, statistical machine translation using a large corpus, is that it is quite hard to get the data that enumerates all contingencies (and changes in dynamics associated with those contingencies). How exactly should I go about finding the data necessary to enumerate and statistically model all possible shocks in a complex market? Then, how can I come up with decision making strategies that work with this kind of data (incomplete as it necessarily and unavoidably is)? In my world, things change all the time and that is the very essence of the problem! These are the issues that make generic frameworks like reinforcement learning intractable…
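To make the non-stationarity point concrete, here is a minimal toy sketch (not a model of any real market). It assumes a deterministic stream whose level jumps from 1.0 to 5.0 after the historical record ends; the names `batch_mean` and `online_estimates`, the step size `alpha`, and all the numbers are hypothetical choices for illustration. It only shows that an estimate frozen on the database stays wrong after a shift while a simple online update tracks it; it says nothing about anticipating the shift, which is the harder problem.

```python
# Toy illustration (all values are assumptions, not data from any real domain).

def batch_mean(history):
    """Estimate fitted once on the historical database and then frozen."""
    return sum(history) / len(history)

def online_estimates(stream, start, alpha=0.1):
    """Exponentially weighted update: move a fraction alpha toward each new value."""
    estimate = start
    trace = []
    for x in stream:
        estimate += alpha * (x - estimate)
        trace.append(estimate)
    return trace

history = [1.0] * 100  # what the database contains
future = [5.0] * 100   # what the world does after a shock absent from the database

fixed = batch_mean(history)
adaptive = online_estimates(future, start=fixed)

batch_error = abs(future[-1] - fixed)          # stays large: the model never moves
online_error = abs(future[-1] - adaptive[-1])  # shrinks geometrically with each step

print(f"frozen estimate error: {batch_error:.4f}")
print(f"online estimate error: {online_error:.4f}")
```

The constant step size is what lets the online estimate forget the old regime; an estimator that averages over everything it has ever seen would adapt far more slowly.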

Incidentally, the authors knock economists for their fiction of a concise theory. But the same domain is an excellent case for my point. What, then, is a good alternative for decision making? The reason people (e.g., quants) still depend on analytical theories, despite well-known inconsistencies, is that these theories provide a basis for analysis, which is crucial (e.g., even if I use non-parametric models of price processes, I do need to come back and ask questions about risk analysis – which makes quite different demands and has different implications for model structure!).

So what is one to do? Don’t get me wrong – I do believe that data-driven methods have their benefits. Data is indeed unreasonably effective in certain settings. However, we need to enable our agents to get their own data – after reasoning about what they want to do with it – and couple it with constantly changing representations, depending on which problem formulation allows the decision problem of the minute to be solved tractably (e.g., with bounded rationality and time). If we are to one day have human-competitive autonomous agents, perhaps we should not restrict ourselves to the isolated statistics-in-a-database paradigm that seems so convenient in the few famous commercial successes of the day.


4 thoughts on “More and more data – is that all we need for robust autonomy?”

  1. The problem is it is unfalsifiable. If you try it and it doesn’t work then you didn’t have enough data (or you didn’t have the right algorithms). At least there is verifiable. Maybe someday they will be proven right.

    • Alan, if I understood your comment correctly, you are saying that an active approach (wherein the agent is in charge of collecting data, reasoning about representation change and formulating a decision strategy – all in one integrated loop) is unfalsifiable. If so, you are saying that corpus-based methods give you more verifiable models.

      I don’t think this is accurate. As such, actively learning from experience can be done in a very principled way. For instance, reinforcement learning provides a body of algorithms with strong guarantees regarding convergence, sample complexity requirements, etc. In many cases, I would argue that this draws on a deeper body of mathematical foundations (in the form of the theory of stochastic games) than alternate forms of statistical models that do not actually address the problem of decision making over time. So, I don’t know what verifiability even means in that sense. Now, many modern trends within machine learning acknowledge the relevance of on-line and multi-task learning. Independently, many people in AI are coming back to ask about representation refinement methods. But only very few people seem to be looking at the intersection of the two, or even thinking of that intersection as being important.

      The main difficulty with RL and other active learning tools is computational tractability when applied naively/directly. So, any time one manages to solve such a problem, that person has thought carefully about what data to use (crucially, which data to avoid), what features to use (correspondingly – avoid), etc. So, my point is – shouldn’t we be looking at (as a few people are already trying to do) how the algorithm can incorporate this crucial sleight of hand? Without that, data can’t do magic in and of itself. Then, we can extend analytical guarantees to take this into account as well. Of course, this is a tall order and may take time – but that’s why we do research!

      On the latter point – I don’t think this is the sort of issue where anyone can be proven right. If one restricts attention to domains where the world is stationary and people provide plenty of data which more or less enumerates every contingency, then of course an efficient nonparametric algorithm that mines it will suffice. The question, as I asked in my post, is this: in other domains, where stationarity is a nonsensical assumption (e.g., what shocks will impact your pension fund) and there is no sensible way to enumerate all contingencies within a historical database, what choice does one have other than to adopt an online and adaptive approach that includes reasoning about a strategic adversary and changing both representations and strategies online?

      We have just submitted a paper on such an asset allocation agent which does take the truly data-driven approach advocated by Halevy et al. The initial results (and reviews – both by academics and practitioners) have been encouraging. However, when I sit back and ask what I am least happy about in that work – it is the lack of foresight to generate what wasn’t in the database. So, this is what I am really keen to get on with next!

  2. The question is, for an event with tiny probabilities, what does a contingency really mean? At its core, RL is a combination of a data-based approach and some structural insight, but since it does not have foresight, it will still be misled in a disaster scenario.

    Now obviously, foresight can come in two flavors. In one, some type of external prior knowledge is incorporated, such as the laws of physics for a walking biped, in which case one can argue that “data is not everything”, since I could have plans that scale to various situations which I might never observe in 100 years but which still have positive probability.

    In finance this is less clear. What is foresight in this case? My definition for a financial system would be more in terms of contingency plans: if a shock happens, I can move from the “unstable” post-shock equilibrium to a low-cost “stable” equilibrium. The cost/path for such a move shouldn’t be exorbitant. Conceptually it is different from the traditional efficient-frontier type of argument, because there one is still relying on data to get the convex decision boundaries.

    Is RL the right approach to get there? I am not sure. On the other end of the spectrum, in the safe operations scenario, possibly a statistical model can perform as well as a human. For example, in currency trading, decision models can perform better than humans since the frequency of decision making is increased… Maybe there, the basins of attraction of the equilibrium in which the market operates are wide…

    Still, last week I learned that billions were made within days banking on the bonds for the Dubai World construction company. Such trading has nothing to do with system dynamics, but with information aggregation, which in a way goes to the heart of database statistics, etcetera. There has been some work on aggregating such signals from the web. But still, the web is certainly not the same as having a hotline to the sheik…

    • Actually, I am not all that committed to the technical tools that are currently available for RL (where I fully agree that there isn’t enough foresight to address the core problem). Instead, I am arguing that the aspiration of the active learning paradigm underlying the ‘RL problem’ should be taken more seriously. Under this broad definition of the problem, I include a whole gamut of related ideas ranging from bandit problems to online learning (e.g., Cesa-Bianchi and Lugosi’s book entitled Prediction, Learning and Games) and even various ideas within economics (e.g., I have been occasionally browsing a book on the Economics of Search by McCall and McCall which surveys a rich literature on how to actively ‘look for things’ in an economic agent context; another interesting source seems to be Peyton Young’s work summarized in a short book entitled Strategic Learning and its Limits). So, I am really arguing that – for my problems of interest – active learning, proactively driven by the agent embedded in the problem domain, is the way to go. The question then is how best to do this. Again, just to be clear, I am perfectly happy if statistical tools offer useful solutions. However, at some level, many traditional statistical tools seem to stop just short of the ‘decision problem’, leaving many things to be handled by the user of the tools. In this context, it seems to me that there may be a role for other types of reasoning tools that can guide such a search process. I am keeping my options open on that front.
