A poet once said, “The whole universe is in a glass of wine.” We will probably never know in what sense he meant that, for poets do not write to be understood… How vivid is the claret, pressing its existence into the consciousness that watches it! If our small minds, for some convenience, divide this glass of wine, this universe, into parts– physics, biology, geology, astronomy, psychology, and so on– remember that nature does not know it! So let us put it all back together, not forgetting ultimately what it is for. Let it give us one more final pleasure: drink it and forget it all!
-Richard Feynman, from Six Easy Pieces
The Netflix Tech Blog has a very insightful piece: http://techblog.netflix.com/2011/01/how-we-determine-product-success.html.
In particular, this point is important, although it is easily, and often quite willingly, ignored by many researchers:
There is a big lesson we’ve learned here, which is that the ideal execution of an idea can be twice as effective as a prototype, or maybe even more. But the ideal implementation is never ten times better than an artful prototype. Polish won’t turn a negative signal into a positive one. Often, the ideal execution is barely better than a good prototype, from a measurement perspective.
This very nice article by Isaac Asimov explains the point (thanks to my friend, Vikram Chandrashekar, for the pointer):
Just go out and collect enormous amounts of data, put it through the right pre-processing and turn the crank on various statistical learning algorithms – and all your problems will be solved. That is the argument being made in this opinion piece: Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, pp. 8 – 12, Mar – Apr 2009.
As an AI and machine learning researcher myself, I should be happy with this argument. But their argument seems a bit too glib and something about it makes me rather uneasy. One source of unease is that the story has been far from this simple in most domains that I really care about (ranging from humanoid robotics to computational finance). One (but surely not the only) common theme underlying these applications is the need for autonomous decision making in a closed-loop and ‘large’ world setting, as I’ve argued in previous posts. I often hear people say that one must just be patient – as component algorithms (for everything from object recognition to on-line regression) get better, these problems will go away. Personally, I don’t buy the simplistic version of that argument either – I think we’re missing much more.
Now, the authors are well-established AI researchers (many of us started in AI with the Russell and Norvig book!) and have no dearth of exposure to problem domains. However, their current interest – as demonstrated by the problems discussed in the article and knowing who their employer is – seems restricted to a class of problems that characterize the business of Google/Facebook/Netflix/… Here, data is indeed plentiful (how many web pages and random Facebook comments are there?!). Moreover, domain experts do not seem to know all that much (e.g., what is the best descriptor of social cohesion or movie preferences?) – so machine learning can have a field day. It is perhaps worth noting that in most of these commercial successes, one doesn’t allow the ‘agent’ to stray too far on its own – success is as much determined by careful feature design and data pre-processing as by algorithmic sophistication. Indeed, when I started reading about the recent Netflix contest, I was struck by how little core innovation was required compared with the elbow grease on the part of a very talented set of statistical experts. So, the end result was far from the ‘autonomous agent’ of my dreams.
One big difference between the decision making problems of interest to me and, say, statistical machine translation using a large corpus, is that it is quite hard to get the data that enumerates all contingencies (and changes in dynamics associated with those contingencies). How exactly should I go about finding the data necessary to enumerate and statistically model all possible shocks in a complex market? Then, how can I come up with decision making strategies that work with this kind of data (incomplete as it necessarily and unavoidably is)? In my world, things change all the time and that is the very essence of the problem! These are the issues that make generic frameworks like reinforcement learning intractable…
Incidentally, the authors knock economists for their fiction of a concise theory. But the same domain is an excellent case for my point. What then is a good alternative for decision making? The reason people (e.g., quants) still depend on analytical theories, despite well-known inconsistencies, is that they provide a basis for analysis, which is crucial (e.g., even if I use non-parametric models of price processes, I do need to come back and ask questions about risk analysis – which makes quite different demands and has different implications for model structure!).
So what is one to do? Don’t get me wrong – I do believe that data-driven methods have their benefits. Data is indeed unreasonably effective, in certain settings. However, we need to enable our agents to get their own data – after reasoning about what they want to do with it – and couple it with constantly changing representations – depending on which problem formulation allows the decision problem of the minute to be tractably (e.g., with bounded rationality and time) solved. If we are to one day have human competitive autonomous agents, perhaps we should not restrict ourselves to the isolated statistics-in-a-database paradigm that seems so convenient in the few famous commercial successes of the day.
I helped organize one of the events in the Edinburgh International Science Festival – a talk by Prof. Simon Tett entitled A gentle introduction to climate modelling. As one of the hosts of the event, I had dinner with the speaker and had the opportunity to discuss his views regarding what is hard about modelling these types of processes.
Clearly, the weather seems rather unpredictable and complex. So, what does it take to understand it well enough to be able to make long-term predictions, such as for climate change related questions? Is data really the bottleneck, as many people believe?
Simon pointed out that most of these models are based on a somewhat coarse discretization of the underlying process. For instance, fluid and heat transfer models might be based on cells that are something like 100 square miles across. Clearly, this is going to be ignoring a significant amount of fine structure – some part of which will undoubtedly have larger scale implications. Moreover, the simple solution of just throwing more processors at the problem doesn’t suffice because the problem scales poorly and there is significant sequential structure to these simulations. So, in his view, the hard questions all have to do with structuring models so that both the coarse and fine dynamics can be reasonably captured concisely. Clearly, uncertainty and probabilities play a key role. However, this is quite different from the naive approach of just collecting more and more data in order to refine the underlying distributions. The trick is to first structure the model well enough that as more and more data comes in we really do get a sequentially improved idea of what we really want to understand – the dynamical behaviour of this large system.
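To make the two points about coarse cells and sequential structure concrete, here is a toy sketch of my own (not anything from Simon's models). It assumes a 1D field on a unit interval, block averaging as the coarsening step, and the standard stability limit for an explicit heat-equation scheme, dt ≤ dx²/(2α). All names and numbers are hypothetical, chosen purely for illustration.

```python
import math


def coarse_grain(values, factor):
    """Average consecutive blocks of `factor` fine cells into one coarse cell."""
    return [sum(values[i:i + factor]) / factor
            for i in range(0, len(values), factor)]


def explicit_steps(n_cells, alpha=1.0, length=1.0, t_end=1.0):
    """Sequential time steps needed by an explicit 1D heat-equation scheme.

    Uses the largest stable step, dt = dx^2 / (2 * alpha), so refining the
    grid by a factor k multiplies the number of time steps by k^2.
    """
    dx = length / n_cells
    dt = 0.5 * dx * dx / alpha
    return math.ceil(t_end / dt)


# Hypothetical field: one broad wave plus ripples with a 4-cell wavelength.
n_fine, factor = 64, 8
centres = [(i + 0.5) / n_fine for i in range(n_fine)]
broad = [math.sin(2 * math.pi * x) for x in centres]
ripple = [0.5 * math.sin(2 * math.pi * 16 * x) for x in centres]
fine = [b + r for b, r in zip(broad, ripple)]

# On the coarse grid the ripples average away: only the broad wave survives.
coarse = coarse_grain(fine, factor)
coarse_broad_only = coarse_grain(broad, factor)

if __name__ == "__main__":
    print("largest ripple on fine grid:", max(abs(r) for r in ripple))
    print("largest ripple left on coarse grid:",
          max(abs(a - b) for a, b in zip(coarse, coarse_broad_only)))
    print("time steps, coarse grid:", explicit_steps(n_fine // factor))
    print("time steps, fine grid:  ", explicit_steps(n_fine))
```

In this toy setting the coarse grid cannot see the ripples at all, and recovering them by refining the grid multiplies the number of time steps, each of which depends on the previous one. More processors help with the cells within a step, but not with the growing chain of steps.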
My past few posts have been driven by an underlying question that was pointedly raised by someone in a discussion group I follow on LinkedIn (if you’re curious, this is a Quant Finance group that I follow due to my interest in autonomous agent design, and the question was posed by a hedge fund person with a Caltech PhD and a Wharton MBA):
I read Ernest Chan’s book on quantitative trading. He said that he tried a lot of complicated advanced quantitative tools, it turns out that he kept on losing money. He eventually found that the simplest things often generated best returns. From your experiences, what do you think about the value of advanced econometric or statistical tools in developing quantitative strategies. Are these advanced tools (say wavelet analysis, frequency domain analysis, state space model, stochastic volatility, GMM, GARCH and its variations, advanced time series modeling and so on) more like alchemy in the scientific camouflage, or they really have some value. Stochastic differential equation might have some value in trading vol. But I am talking about quantitative trading of futures, equities and currencies here. No, technical indicators, Kalman filter, cointegration, regression, PCA or factor analysis have been proven to be valuable in quantitative trading. I am not so sure about anything beyond these simple techniques.
This is not just a question about trading. The exact same question comes up over and over in the domain of robotics and I have tried to address it in my published work.
My take on this issue is that before one invokes a sophisticated inference algorithm, one has to have a sensible way to describe the essence of the problem – you can only learn what you can succinctly describe and represent! All too often, when advanced methods do not work, it is because they’re being used with very little understanding of what makes the problem hard. Often, there is a fundamental disconnect in that the only people who truly understand the sophisticated tools are tool developers who are more interested in applying their favourite tool(s) to any given problem than in really understanding a problem and asking what the simplest tool for it is. Moreover, how many people out there have a genuine feel for Hilbert spaces and infinite-dimensional estimation while also having the practical skills to solve problems in constrained ‘real world’ settings? Anyone who has this combination would be ideally placed to solve the complex problems we are all interested in, whether using simple methods or more sophisticated ones (i.e., it is not just about tools but about knowing when to use what and why). But such people are rare indeed.
For many years now, beginning with some questions that were part of my doctoral dissertation research, I have been curious about multi-level models that describe phenomena and strategies. A fundamental question that arises in this setting is regarding which direction (top-down/bottom-up) takes primacy.
A particular sense in which this directly touches upon my work is in the ability of unsupervised and semi-supervised learning methods to model “everything of interest” in a complex domain (e.g., robotics) so that any detailed analysis of the domain is rendered unnecessary. A claim that is often made is that the entire hierarchy will just emerge from the bottom-up. My own experience with difficult problems such as synthesizing complex humanoid robot behaviours makes me sceptical of the breadth of this claim. I find that, often, the easily available observables do not suffice and one needs to work hard to get the true description. However, I am equally sceptical of the chauvinistic view that the only way to solve problems is to model everything in the domain and dream up a clever strategy, or the defeatist view that the only way to solve the problem is to look at pre-existing solutions somewhere else and copy them. Instead, in my own work, I have searched for a middle ground where one seeks general principles on both ends of the spectrum and tries to tie them together efficiently.
Recently, while searching Google Scholar for some technical papers on multi-level control and learning, I came across an interesting philosophical paper (R.C. Bishop and H. Atmanspacher, Contextual emergence in the description of properties, Foundations of Physics 36(12):1753-1777, 2006) that makes the case that extracting deep organizational principles for a higher level from a purely bottom-up approach is, in a certain sense, a fundamentally ill-posed problem. Even in “basic” areas like theoretical physics one needs more context. Yet, all is not lost. What this really means is that there are some top-down contextual constraints (much weaker than arbitrary rigid prescriptions) that are necessary to make the two mesh together. You will probably have to at least skim the paper to get a better idea, but I think this speaks to the same issue I raise above and says something quite insightful.