Our agent, Edart, at the Trading Agent Competition 2015

My student, Stavros Gerakaris, has been working on applying our multi-agent learning ideas to the domain of Ad Exchanges. He is participating in the Sixteenth Annual Trading Agent Competition (TAC-15), conducted as part of AAMAS-15. His entry, entitled Edart, finished 2nd among this year’s participants and 5th overall (last year’s winners still did better than all of us), which earns us a spot in the finals. If you’d like to know more about the background and setup of this competition, see this paper by Mariano Schain and Yishay Mansour.

People familiar with AMEC/TADA will realise that the main objective of these competitions is to try out our original ideas in a demanding, open-ended domain. In this sense, I am especially pleased that this agent has begun to validate our more theoretical work in the form of the Harsanyi-Bellman Ad-hoc Coordination algorithm, originally developed by Stefano Albrecht, which Stavros is using in a partially observable, censored-observation setting. In due course, this work will appear as a publication, so watch our publications list.

Action Priors and Place Cells

My former student, Benji Rosman, and I worked on an idea that we call action priors – a way for an agent to learn a task-independent domain model of which actions are worth considering when learning a policy for a new sequential decision-making task. This is described in a paper that has recently been accepted to the IEEE Transactions on Autonomous Mental Development.

This has been an interesting project in that, along the way, I have found several unexpected connections. Firstly, I had always been curious about the famous experiment by Herb Simon and collaborators on board memory in chess – does it tell us something of broader relevance to intelligent agents? For instance, watch the clip below, paying special attention to the snippet starting at 1:25.

The kind of mistake he makes – involving only those two rooks – suggests his representation is not just based on compressing his perception (these rooks are by no means the least salient or otherwise least memorable pieces), but is intrinsically driven by the value – along very similar lines, we argue, as action priors.

Subsequently, through a nice kind of serendipity – conversations with my colleague Matt Nolan – I have come to know about some phenomena associated with place cells, as studied by neuroscientists. Apparently, there are open questions about how place cells seem to represent intended destination in goal-directed maze tasks. The question we are now trying to figure out is: does the concept of action priors, and this way of re-representing space, give us a handle on how place cells behave as well?

I’ll give a talk on this topic at the upcoming Spatial Computation Workshop, abstract below.

Priors from learning over a lifetime and potential connections to place and grid cells
– Subramanian Ramamoorthy (joint work with Benjamin Rosman)

An agent tasked with solving a number of different decision making problems in similar environments has an opportunity to learn over a longer timescale than each individual task. Through examining solutions to different tasks, it can uncover behavioural invariances in the domain, by identifying actions to be prioritised in local contexts, invariant to task details. This information has the effect of greatly increasing the speed of solving new problems. We formalise this notion as action priors, defined as distributions over the action space, conditioned on environment state, and show how these can be learnt from a set of value functions.

Applying action priors in the setting of reinforcement learning, using examples involving spatial navigation tasks, we show the benefits of this kind of bias over action selection during exploration. Aggressive use of action priors performs context based pruning of the available actions, thus reducing the complexity of lookahead during search. Additionally, we show how action priors over observation features, rather than states, provide further flexibility and generalisability, with the additional benefit of enabling feature selection.

An alternate interpretation to these priors is as a re-parameterisation of the domain within which the various tasks have been defined. In this sense, these prior distributions bear a resemblance to attributes, such as spatial distribution and firing patterns, of place cells. We conclude by discussing this connection using a variation of the above experiments.
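To make the core recipe concrete, here is a minimal sketch of how one might code up action priors in a tabular setting. This is my own simplification for illustration, with hypothetical function names, not the implementation from the paper: count how often each action is optimal in a state across the value functions of previously solved tasks, smooth the counts into a distribution per state, and then sample exploratory actions from that distribution rather than uniformly.

```python
import numpy as np

def learn_action_priors(value_functions, states, actions, alpha=1.0):
    # value_functions: one Q-table per previously solved task, Q[s][a] -> value
    counts = {s: alpha * np.ones(len(actions)) for s in states}
    for Q in value_functions:
        for s in states:
            best = int(np.argmax([Q[s][a] for a in actions]))
            counts[s][best] += 1.0
    # normalise the Dirichlet-smoothed counts into a distribution per state
    return {s: c / c.sum() for s, c in counts.items()}

def prior_biased_action(Q, prior, s, actions, epsilon=0.1, rng=None):
    # epsilon-greedy, except that exploration samples from the action prior
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return actions[rng.choice(len(actions), p=prior[s])]
    return actions[int(np.argmax([Q[s][a] for a in actions]))]
```

The same counting trick works over observation features instead of states, which is what gives the feature-selection flavour mentioned in the abstract.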

Code Yourself!

My friend, Areti Manataki, is one of the co-organisers of this excellent MOOC on Coursera, entitled “Code Yourself! An Introduction to Programming”. As the blurb on Coursera says, “Have you ever wished you knew how to program, but had no idea where to start from? This course will teach you how to program in Scratch, an easy to use visual programming language. More importantly, it will introduce you to the fundamental principles of computing and it will help you think like a software engineer.”

I like the emphasis on basics, and the desire to reach the broad audience of pre-college children. Many MOOCs I encounter are just college courses recycled. Instead, if MOOCs are to matter, and to matter in the ways they are ambitiously advertised (i.e., in the developing world, and in reaching new students who would not otherwise be served by existing formal programmes), this is the kind of entry point from which I’d expect to see progress.

I made a small contribution to this course, by giving a guest interview about our work with the RoboCup project as a case study. If you go to this course, you’ll find it under Unit 3 as “(Optional Video) Interview on football-playing robots [08:41]”.

On blue sky work…

Useful perspective to keep in mind for the next time one receives unfairly critical comments about speculative work:

Successful research enables problems which once seemed hopelessly complicated to be expressed so simply that we soon forget that they ever were problems. Thus the more successful a research, the more difficult does it become for those who use the result to appreciate the labour which has been put into it. This perhaps is why the very people who live on the results of past researches are so often the most critical of the labour and effort which, in their time, is being expended to simplify the problems of the future.

– Sir Bennett Melvill Jones, British aerodynamicist.

Are you doing what I think you are doing?

This is the title of a paper by my student, Stefano Albrecht, which we have recently submitted to a conference. The core idea is to address model criticism, as opposed to the better-studied problem of model selection, within the multi-agent learning domain.

For Informatics folks, he is giving a short talk on this paper on Friday at noon, to the Agents group (in IF 2.33). The abstract is below.

The key to effective interaction in many multi-agent applications is to reason explicitly about the behaviour of other agents, in the form of a hypothesised behaviour. While there exist several methods for the construction of a behavioural hypothesis, there is currently no universal theory which would allow an agent to contemplate the correctness of a hypothesis. In this work, we present a novel algorithm which decides this question in the form of a frequentist hypothesis test. The algorithm allows for multiple metrics in the construction of the test statistic and learns its distribution during the interaction process, with asymptotic correctness guarantees. We present results from a comprehensive set of experiments, demonstrating that the algorithm achieves high accuracy and scalability at low computational costs.
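To give a feel for the flavour of the problem, here is a rough sketch of one way to test a behavioural hypothesis. This is my own simplification, not the algorithm in the paper: it uses a single log-likelihood metric rather than multiple metrics, and estimates the null distribution by Monte Carlo simulation from the hypothesis itself, rejecting when the observed score is improbably low.

```python
import numpy as np

def log_likelihood(predicted, observed):
    # predicted[t]: hypothesised action distribution at step t
    # observed[t]: index of the action the other agent actually took
    return sum(np.log(p[a] + 1e-12) for p, a in zip(predicted, observed))

def test_hypothesis(predicted, observed, n_sim=1000, level=0.05, seed=0):
    rng = np.random.default_rng(seed)
    score = log_likelihood(predicted, observed)
    # Monte Carlo estimate of the score's distribution if the hypothesis holds
    null_scores = [
        log_likelihood(predicted,
                       [rng.choice(len(p), p=p) for p in predicted])
        for _ in range(n_sim)
    ]
    p_value = float(np.mean([s <= score for s in null_scores]))
    return p_value, p_value >= level   # True means: do not reject
```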

Decision Making when there are Unknown Unknowns

Unknown unknowns are everywhere, a bit like dark matter in the universe. Yet, everything we seem to do in terms of algorithms for learning and inference either assumes a simplified setting that is closed in terms of the hypothesis space (hence not even allowing for these unknown unknowns), or depends on our being able to set up priors so generally expressive that computation is far from tractable. How do real people bridge the gap? We don’t know, of course, but we have started to take a stab at this from a different direction. Together with my colleague, Alex Lascarides, and my former student, Benji Rosman, I have been looking into this issue in a specific setting – that of asking how an agent incrementally grows its model to reach the level of knowledge of a more experienced teacher, while dealing with a world that requires the agent to expand its hypothesis space during the process of learning and inference.

This is very much ongoing work, of the kind wherein we have an idea of where we might like to end up (a lighthouse on the horizon) but only a very limited idea of the way there, and of the rocky shores we’ll need to navigate to get there. For local Informatics folks, a status report on the current state of this work will be given in an upcoming DReaM talk (10th March, 11:30 am in room IF 2.33) – abstract below.

—————————————————————————————-

Decision Making when there are Unknown Unknowns

Alex Lascarides

Joint work with Ram Ramamoorthy and Benji Rosman

Existing approaches to learning how to solve a decision problem all assume that the hypothesis space is known in advance of the learning process. That is, the agent knows all possible states, all possible actions, and also has complete knowledge of his or her own intrinsic preferences (typically represented as a function from the set of possible states to numeric rewards). In most cases, the models for learning how to behave optimally also assume that the probabilistic dependencies among the factors that influence behaviour are known as well.

But there are many decision problems where these high informational demands on learning aren’t met. An agent may have to act in the domain without knowing all possible states or actions, or with only partial and uncertain information about his or her own preferences. And yet if one changes the random variables one uses to represent a decision problem, or one changes the reward function, then this is viewed as a different and unrelated decision problem. Intuitively, one needs a logic of change to one’s decision problem, where change is informed by evidence.

I will present here some relatively half-baked ideas about how to learn optimal behaviour when the agent starts out with incomplete and uncertain information about the hypothesis space: that is, the agent knows there are ‘unknown unknowns’. The model is one where the agent adapts the representation of the decision problem, and so revises calculations of optimal behaviour, by drawing on two sources of evidence: their own exploration of the domain by repeatedly performing actions and observing their consequences and rewards; and dialogues with an oracle who knows the true representation of the decision problem.

Our hypothesis is that an agent that abides by certain defeasible principles for adapting the representation of the decision problem to the evidence learns to converge on optimal behaviour faster than an agent who ignores evidence that his current representation entails the wrong hypothesis space or intrinsic rewards, or an agent who adapts the representation of the decision problem in a way that does not make the defeasible assumptions we’ll argue for here.
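As a toy illustration of the kind of agent this describes (entirely my own sketch, not the model in the talk), consider a tabular Q-learner whose state and action sets are open-ended: states it has never seen, and actions named by an oracle or teacher, are folded into the representation mid-learning rather than invalidating it.

```python
import numpy as np

class ExpandingQAgent:
    """Tabular Q-learning over a hypothesis space that can grow online (toy sketch)."""

    def __init__(self, actions, lr=0.1, gamma=0.95, optimism=1.0):
        self.actions = list(actions)   # the actions currently known to exist
        self.Q = {}                    # state -> value estimates for known actions
        self.lr, self.gamma, self.optimism = lr, gamma, optimism

    def _ensure_state(self, s):
        if s not in self.Q:            # a previously unknown state is observed
            self.Q[s] = self.optimism * np.ones(len(self.actions))

    def add_action(self, a):
        # e.g. the oracle/teacher names an action the agent did not know about
        if a not in self.actions:
            self.actions.append(a)
            for s in self.Q:           # extend every row, optimistically
                self.Q[s] = np.append(self.Q[s], self.optimism)

    def act(self, s, epsilon=0.1, rng=None):
        # returns the index of the chosen action in self.actions
        rng = rng or np.random.default_rng()
        self._ensure_state(s)
        if rng.random() < epsilon:
            return int(rng.integers(len(self.actions)))
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next):
        self._ensure_state(s)
        self._ensure_state(s_next)
        target = r + self.gamma * np.max(self.Q[s_next])
        self.Q[s][a] += self.lr * (target - self.Q[s][a])
```

The interesting questions in the talk are precisely the ones this sketch dodges: when the evidence should trigger such an expansion, and what defeasible principles should govern it.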

Dagstuhl talk – Learning action-oriented symbols

A few months back, I attended a Dagstuhl workshop on Neural-Symbolic Learning and Reasoning – a meeting that tried to bring together people looking at sub-symbolic (primarily, but not exclusively, neural network) and symbolic (i.e., logic-based) learning and reasoning.

I gave a talk on “Learning Action-oriented Symbols: Abstractions over Decision Processes”. This was an attempt at synthesising the ideas behind a couple of different papers we have worked on over the past two years.

The report associated with this workshop has now been released: http://drops.dagstuhl.de/opus/volltexte/2015/4884/. It is an interesting collection of ideas, especially if you follow the links to the primary publications and associated background materials.