**Rich Sutton (May 27, 2019)**

*Open Questions in Model-based Reinforcement Learning*

Reinforcement learning methods that learn a model of the world from their action-observation data, and then plan with that model, may be the next big thing in artificial intelligence. Let us assume a Dyna-like architecture, in which everything---learning, planning, and acting---is done online and continually. There appear to be natural extensions of the original Dyna architecture to stochastic dynamics, function approximation, partial observability, temporal abstraction, and average reward, but there are complications when these extensions are done together, leading to many open questions. Should the model generate samples, or expectations? Should the value function be linear in the state features? If the value function is linear, then should the model be also? How exactly should planning be done with average reward? Should an option model incorporate the average reward estimate? How should planning and state update interrelate? This talk will provide an introduction to these questions.

YouTube

**Shangtong Zhang (May 29, 2019)**

*Generalized Off-Policy Actor-Critic*

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in MuJoCo robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.

YouTube Slides

**Martha Steenstrup (May 30, 2019)**

*Reinforcement Learning for Control of Communications Networks*

A communications network is a dynamic distributed system whose behavior depends on the properties of its component devices, the characteristics of the environment in which it operates, the behavior of its users, and the set of control algorithms that determine where, when, and how data is transported among communicating devices. Most network controllers have been engineered based on assumptions about the behavior of users and the operating environment as well as the effects of interactions among controllers. Moreover, network controllers act in response to information about network state that may be delayed, partial, and noisy and often do so autonomously and asynchronously. Our hypothesis is that network controllers that can autonomously learn effective control policies without innate knowledge of detailed behavioral models or supervisory input, make appropriate decisions under uncertainty, and select useful features from observations of network, user, and environment state will be superior to human-engineered solutions in terms of accuracy, agility, robustness, and efficiency. We are in the early stages of testing this hypothesis on a particular network control problem: congestion control. In this talk, we introduce the congestion control problem, identify the challenges to developing effective reinforcement-learning-based solutions, and argue that communications networks are an ideal domain in which to probe the limits of reinforcement learning.

YouTube

**Harm van Seijen (June 3, 2019)**

*Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning*

YouTube

**Kris De Asis (June 5, 2019)**

*Finite-horizon Temporal Difference Methods*

Reinforcement learning problems are often concerned with an arbitrarily long discounted sum of rewards, denoted as the return. In episodic settings, it is the sum of rewards from a given point until the end of the episode, and in continuing settings, it is an infinite sum of rewards. The discount rate in the return is typically used to specify a horizon of interest, as it decays the weight given to distant rewards in the sum. We refer to the sum up until the very end (or infinity) as the infinite-horizon return. An appeal of the infinite-horizon return is that it can be expressed recursively, giving rise to algorithms which try to find the fixed-point of self-satisfying relationships. Another appeal is that the complexity of these algorithms does not vary with how far an agent tries to predict. In this talk, we look at estimating finite-horizon returns, where the sum is only considered up to a fixed number of steps into the future. We explore the space of finite-horizon temporal difference methods, including one which scales sub-linearly with horizon. We note properties of the resulting algorithms, and argue some benefits over the infinite-horizon setting in both prediction and control.
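
As an illustration of the finite-horizon idea (a minimal tabular sketch, not the speaker's code), one value table can be kept per horizon, with the h-step estimate bootstrapping from the (h-1)-step estimate of the next state. The function name, the data layout, and the three-state chain are assumptions made for this example. This simple version costs time linear in the horizon per transition, so it is not the sub-linear variant mentioned in the abstract.

```python
def fixed_horizon_td(episodes, horizon, gamma=1.0, alpha=0.1):
    # values[h][s] estimates the return over the next h steps from state s.
    # values[0] is never updated, so the 0-step return stays at zero.
    values = [dict() for _ in range(horizon + 1)]
    for episode in episodes:
        for state, reward, next_state in episode:
            for h in range(1, horizon + 1):
                # Bootstrap from the (h-1)-step estimate; a terminal next state is None.
                if next_state is None:
                    bootstrap = 0.0
                else:
                    bootstrap = values[h - 1].get(next_state, 0.0)
                old = values[h].get(state, 0.0)
                values[h][state] = old + alpha * (reward + gamma * bootstrap - old)
    return values


# Hypothetical three-state chain 0 -> 1 -> 2 -> terminal with reward 1 per step.
chain_episode = [(0, 1.0, 1), (1, 1.0, 2), (2, 1.0, None)]
values = fixed_horizon_td([chain_episode] * 500, horizon=5)
```

In this chain the 2-step estimate from state 0 approaches 2, while the 5-step estimate approaches 3 because the episode ends after three rewards.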

YouTube Slides

**Yi Wan (June 6, 2019)**

*Planning with Expectation Models*

This talk discusses the expectation model for model-based reinforcement learning. I will formalize its definition, compare it with other model choices, discuss its different parameterization choices, and propose a new gradient-based Dyna-style planning algorithm, which has a convergence guarantee even if the approximate expectation model is flawed.

YouTube Slides

**Andy Patterson (June 10, 2019)**

*Importance Sampling Ratio Placement for Gradient-TD Methods*

Using the importance sampling ratio to correct samples in off-policy learning can lead to high variance updates and unstable learning. We can reduce the variance in off-policy learning by altering the placement of the importance sampling ratio correction term within the update. In this talk, I will discuss the importance sampling ratio correction placement within the updates for Gradient-TD algorithms. I will show analytically that using the importance sampling ratio to correct the entire TD error term can be viewed as a control variate form of the alternative placement strategy, using the importance sampling ratio to correct only the temporal difference target. Finally, I will empirically support these insights on a simple domain, demonstrating increased stability and faster learning for the control variate form of the importance sampling ratio placement strategy.
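
The two placements can be compared on a toy case. The sketch below is a simplification that uses a plain tabular TD error for a single state rather than a full Gradient-TD update; the function name and all numbers are assumptions chosen for illustration. With ratio `rho`, target `r + gamma * v_next`, and current estimate `v`, the two corrected errors differ by `(rho - 1) * v`, which has zero mean under the behaviour policy because the expected ratio is 1.

```python
def placement_stats(pi, b, rewards, next_values, v, gamma=0.9):
    """Exact mean and variance, under behaviour policy b, of two corrected TD errors."""
    full, target_only = [], []
    for action, prob_b in enumerate(b):
        rho = pi[action] / prob_b
        td_target = rewards[action] + gamma * next_values[action]
        # Ratio applied to the entire TD error.
        full.append((prob_b, rho * (td_target - v)))
        # Ratio applied to the TD target only.
        target_only.append((prob_b, rho * td_target - v))

    def mean_var(weighted):
        mean = sum(p * x for p, x in weighted)
        var = sum(p * (x - mean) ** 2 for p, x in weighted)
        return mean, var

    return mean_var(full), mean_var(target_only)


# Hand-picked example: two actions, and an estimate v already close to the target.
(full_mean, full_var), (target_mean, target_var) = placement_stats(
    pi=[0.9, 0.1], b=[0.5, 0.5],
    rewards=[1.0, 1.0], next_values=[10.0, 10.0], v=10.0,
)
```

Both placements have the same expected value here, and in this particular example correcting the whole TD error gives the lower variance. The size of that gap depends on the chosen numbers.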

YouTube Slides

**Adam Parker (June 12, 2019)**

*Human-Machine Learning Interactions*

Humans are increasingly interacting with not only machines, but machine-learning systems. Thus, there is an increasing need to think about how those interactions will take place and how interactions can be facilitated smoothly and effectively. In rehabilitation science in particular, strong emphasis is placed on human-human and human-technology interaction. Clinicians of all sorts regularly address the needs of their human patients using the “person first” mentality. Using this same person-first viewpoint that is growing in rehabilitation science, along with prosthetic limbs as an example rehabilitation robot, we can think about machine-learning problems in new ways. This talk will therefore discuss the reasons for, and benefits of, viewing communication with another agent as actions. At least initially this does not require modifications of current machine-learning methods, but changes in how we think about problems. Rather than learning about the world and how to interact with it, the system can be thought of as learning about the user and how to interact with them. Previous studies on emergent communication have shown machine-learning agents that determine how to succeed at a task by communicating with each other without specific knowledge of language. There is also a body of research on human-human interaction, termed joint action, which highlights the importance of understanding the other agent in a shared task, as well as non-verbal communication. These ideas are of value to the field of machine learning and can be used to develop the interactions between humans and machine-learning agents.

YouTube Slides

**Varun Bhatt (June 13, 2019)**

*Training Multiple Intelligent Agents to Communicate*

Communication is one of the key aspects of human survival. Humans rely extensively on communication both to learn quickly and to act efficiently in environments featuring cooperation. As artificial intelligence becomes commonplace in the real world, intelligent agents need to be able to communicate with other learning agents and with humans. In this talk, I will motivate the need for agents to communicate and talk about the current methods and some of the challenges. In particular, I will focus on the issue of credit assignment, since it is not obvious whether the reward was the result of a poor signal from the sender or of poor actions taken by the receiver.

YouTube Slides

**Khurram Javed (June 17, 2019)**

*Meta-Learning Representations for Continual Learning*

A continual learning agent should be able to build on top of existing knowledge to learn on new data quickly while minimizing forgetting. Current intelligent systems based on artificial neural network function approximators arguably do the opposite: they are highly prone to forgetting and rarely trained to facilitate future learning. The primary reason for this poor behavior is the catastrophic interference problem plaguing neural networks.

In this talk, I will first give a structured overview of the interference problem in neural networks and identify three factors that, when combined, result in catastrophic interference. I will then present my work done in collaboration with Martha White on learning representations that accelerate future learning and are robust to forgetting under online updates in continual learning. The core idea is to use gradient-based meta-learning to treat forgetting as a training signal for learning a representation. I will show that it is possible to learn representations that are more effective for online updating and that sparsity naturally emerges in these representations. Finally, I will demonstrate that a basic online updating strategy with our learned representation is competitive with existing experience-replay-based methods for continual learning.

YouTube Slides

**Patrick Pilarski (June 19, 2019)**

*The Role of Prediction in Joint Action*

YouTube Slides

**Andrew Jacobsen (June 20, 2019)**

*A Value Function Basis for Multi-step Prediction*

The ability to make accurate predictions about future observations constitutes a fundamental form of awareness and understanding of one's surroundings. An autonomous agent which can make accurate predictions about various aspects of its own sensory-motor stream possesses a valuable tool for reasoning about its environment. In an ideal world, agents could learn to predict all aspects of their sensory-motor stream, but in practice there are limitations on the number of things that can be learned in real time. In this talk, we will discuss how an agent can leverage a small set of learned General Value Function (GVF) predictions to accurately infer the answers to a wide variety of questions about its sensory-motor stream. I will show experimental results in which we accurately infer hundreds of GVF predictions and multi-step predictions about the sensor readings of a mobile robot, using only a small set of learned GVFs. Finally, I will talk about a simple case in which all expected future observations can be completely characterized in terms of a basis of GVF predictions.

Slides

**Martha White (June 24, 2019)**

*Some Thoughts About Learning Predictions Online*

In this talk, I will discuss some of what my group has realized about learning neural networks and representations online, as well as some of my running hypotheses. The goal of the talk is to (a) make concrete statements about conventional wisdom, which is not always written down; (b) clarify some misconceptions; and (c) facilitate a discussion on this topic that is relevant to many in ML and AI at the UofA.

YouTube Slides

**Matthew Schlegel (June 26, 2019)**

*Importance Resampling*

Importance sampling (IS) is a common reweighting strategy for off-policy prediction in reinforcement learning. While importance sampling is consistent and unbiased, it can result in high-variance updates to the weights for the value function. One can consider a resampling strategy as an alternative to reweighting, avoiding the use of importance sampling ratios in the update. This talk will introduce importance resampling and provide high-level conclusions of its use in off-policy prediction through empirical results. Further details can be found in the arXiv paper https://arxiv.org/abs/1906.04328, currently in submission to NeurIPS.
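
The contrast between reweighting and resampling can be sketched on a small buffer. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the buffer contents, and the use of precomputed TD errors are all hypothetical. Transitions are drawn with probability proportional to their ratios and then used in an update that contains no ratio. The expected resampled update equals the importance-weighted average divided by the buffer's mean ratio.

```python
import random


def resample_indices(ratios, k, rng):
    """Draw k buffer indices with probability proportional to the ratios."""
    return rng.choices(range(len(ratios)), weights=ratios, k=k)


def expected_resampled_update(ratios, td_errors):
    """Exact expectation of an unweighted update on one resampled transition."""
    total = sum(ratios)
    return sum(rho / total * delta for rho, delta in zip(ratios, td_errors))


def importance_weighted_update(ratios, td_errors):
    """Average of ratio-weighted updates over a uniformly sampled buffer."""
    return sum(rho * delta for rho, delta in zip(ratios, td_errors)) / len(ratios)


# Hypothetical buffer of four transitions.
ratios = [0.2, 1.8, 1.0, 0.5]
td_errors = [1.0, -2.0, 0.5, 3.0]
mean_ratio = sum(ratios) / len(ratios)

resampled = expected_resampled_update(ratios, td_errors)
weighted = importance_weighted_update(ratios, td_errors)
batch = resample_indices(ratios, k=8, rng=random.Random(0))
```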

YouTube Slides

**Johannes Gunther (June 27, 2019)**

*General Dynamic Neural Networks for explainable PID parameter tuning in control engineering*

The application of machine learning algorithms in real-world scenarios often results in new and unexpected requirements. While meeting these additional requirements can be challenging, the additional constraints might also provide us with opportunities to evaluate our algorithms within the new context. This talk will give an example of how machine learning can be used in one such environment, namely control engineering. As a field with a high need for safety and explainability, we must extend our analysis beyond simple performance measures and include aspects like (system) stability and explainability. To that end, we examine the utility of extending standard PID controllers with recurrent neural networks—namely, General Dynamic Neural Networks (GDNN). We show that GDNN PID controllers perform well on a range of complex control systems and highlight what is needed to make them a stable, scalable, and interpretable option for modern control systems. As a second main contribution of this work, we address the Achilles heel that prevents neural networks from being used in real-world control processes so far: lack of interpretability. We use bounded-input bounded-output stability analysis to evaluate the parameters suggested by the neural network, thus making them understandable for human engineers. This combination of rigorous evaluation paired with better explainability is an important step towards the acceptance of neural-network-based control approaches for real-world systems.

YouTube

**Valliappa Chockalingam (July 3, 2019)**

*The Role of Interest in Prediction and Control*

By emphasizing and de-emphasizing updates in a particular way, Emphatic Temporal Difference (ETD) methods have been proposed as a way toward stable, convergent off-policy prediction. In the derivation of ETD, the thought-provoking yet largely unexplored idea of an interest function was also introduced. In this talk, I will discuss some empirical experiments with ETD and particularly focus on the role interest functions can play in the prediction and control problem settings.

YouTube

**Hado van Hasselt (July 4, 2019)**

*When to use parametric models in reinforcement learning?*

We examine the question of when and how parametric models are most useful in reinforcement learning. In particular, we look at commonalities and differences between parametric models and experience replay. Replay-based learning algorithms share important traits with model-based approaches, including the ability to plan: to use more computation without additional data to improve predictions and behaviour. We discuss when to expect benefits from either approach, and interpret prior work in this context. We hypothesise that, under suitable conditions, replay-based algorithms should be competitive with or better than model-based algorithms if the model is used only to generate fictional transitions by planning forward from observed states for an update rule that is otherwise model-free. We validate this hypothesis and attain state-of-the-art data efficiency on the Atari 2600 benchmark domain. We also discuss other ways to use parametric models: for planning backward rather than forward for credit assignment, and for planning forward for behaviour rather than for credit assignment. We hypothesise why we believe these ways of planning may be more effective, and validate these hypotheses empirically.

Slides

**Paniz Behboudian (July 15, 2019)**

In many cases, reinforcement learning (RL) algorithms may take very long to converge, especially if the reward function is sparse. The idea of reward shaping was introduced to speed up learning by augmenting the original MDP’s reward function R with an arbitrary reward function F. In this talk, I will discuss different ways of reward shaping and address possible problems with these approaches.
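
One well-known special case is potential-based shaping, where F(s, s') = gamma * Phi(s') - Phi(s) for some potential function Phi. The sketch below is an illustration of that special case only, not necessarily one of the approaches covered in the talk; the function names, the trajectory, and the potential are assumptions. Along any trajectory the discounted shaping terms telescope, so the shaped return differs from the original return by gamma^T * Phi(s_T) - Phi(s_0).

```python
def shaped_rewards(states, rewards, potential, gamma):
    """Add F = gamma * potential(s') - potential(s) to each reward.

    states holds s_0, ..., s_T and so has one more entry than rewards.
    """
    return [
        reward + gamma * potential(next_state) - potential(state)
        for state, reward, next_state in zip(states, rewards, states[1:])
    ]


def discounted_return(rewards, gamma):
    return sum(gamma ** t * reward for t, reward in enumerate(rewards))


# Hypothetical sparse-reward trajectory with the state index as the potential.
gamma = 0.9
states = [0, 1, 2, 3]
rewards = [0.0, 0.0, 1.0]
potential = float

original = discounted_return(rewards, gamma)
shaped = discounted_return(shaped_rewards(states, rewards, potential, gamma), gamma)
telescoped = gamma ** len(rewards) * potential(states[-1]) - potential(states[0])
```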

Slides

**Csaba Szepesvari (July 17, 2019)**

*Politex -- towards stable and efficient reinforcement learning algorithms*

Experimental evidence suggests that DQN and other model-free reinforcement learning (RL) algorithms that use neural networks to represent value functions can find good policies for various challenging control problems. Yet, these algorithms are also often found to be unstable and can be difficult to tune. In this talk I will explain the ideas behind Politex, a new, simple, yet effective model-free RL method that comes with novel theoretical guarantees. The main insight is that model-free algorithms can be made more reliable/stable if the policy improvement step is based on the average of past action-value functions instead of using only the most recent action-value function. The theoretical results are also backed up by preliminary experimental evidence. While more experiments are needed, the simple, intuitive idea behind Politex backed up with strong theoretical guarantees makes me believe that this is a step in the right direction to get to stable and scalable model-free methods.
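
The main insight can be sketched for a single state. This is a minimal illustration, assuming the improvement step is a softmax (Boltzmann) policy over the sum of past action-value estimates; the function name, the temperature parameter `eta`, and the data layout are assumptions made for the example and are not taken from the talk.

```python
import math


def politex_policy(past_q_values, eta):
    """Softmax policy over accumulated action values for one fixed state.

    past_q_values is a list with one entry per past iteration; each entry
    is a list of action-value estimates indexed by action.
    """
    num_actions = len(past_q_values[0])
    summed = [sum(q[a] for q in past_q_values) for a in range(num_actions)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(summed)
    weights = [math.exp(eta * (x - top)) for x in summed]
    normaliser = sum(weights)
    return [w / normaliser for w in weights]


# Hypothetical history: the latest estimate alone would not change the ranking,
# but a single noisy iteration (the second one) is averaged out by the others.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
policy = politex_policy(history, eta=1.0)
```

Using the sum rather than the mean only rescales `eta`, so the two readings describe the same family of policies.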

Slides

**Parash Rahman (July 18, 2019)**

*Using Animal Experiments to Design Reinforcement Learning Environments*

This talk will summarize experiments that compare visual discrimination ability between macaques and humans. This talk will then show how these psychological experiments can be used to design computational control and prediction environments with an emphasis on physical and biological fidelity to the experiments. With faithful computational environments, we can better compare our computational agent performance to biological agent performance.

Slides

**Joseph Jay Williams (July 22, 2019)**

*Combining Reinforcement Learning & Human Computation for A/B Experimentation: Perpetually Enhancing and Personalizing User Interfaces*

How can we transform the everyday technology people use into intelligent, self-improving systems? I consider how to dynamically enhance user interfaces by using randomized A/B experiments to integrate Active Learning algorithms with Human Computation. Multiple components of a user interface (e.g. explanations, messages) can be crowdsourced from users, and then compared in real-world A/B experiments, bringing human intelligence into the loop of system improvement. Active Learning algorithms (e.g. multi-armed bandits) can then analyze data from A/B experiments in order to dynamically provide more effective A or B conditions to future users. Active Learning can also lead to personalization, by facing the more substantive exploration-exploitation tradeoff of discovering whether some conditions work better for certain subgroups of user profiles (in addition to discovering what works well on average).

I present an example system, which crowdsourced explanations for how to solve math problems from students and teachers, simultaneously conducting an A/B experiment to identify which explanations other students rated as being helpful. Modeling this as a multi-armed bandit where the arms were constantly increasing (every time a new explanation was crowdsourced), we used Thompson Sampling to do real-time analysis of data from the experiment, providing higher rated explanations to future students (LAS 2016, CHI 2018). This generated explanations that helped learning as much as those of a real instructor. Future work aims to discover how to personalize explanations in real-time, by discovering which conditions work for different subgroups of user profiles (such as whether simple vs complex explanations are better for students with different levels of prior knowledge or verbal fluency).

Future collaborative work with statistics and machine learning researchers provides a testbed for a wide range of active learning algorithms to do real-time adaptation of A/B experiments, and integrate with different crowdsourcing workflows. Dynamic A/B experiments can be used to enhance and personalize a broad range of user-facing systems. Examples include modifying websites, tailoring email campaigns, enhancing lessons in online courses, getting people to exercise by personalizing motivational messages in mobile apps, and discovering which interventions reduce stress and improve mental health.
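
A bandit whose set of arms grows over time can be sketched with Beta-Bernoulli Thompson Sampling. This is a hedged illustration, not the system described above: it assumes each rating is a binary helpful/unhelpful signal, and the class name, the simulated helpfulness rates, and the schedule for adding a new explanation are all hypothetical.

```python
import random


class GrowingThompsonBandit:
    """Beta-Bernoulli Thompson Sampling over a set of arms that can grow."""

    def __init__(self, rng):
        self.rng = rng
        self.successes = []
        self.failures = []

    def add_arm(self):
        # A new arm starts from a uniform Beta(1, 1) prior.
        self.successes.append(0)
        self.failures.append(0)
        return len(self.successes) - 1

    def select(self):
        # Sample a plausible helpfulness rate per arm and pick the largest.
        draws = [
            self.rng.betavariate(s + 1, f + 1)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, helpful):
        if helpful:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


# Simulation with made-up helpfulness rates; a third explanation arrives later.
rng = random.Random(0)
true_rates = [0.2, 0.5, 0.8]
bandit = GrowingThompsonBandit(rng)
bandit.add_arm()
bandit.add_arm()
pulls = [0, 0, 0]
for step in range(3500):
    if step == 500:
        bandit.add_arm()  # a newly crowdsourced explanation becomes available
    arm = bandit.select()
    pulls[arm] += 1
    bandit.update(arm, rng.random() < true_rates[arm])
```

Because a new arm starts with a wide posterior, it is tried often at first and keeps being shown only if its ratings hold up.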

Speaker Bio: Joseph Jay Williams is an Assistant Professor in Computer Science at the University of Toronto. He was previously an Assistant Professor at the National University of Singapore's School of Computing in the department of Information Systems & Analytics, a Research Fellow at Harvard's Office of the Vice Provost for Advances in Learning, and a member of the Intelligent Interactive Systems Group in Computer Science. He completed a postdoc at Stanford University in Summer 2014, working with the Office of the Vice Provost for Online Learning and the Open Learning Initiative. He received his PhD from UC Berkeley in Computational Cognitive Science, where he applied Bayesian statistics and machine learning to model how people learn and reason. He received his B.Sc. from University of Toronto in Cognitive Science, Artificial Intelligence and Mathematics, and is originally from Trinidad and Tobago.

Slides

**Sarath Chandar (July 24, 2019)**

*On Learning Long-term Dependencies in Recurrent Neural Networks*

In a multi-step prediction problem, prediction at each time step can be dependent on the input at any of the previous time steps far in the past. Modeling such long-term dependencies is one of the fundamental problems in machine learning. While Recurrent Neural Networks (RNNs) can, in theory, model any long-term dependency, in practice, they can only model short-term dependencies due to the problem of vanishing gradients. In this talk, I will explore the problem of vanishing gradients in recurrent neural networks and propose new solutions to tackle it.

Speaker Bio: Sarath Chandar is a final year Ph.D. candidate at Mila, University of Montreal, working with Yoshua Bengio and Hugo Larochelle. He is starting as an Assistant Professor at Polytechnique Montreal and Mila from Fall 2019. His research interests lie at the intersection of Deep Learning, Natural Language Processing, and Reinforcement Learning. His work includes solutions for various fundamental problems in recurrent neural networks and memory augmented neural networks. He also works on several applications in natural language processing, including question answering and dialogue systems.
Sarath is a recipient of the IBM Ph.D. Fellowship 2018-2020 and FQRNT PBEEE scholarship 2016-2018. Sarath has spent time at IBM Research, Twitter, and Google Brain as a research intern. Sarath has co-organized workshops on reinforcement learning and lifelong learning at leading venues like ICML, IJCAI, and RLDM. Sarath has given tutorials on multilingual representation learning at NAACL 2016 and memory augmented neural networks at EMNLP 2017.

**David Silver (July 25, 2019)**

*Reinforcement Learning with Value Equivalent Models*

In this talk I will introduce the value equivalence principle: the idea that a model is equivalent to the real world, for all useful purposes, if the value predictions made using that model match their true values in the real environment. I will show that the value equivalence principle underlies many existing approaches to model-based RL, from planning with linear value functions and expectation models, to the predictron, value prediction networks, and value iteration networks, and perhaps also opens up new ways to think about model-based RL.

**Yoshua Bengio (August 1, 2019)**

*Learning High-Level Representations for Agents*

YouTube

**Alex Lewandowski (August 7, 2019)**

*Learning and using the return distribution in off-policy policy optimization*

In off-policy policy optimization, we usually optimize the expected return of a policy given trajectories from another policy. Due to REINFORCE and importance correction, the variance of the gradient can be large. For this talk, I will outline different ways of estimating the return distribution while learning the target policy. Using these estimates of the return distribution, we can leverage objectives from supervised learning like the KL divergence, in addition to all-action policy gradients. This formulation avoids REINFORCE and I show analytically that it can reduce variance. I will also demonstrate these results empirically on a simple but challenging problem: cartpole with a uniformly random behavior policy.

YouTube Slides

**Katya Kudashkina (August 8, 2019)**

*Model-Based Actor-Critic for an Interactive Dialogue Task*

Interactive dialogue systems are becoming paramount to the lives of millions of people who use voice-assistive devices on a daily basis. Yet, further advances are limited by the availability of data and the cost of acquiring new samples. We present a model-based reinforcement learning algorithm for interactive dialogue, specifically for voice-based document editing. We build on commonly used actor-critic methods, yielding a model that augments a learning agent and allows understanding dialogue dynamics. Our results show that our algorithm requires 70× fewer samples than the commonly used model-free architecture, and demonstrates 2× better performance asymptotically.

YouTube

**Banafsheh Rafiee (August 12, 2019)**

There seem to be three challenges when it comes to constructing the state: (1) how to build a representation of the world suitable for learning; (2) how to learn if our observations do not carry enough information; and (3) how to handle non-linear functions in linear representation systems. All these challenges involve the state-update function. In this talk, I introduce a simple problem that may be suitable for investigating the state-update function: a trace conditioning testbed.

YouTube Slides

**Shibhansh Dohare (August 14, 2019)**

*What is the state representation problem?*

In this talk, I will try to explain some of the nuances of the state representation problem and what makes it so hard. I'll define various types of states from the environment states to the agent state and everything in between. I'll then explain what will be some of the desirable properties for our solution.

Slides

**Abhishek Naik (August 15, 2019)**

*Discounting – does it make sense?*

In continuing problems, a discount factor is commonly used to ensure that the potentially-infinite return per state is a finite number. In this talk, we will discuss how this problem setting is problematic, and how the average reward formulation is a viable alternative.

YouTube

**Eric Graves (August 19, 2019)**

*Visit Distribution Corrections: A lower-variance approach to off-policy learning*

Off-policy Reinforcement Learning allows an agent to learn from the experience of others, improve data efficiency, and learn offline when safety is critical. However, it is often described as having inherent variance issues. In this talk I’m going to present an alternative approach to off-policy corrections that doesn’t have these variance issues, thereby showing that they are due to the conventional solution method and not the problem itself.

YouTube Slides

**Raksha Kumaraswamy (August 21, 2019)**

*State Representations for Metrics in RL*

In this talk, I will discuss the State Representation problem from the perspective of Metric Learning, and introduce a general supervised solution framework to learn representation spaces which reflect the metrics in the supervision space. Following this, I will highlight additional desirable properties that these spaces should have, along with methods to enforce them.

Slides

**Yangchen Pan (August 22, 2019)**

*An implicit function learning approach for regression*

Regression algorithms typically require some probabilistic assumptions on the conditional distribution of the target (response) variable given an input, which can restrict the utility of an algorithm due to different underlying distributions of datasets. It is particularly troublesome in the case where the underlying mapping being learned is multi-valued (i.e., the conditional distribution of target given input is multimodal). We propose a novel and simple algorithm by implicit function learning, where the mapping from the input to target is implicitly learned.

**Yash Satsangi (August 26, 2019)**

Deep reinforcement learning (RL) methods enable an agent to learn a control policy directly from its experience. Traditionally, the reward in a deep RL task is modeled as a function of the state-action pairs. However, many real-world tasks such as active perception require the reward to be a function of the belief of the agent, for example, a measure of the uncertainty. Generally, a deep RL agent does not maintain an explicit belief and instead learns a representation of it from the experience of the agent, which hinders the application of deep RL methods for active perception tasks. In this talk, we present deep anticipatory networks (DANs) that allow an agent to take actions to reduce its uncertainty without maintaining an explicit belief about the world. A DAN agent consists of a deep Q network (Q network) and a model network (M network). The role of the Q network is to take the sensory actions based on which the M network predicts the state of the world. The two networks are trained simultaneously so that the Q network is rewarded for taking sensory actions that help the M network predict the state of the world. We present theoretical results that provide the insight that such a network in principle leads to maximization of the long-term information gain of the agent. We present multiple real-life applications of DANs for building a sensor selection system for tracking people in a shopping mall and as discrete models of attention on MNIST digit classification.

Slides

**Alexandra Kearney (August 28, 2019)**

*Is This a Good General Value Function? A tragedy in three parts*

Constructing and maintaining knowledge of the world is a central problem for Artificial Intelligence research. Predictive approaches to constructing an agent’s knowledge of its environment have received increasing amounts of interest in recent years. A particularly promising collection of research centers itself around architectures that formulate predictions as General Value Functions (GVFs), an approach commonly referred to as predictive knowledge. A pernicious challenge for predictive knowledge architectures is determining what to predict. In this talk, we take a look at evaluation methods typically associated with General Value Functions and explore how common evaluation metrics can mislead us in determining what to predict.

*The 2019 tea time talks were coordinated by Sheila Schoepp (sschoepp AT ualberta DOT ca).*