**Rich Sutton (June 8, 2020)**

*Are You Ready to Fully Embrace Approximation?*

Approximation that scales with computational resources is what drives modern machine learning. The steady drumbeat of Moore’s law enables successes, such as those of deep learning and AlphaGo, that depend on scalable approximation, and will continue to do so for the foreseeable future. Are we ready to be part of this future? Fully embracing approximation imposes a challenging discipline under which we must do without so much of what reinforcement learning takes for granted, including:

- optimal policies
- the discounted control objective
- Markov state, and therefore:
- all probabilities and expectations
- all true value functions
- the mean square Bellman error
- the mean square value error
- convergence to anything
- off-line learning
- mapping from environment state to feature vectors.

YouTube

**Csaba Szepesvari (June 9, 2020)**

*Embracing approximation in RL: The perspective of a theorist*

In his TTT talk, Rich brought up interesting points about what one may need to give up by fully embracing approximation. Approximations, needless to say, are central to everything that we do in RL and also play a major role in computer science. In this talk, I will discuss what results are already available and also my take on how to pursue meaningful research goals in RL when you have no choice but to fully embrace approximation.

YouTube

**Patrick Pilarski (June 10, 2020)**

*On Time*

Time is fundamental to reinforcement learning. The literature to date has described many ways that animals and machines use aspects of the flow of time and temporal patterns to make predictions, inform decisions, process past experiences, and plan for the future. In this talk, I will begin with a survey of how and why agents perceive and represent time, as selected from the animal learning and neuroscience literature. I will then suggest what I think is a desirable set of time-related abilities for machine agents to acquire, demonstrate, and master as they continually interact with the environment around them.

YouTube

**Martha White (June 11, 2020)**

*Policy Gradient Methods as Approximate Policy Iteration: Advantages and Open Questions*

Many policy gradient methods can be well thought of as approximate policy iteration (API). This view is not new, but new questions arise under function approximation, when using parameterized policies. I will explain the interpretation of policy gradient methods as API, where the policy update corresponds to an approximate greedification step. This policy update can be generalized by considering other choices for greedification. I'll provide a few insights we have, both empirically and theoretically, about good choices for this approximate greedification. Many open questions remain; I hope for us to have a discussion about this API view of policy gradient methods and these open questions.

YouTube

**Roshan Shariff (June 15, 2020)**

*Efficient Planning in Large MDPs with Weak Linear Function Approximation*

Large-scale Markov decision processes (MDPs) require planning algorithms with runtime independent of the number of states of the MDP. We solve the planning problem in MDPs using linear value function approximation with only weak requirements: low approximation error for the optimal value function, and a small set of “core” states whose features span those of other states. In particular, we make no assumptions about the representability of policies or value functions of non-optimal policies. Our algorithm produces almost-optimal actions for any state using a generative oracle (simulator) for the MDP, while its computation time scales polynomially with the number of features, core states, and actions and the effective horizon.

YouTube

**Matt Taylor (June 16, 2020)**

*Assisting RL with External Information: Can you Help an Agent Out?*

When we think about deploying RL algorithms to real-world problems, we often can’t generate years’ worth of data with a simulator (e.g., OpenAI Five). What is an agent to do? This talk will highlight some of the ways we can leverage existing data, existing agents, and people, to motivate why we should be thinking about helping our agent succeed as quickly as possible. I’ll also be interested in the audience’s take on whether I am doomed to relearn my own Bitter Lesson.

YouTube

**Martin Mueller (June 17, 2020)**

*Some Old Ideas about Search and Learning*

YouTube

**Sina Ghiassian (June 18, 2020)**

*Gradient Temporal-Difference Learning with Regularized Corrections*

**Kris De Asis (June 22, 2020)**

*Inverse Policy Evaluation for Value-based Decision-making*

In this work, we explore inverse policy evaluation, the process of solving for a likely policy given a value function, as a method for deriving behavior from a value function.

YouTube

**Andy Patterson (June 23, 2020)**

*Objective Function Geometry for Learning Values*

A brief discussion of the distribution of prediction error when learning value functions that minimize a few popular objective functions in RL. Some familiarity with Chapter 11 of Sutton and Barto will be useful. Much of the talk will be derived from Chapter 11 of the textbook, Schoknecht 2003, and Scherrer 2010.

YouTube

**Junfeng Wen (June 24, 2020)**

*Batch Stationary Distribution Estimation*

We consider the problem of approximating the stationary distribution of an ergodic Markov chain given a set of sampled transitions. Classical simulation-based approaches assume access to the underlying process so that trajectories of sufficient length can be gathered to approximate stationary sampling. Instead, we consider an alternative setting where a fixed set of transitions has been collected beforehand, by a separate, possibly unknown procedure. The goal is still to estimate properties of the stationary distribution, but without additional access to the underlying system. We propose a consistent estimator that is based on recovering a correction ratio function over the given data. In particular, we develop a variational power method (VPM) that provides provably consistent estimates under general conditions. In addition to unifying a number of existing approaches from different subfields, we also find that VPM yields significantly better estimates across a range of problems, including queueing, stochastic differential equations, post-processing MCMC, and off-policy evaluation.
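For orientation, the classical power method that VPM's name alludes to can be sketched as follows. This is not VPM itself: the sketch assumes the transition matrix `P` is fully known, which is exactly the access the batch setting above lacks. The two-state chain and the function name `stationary_distribution` are hypothetical.

```python
def stationary_distribution(P, iters=1000):
    """Classical power iteration: repeatedly apply mu <- mu P."""
    n = len(P)
    mu = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


# Hypothetical two-state ergodic chain; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu = stationary_distribution(P)
```

For this chain the balance condition `mu[0] * 0.1 == mu[1] * 0.5` gives the stationary distribution (5/6, 1/6), which the iteration approaches geometrically.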

YouTube

**Vincent Liu (June 25, 2020)**

*Towards a practical measure of interference for reinforcement learning*

Catastrophic interference is common in many network-based learning systems, and many proposals exist for mitigating it. But before we overcome interference, we must understand it better. In this work, we provide a definition of interference for control in reinforcement learning. We systematically evaluate our new measure by assessing its correlation with several measures of learning performance, including stability, sample efficiency, and online and offline control performance, across a variety of learning architectures. Our new interference measure allows us to ask novel scientific questions about commonly used deep learning architectures. In particular, we show that target network frequency is a dominating factor for interference, and that updates on the last layer result in significantly higher interference than updates internal to the network. This new measure can be expensive to compute; we conclude with motivation for an efficient proxy measure and empirically demonstrate that it is correlated with our definition of interference.

**Negar Hassanpour (June 29, 2020)**

*Counterfactual Reasoning in Observational Studies*

In this talk, I will present some ways to address certain critical challenges associated with counterfactual reasoning for causal effect estimation.

YouTube

**Yangchen Pan (June 30, 2020)**

*An implicit function learning approach for parametric modal regression*

For multi-valued functions, such as when the conditional distribution on targets given the inputs is multi-modal, standard regression approaches are not always desirable because they provide the conditional mean. Modal regression algorithms address this issue by instead finding the conditional mode(s). Most, however, are nonparametric approaches and so can be difficult to scale. Further, parametric approximators, like neural networks, facilitate learning complex relationships between inputs and targets. We propose a parametric modal regression algorithm. We use the implicit function theorem to develop an objective for learning a joint function over inputs and targets.

**Rory Dawson (July 6, 2020)**

*Adaptive Switching for Improved Control of Robotic Prostheses*

We have developed a method called Adaptive Switching that uses real-time predictions from general value functions to improve control of upper limb prostheses. In this talk we will summarize our previous work in this area and also suggest some ideas for future work and additional application areas that may benefit from the technique.

YouTube

**Connor Stephens (July 7, 2020)**

*How to Sample When No One's Watching: Open Problems in Bandit Exploration*

I'll discuss an online learning setting in which an agent adaptively samples rewards from a finite set of distributions, but is only concerned with the reward from their final selection. I'll provide a brief overview of applications and past work in this area before discussing some open problems.

YouTube

**Yunshu Du (July 9, 2020)**

*Lucid Dreaming for Experience Replay: Refreshing Past States with the Current Policy*

Experience replay (ER) improves the data efficiency of off-policy reinforcement learning (RL) algorithms by allowing an agent to store and reuse its past experiences in a replay buffer. While many techniques have been proposed to enhance ER by biasing how experiences are sampled from the buffer, thus far they have not considered strategies for refreshing experiences inside the buffer. In this work, we introduce Lucid Dreaming for Experience Replay (LiDER), a conceptually new framework that allows replay experiences to be refreshed by leveraging the agent’s current policy. LiDER 1) moves an agent back to a past state; 2) from there lets the agent try following its current policy to execute different actions—as if the agent were “dreaming” about the past, but is aware of the situation and can control the dream to encounter new experiences; and 3) stores and reuses the new experience if it turned out better than what the agent previously experienced, i.e., to refresh its memories. LiDER is designed to be easily incorporated into off-policy, multi-worker RL algorithms that use ER; we present in this work a case study of applying LiDER to an actor-critic based algorithm. Results show LiDER consistently improves performance over the baseline in four Atari 2600 games. Our open-source implementation of LiDER and the data used to generate all plots in this paper are available at [reveal after review period].

**Parash Rahman (July 13, 2020)**

*Stochastic Gradient Descent in a Changing World*

In this talk, I will discuss the online learning setting where a predictor learns to predict from a stream of data. This problem setting is important for real-world applications that wish to handle the inevitable changes of the real world. I will discuss the surprisingly mediocre adaptability of multilayer networks updated with stochastic gradient descent, which have otherwise been successful in modern applications. Finally, I will recommend the use of generate-&-test algorithms to improve performance in the predictor's future.

YouTube

**Yufeng Yuan (July 14, 2020)**

*Multimodal Observation Space for Robot Learning*

In this talk, we will explain what a multimodal observation space for robot learning is and how it differs from other, more commonly used observation spaces such as pixel observations. We then introduce some useful tricks for learning completely from pixels and investigate their effectiveness in the multimodal setting.

YouTube

**Han Wang (July 15, 2020)**

*Emergent Representations in Reinforcement Learning and Their Properties*

Representation learning remains one of the central challenges in reinforcement learning. Earlier representation learning work focuses on designing fixed-basis architectures to achieve desirable properties. However, several recent works suggest that representations emerge under appropriate training schemes, and thus that their properties should be determined by the data stream. In this work, we explore properties of representations trained end-to-end with different auxiliary tasks, provide novel insights regarding the auxiliary task effect, and investigate the relationship between properties and transfer learning performance.

YouTube

**Martha Steenstrup (July 16, 2020)**

*Control of Communications Networks: Tryin' to Make it RL Compared to What?*

To design effective algorithms for controlling the performance of a communications network, one must confront several challenges relating to environment dynamics, fidelity of observations, responsiveness, appropriate credit assignment, and cost-benefit trade-offs. We argue that reinforcement learning (RL) algorithms are well-suited to surmount these challenges, with supporting evidence garnered from our application of RL to the problem of congestion control, and we posit that rules of thumb for RL application, emerging from our case study, will generalize to other similar control problems.

YouTube

**Matthew Schlegel (July 20, 2020)**

*A first look at hierarchical predictive coding*

Predictions, specifically those of general value functions (GVFs), are of continued interest to the RLAI lab, leading to many lines of research and thought. While there have been many new algorithms for learning GVFs in recent years, there are still many questions around the use of GVFs. Hierarchical predictive coding (Rao, 1999) is a scheme that uses predictions to inhibit feed-forward signals through corrective feedback. It has garnered considerable interest in computational neuroscience communities, and it poses several challenges. In this talk, I will introduce the core concepts of hierarchical predictive coding. If time permits, I will also discuss an instantiation of the hierarchical predictive coding model using techniques from deep learning.

YouTube

**Alex Lewandowski (July 21, 2020)**

*Temporal Abstraction via Recurrent Neural Networks*

Environments come preconfigured with hyper-parameters, such as discretization rates and frame-skips, that determine an agent's window of temporal abstraction. In turn, this temporal window influences the magnitude of the action gap and greatly impacts learning. I will discuss ongoing work that uses a recurrent neural network to flexibly learn action sequences within a temporal window.

YouTube

**Shibhansh Dohare (July 22, 2020)**

*The Interplay of Search and Gradient Descent in Semi-stationary Learning Problems*

We explore the interplay of generate-and-test and gradient-descent techniques for solving supervised learning problems. We start by introducing a novel idealized setting in which the target function is stationary but much more complex than the learner, and in which the distribution of inputs is slowly varying. Then, we show that if the target function is more complex than the approximator, tracking is better than any fixed set of weights. Finally, we find that conventional backpropagation performs poorly in this setting, but that its performance can be improved if we use random search to replace low-utility features.

YouTube

**Dhawal Gupta (July 23, 2020)**

*Optimizations for TD*

I will talk about the possibility of using adaptive step-size techniques from the deep learning community for temporal-difference (TD) learning. Do adaptive step-size methods offer some respite from the divergence issues in TD learning, which arise mainly from the mismatch between behaviour and target policies? We discuss this on a small example, Baird's counterexample. This is more of a proposal talk, where I would like to discuss possible approaches to studying the problem and what steps we could take. Is this even something we should look into, or should we develop completely separate step-size techniques for TD learning?

YouTube

**Khurram Javed (July 27, 2020)**

*Learning Causal Models Online*

Online learning is an essential property of an intelligent system. Unlike an offline learned system, an online learning system can adapt to changes in the world. Moreover, if the learner has limited capacity, online tracking can achieve better performance even in a stationary world. However, online learning has yet to see the same level of success as batch learning has seen over the past decade. More specifically, a scalable online representation learning method has remained elusive.

In this talk, I will first give an overview of the online representation learning problem. I will then go over some of my recent work for discovering causal models online and propose a metric for detecting spurious features online. This metric can be combined with an online representation search algorithm to discover non-spurious features from sensory data. Finally, I will argue that by continually removing spurious features online, we can learn models that have strong generalization.

YouTube

**Alan Chan (July 28, 2020)**

*Problems with Fair ML*

This talk will be a mostly non-technical dive into problems that I find with a lot of fair ML research today. I will begin with some context, provide a characterization of fair ML, go through scenarios to tease out the problems with this characterization, and conclude with some closing questions to improve upon the work being done.

YouTube

**Raksha Kumaraswamy (July 29, 2020)**

*Stochastic Optimism and Exploration*

A predominant theme underlying many methods to promote exploratory behaviour in Reinforcement Learning is the idea of optimism. In this talk, we will take a closer look at a concrete instantiation of the idea through the lens of Stochastic Optimism. I will define Stochastic Optimism and describe the framework within which the concept has been proposed in the literature to induce effective exploratory behaviour in Reinforcement Learning.

YouTube

**Qingfeng Lan (August 4, 2020)**

*Predictive Representation Learning for Language Modeling*

To effectively perform the task of next-word prediction, Long Short Term Memory networks (LSTMs) must keep track of many types of information. Some information is directly related to the next word's identity, but some is more secondary (e.g. discourse-level features or features of downstream words). Correlates of secondary information appear in LSTM representations even though they are not part of an explicitly supervised prediction task. In contrast, Reinforcement Learning (RL) has found success in techniques that explicitly supervise representations to predict secondary information. Inspired by that success, we propose Predictive Representation Learning (PRL), which explicitly constrains LSTMs to encode specific predictions, like those that might need to be learned implicitly. We show that PRL 1) significantly improves two strong language modeling methods, 2) converges more quickly, and 3) performs better when data is limited. Our fusion of RL with LSTMs shows that explicitly encoding a simple predictive task facilitates the search for a more effective language model.

YouTube

**Banafshe Rafiee (August 6, 2020)**

*Classical Conditioning Testbeds for State Construction*

In this talk, I will introduce classical conditioning testbeds for studying the problem of state construction. These testbeds are modelled after tasks in psychology where an animal is exposed to a sequence of stimuli and has to construct an understanding of its state in order to predict what will happen next. The testbeds are proposed to study online multi-step prediction. I will provide results on the first testbed, characterizing a multitude of approaches including the common modern approaches as well as simpler methods inspired by models in animal learning.

YouTube

**Abhishek Naik (August 10, 2020)**

*Learning and Planning in Average-Reward MDPs*

In this talk, I will present a family of new learning and planning algorithms for average-reward MDPs. Key to these algorithms is the use of the TD error to update the reward rate estimate instead of the conventional error, enabling proofs of convergence in the general off-policy case without recourse to any reference states. Empirically, this generally results in faster learning, while reliance on a reference state generally results in slower learning and risks divergence. I will also present a general technique to estimate the actual centered value function rather than the value function plus an offset.
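The core idea of driving the reward-rate estimate with the TD error can be sketched in a tabular, on-policy prediction setting. This is only an illustrative sketch under stated assumptions, not the algorithms from the talk: the deterministic three-state ring, the function name `differential_td`, and the step sizes `alpha` and `eta` are all hypothetical choices.

```python
def differential_td(rewards, steps=20000, alpha=0.1, eta=1.0):
    """Tabular differential TD(0) on a deterministic ring of states.

    State s always moves to (s + 1) % n and emits rewards[s].
    """
    n = len(rewards)
    v = [0.0] * n      # differential value estimates
    r_bar = 0.0        # reward-rate estimate
    s = 0
    for _ in range(steps):
        s_next = (s + 1) % n
        # Average-reward TD error: reward minus reward rate plus value difference.
        delta = rewards[s] - r_bar + v[s_next] - v[s]
        v[s] += alpha * delta
        # The TD error, not (reward - r_bar), updates the reward-rate estimate.
        # The conventional alternative would be: r_bar += beta * (rewards[s] - r_bar)
        r_bar += eta * alpha * delta
        s = s_next
    return v, r_bar


# Hypothetical ring with rewards 0, 0, 3, so the true reward rate is 1.
values, reward_rate = differential_td([0.0, 0.0, 3.0])
```

In this toy problem the learned `values` are determined only up to an additive offset, which is the issue the centering technique mentioned above is meant to address.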

YouTube

**Ashley Dalrymple (August 11, 2020)**

*Pavlovian Control of Walking*

Spinal cord injury can cause paralysis of the legs. Restoring the ability to walk is of high importance to people with paralysis. In this talk I will introduce a spinal cord implant that our lab used to generate walking in a cat model. I will then describe how we used general value functions and Pavlovian control to produce highly adaptable over-ground walking. Come on out if you want to hear about how RL methods can be used to solve real world medical problems!

YouTube

**Alex Ayoub (August 12, 2020)**

*Model-Based Reinforcement Learning with Value-Targeted Regression*

I will discuss a model-based RL algorithm that is based on the optimism principle: in each episode, the set of models that are consistent with the data collected is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting values, as determined by the last value estimate, along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models.

YouTube

**Shivam Garg (August 13, 2020)**

*Log-likelihood Baseline for Policy Gradient*

Policy gradient methods have a critic baseline to reduce the variance of their estimate. In this talk, we will discuss a simple idea for an analogous baseline for the log-likelihood part of the policy gradient. First, we will show that the softmax policy gradient in the case of bandits can be written in two different but equivalent expressions, which will motivate the log-likelihood baseline. One of these expressions is the regular expression which is widely used and the other one doesn't seem to be popular (or even present?) in the literature. We will then show how these expressions can be extended to the full MDP case under certain assumptions.
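As background, the widely used expression with an ordinary reward baseline can be checked numerically in the bandit case. The sketch below covers only that standard expression, not the log-likelihood baseline proposed in the talk. It assumes deterministic rewards per arm, and the names `softmax` and `expected_gradient` as well as the example numbers are hypothetical.

```python
import math


def softmax(theta):
    """Numerically stable softmax over action preferences."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_gradient(theta, rewards, baseline):
    """Exact expectation of (R - b) * grad log pi(A) for A ~ softmax(theta)."""
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            # d log pi(a) / d theta_k for a softmax policy.
            score = (1.0 if k == a else 0.0) - pi[k]
            grad[k] += pi[a] * (rewards[a] - baseline) * score
    return grad


theta = [0.2, -0.5, 1.0]
rewards = [1.0, 0.0, 2.0]
g_no_baseline = expected_gradient(theta, rewards, 0.0)
g_baseline = expected_gradient(theta, rewards, 1.3)
```

Both calls return the same vector up to floating-point error, because the expected score is zero. A baseline therefore changes the variance of sampled estimates but not their expectation.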

YouTube

**Kirby Banman (August 17, 2020)**

*Regression nonstationarities as dynamical systems*

Many supervised learning algorithms are designed to operate under i.i.d. sampling. When those algorithms are applied to problems with nonstationary sampling, they can misbehave. Of course, this is not surprising if one takes time to understand the conditions under which an algorithm's behaviour is (or is not) guaranteed. Dynamical systems analysis offers us some tools to extend those guarantees to certain kinds of nonstationary sampling. This talk will exemplify these ideas in a simple setting: optimizing linear regression models with SGD+momentum under a simple periodic nonstationarity.

YouTube

**Manan Tomar (August 18, 2020)**

*Multi-step Greedy Reinforcement Learning Algorithms*

Multi-step greedy policies have been extensively used in model-based reinforcement learning (RL), both when a model of the environment is available (e.g., in the game of Go) and when it is learned. In this paper, we explore their benefits in model-free RL, when employed using multi-step dynamic programming algorithms: $\kappa$-Policy Iteration ($\kappa$-PI) and $\kappa$-Value Iteration ($\kappa$-VI). These methods iteratively compute the next policy ($\kappa$-PI) and value function ($\kappa$-VI) by solving a surrogate decision problem with a shaped reward and a smaller discount factor. We derive model-free RL algorithms based on $\kappa$-PI and $\kappa$-VI in which the surrogate problem can be solved by any discrete or continuous action RL method, such as DQN and TRPO. We identify the importance of a hyper-parameter that controls the extent to which the surrogate problem is solved and suggest a way to set this parameter. When evaluated on a range of Atari and MuJoCo benchmark tasks, our results indicate that for the right range of $\kappa$, our algorithms outperform DQN and TRPO. This shows that our multi-step greedy algorithms are general enough to be applied over any existing RL algorithm and can significantly improve its performance.
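The surrogate problem can be written out as below. This is a sketch, assuming the $\kappa$-greedy formulation used in the multi-step greedy literature. The symbols $r$ (reward), $\gamma$ (original discount factor), and $v$ (current value estimate) are notation assumed here and do not appear in the abstract.

```latex
\pi_{\kappa}(v) \in \arg\max_{\pi}\;
\mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} (\kappa\gamma)^{t}
\bigl(r(s_t, a_t) + (1-\kappa)\,\gamma\, v(s_{t+1})\bigr)\right],
\qquad \kappa \in [0, 1].
```

Under this form, the shaped reward is $r + (1-\kappa)\gamma v$ and the smaller discount factor is $\kappa\gamma$. Setting $\kappa = 0$ recovers the usual one-step greedy policy with respect to $v$, and $\kappa = 1$ recovers the original discounted problem.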

YouTube

**Robin Ranjit Singh Chauhan (August 19, 2020)**

*TalkRL and Other Projects*

Robin will share some highlights and learnings from a year of interviewing RL researchers on the TalkRL podcast. If there is time, we can take a quick look at a few of his current and past RL-related projects.

YouTube

**Fernando Hernandez Garcia (August 20, 2020)**

*The Cascade-Correlation Learning Architecture: The Forgotten Network*

In 1990, Scott E. Fahlman and Christian Lebiere proposed a constructive neural network architecture, named the cascade-correlation architecture, as an alternative to training deep neural networks with fixed architectures using backpropagation. Despite showing promising results and spurring several follow-up papers, it does not enjoy much popularity nowadays in the deep learning community. In this talk, I will revisit the cascade-correlation architecture in an attempt to answer the question: why is it not popular anymore? In the process, I'll present several empirical results that demonstrate its performance under several settings and in different domains. I will follow up with a discussion of several disadvantages of cascade-correlation that have been identified in the literature, as well as several extensions that have been proposed to address each of them. Finally, I will conclude by arguing why we should care about cascade-correlation in our group.

YouTube

**Matthew McLeod (August 24, 2020)**

*Intrinsically Motivated GVF Agent*

Intrinsic Motivation and GVFs are two exciting areas in the field of RL. In this talk, we will discuss the intersection of these two subfields and why they may be complementary to each other. We will analyze this problem with a tabular MDP and discuss some interesting initial results.

YouTube

**Shiva Soleimany (August 26, 2020)**

*Improving Sim-to-real transfers using computational creativity*

The talk is about narrowing down the reality gap using an adversarial agent that generates creative novel environments.

YouTube

**Katya Kudashkina (August 27, 2020)**

*Model-based reinforcement learning with one-step expectation models*

A one-step expectation model of the environment dynamics produces an estimate of the expected next state. This is less general than estimating the full distribution of possible next states, or a random sample thereof, and more general than modeling the world as deterministic. Expectation models are limited in the kinds of planning operations and value-function approximations they can use, but are well suited to being learned. We discuss what is known about expectation models in the context of model-based reinforcement learning and states that are non-Markov. We show that planning with expectation models can be done only with state values and not action values.
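One way to see why the choice of value-function approximation matters is the following sketch. It assumes a value function that is linear in the state features, an assumption made here for illustration rather than stated in the abstract. The feature vectors, probabilities, and weights are hypothetical.

```python
def linear_value(w, x):
    """Value estimate that is linear in the feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def squared_value(w, x):
    """A nonlinear value estimate, used only for contrast."""
    return linear_value(w, x) ** 2


# Hypothetical next-state feature vectors and their probabilities.
next_features = [[1.0, 0.0], [0.0, 2.0]]
probs = [0.25, 0.75]
w = [2.0, -1.0]

# What a one-step expectation model would output: the expected next features.
expected_features = [
    sum(p * x[i] for p, x in zip(probs, next_features)) for i in range(2)
]

# Linear case: value of the expected next state equals the expected next value.
value_of_expectation = linear_value(w, expected_features)
expectation_of_value = sum(
    p * linear_value(w, x) for p, x in zip(probs, next_features)
)

# Nonlinear case: the two quantities generally differ.
nonlinear_of_expectation = squared_value(w, expected_features)
expectation_of_nonlinear = sum(
    p * squared_value(w, x) for p, x in zip(probs, next_features)
)
```

With a linear value function, backing up from the expected next state loses nothing relative to backing up from the full next-state distribution. With the squared value function the two backups disagree, so an expectation model alone would not suffice.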

YouTube

*The 2020 tea time talks were coordinated by Abhishek Naik (anaik1 AT ualberta DOT ca).*