**Matt Taylor (June 15, 2022)**

*Humans: Who needs them?*

Reinforcement learning is amazing. But where does the MDP actually come from? In this talk, I’m going to argue that humans are critical to the RL lifecycle. Moreover, I’ll argue that significant improvements to existing explainability techniques are required to fully invest humans, whether lay people, subject matter experts, or machine learning experts, with the ability required to drive this human-agent interaction for Human-In-The-Loop RL.

**Rupam Mahmood (June 16, 2022)**

Today I will briefly describe different research areas our group is working on. Then I will discuss the research theme that unifies all these areas. Specifically, I will talk about the long-term scientific goal of understanding the fundamental underlying general principles that govern the behavior of a freely living animal by constructing one. To some extent, many in the field of AI are interested in the quest for “general intelligence.” But somewhat disappointingly, I will argue that this specific goal comes with many essential but oft-ignored constraints and thus is different than the mainstream quest for general intelligence. I will list the constraints with the hope that we can discuss one or two today. And if time allows I will also argue that these constraints may lead to better systems and learning mechanisms for practical use.

**Alex Lewandowski (June 21, 2022)**

*Value-Based Meta-Learning*

I will argue that reinforcement learning provides the right toolset for the general problem of meta-learning. This talk will present some of my recent work that interprets iterative learning algorithms as Markov Reward Processes. The value function of this process, or Meta Value Function (MVF), predicts the future performance of a model. This value-based perspective on meta-learning is more general than meta-gradients and avoids costly differentiation through many steps of gradient descent. I will demonstrate, in a simple multi-task regression problem, that MVFs can be learned to accurately predict performance far into the future. I will conclude by showing how to simultaneously learn the MVF and use it to meta-learn the model's initialization.

**Khurram Javed (June 22, 2022)**

*Practical Hyper-parameter free Online Linear Learning*

In this talk, I will argue that algorithms that require hyper-parameter tuning are ineffective for continual learning in large worlds, and that we need to design algorithms that are either hyper-parameter free, or can adjust their hyper-parameters using experience. I will then show that even in the simple linear supervised learning setting, we do not have a robust algorithm that does not require tuning the step-size parameter. I will demonstrate why existing step-size normalization techniques, such as Adam and RMSProp, are insufficient, and show that feature normalization combined with meta-gradients can result in a linear learner that works well for the same choice of the initial step-size. Finally, I will share some modifications to make the meta-learning algorithm robust to the value of the meta-step-size parameter. The resultant system performs well on a wide range of linear supervised learning problems without extensive hyper-parameter tuning and can act as an important building block for larger non-linear learning systems.

**Liam Peet-Pare (June 23, 2022)**

*Fairness for Minority Groups via Distributionally Robust Performative Prediction*

In recent years machine learning (ML) models have begun to be deployed at enormous scales, but too often without adequate concern for whether or not an ML model will make fair decisions. Fairness in ML is a burgeoning research area, but work to define formal fairness criteria has some serious limitations. This work aims to combine and explore two areas of research in ML – distributionally robust optimization (DRO) and performative prediction – in an attempt to resolve some of these limitations. Performative prediction is a recent framework developed to understand what happens when deploying a model influences the distribution on which it makes predictions, an important concern for fairness. Research on performative prediction has thus far only examined risk minimization, however, which has the potential to result in discriminatory models when working with heterogeneous data composed of majority and minority subgroups. We examine performative prediction with a distributionally robust objective instead. We discuss convergence and fairness properties of performative prediction with a DRO objective as compared to a risk minimization objective.

**Andy Patterson (June 28, 2022)**

*Confidently Incorrect - Challenges in experiment design for RL*

Designing proper experiments to study reinforcement learning agents and algorithms is challenging! We've all heard that we should be testing our algorithms with far more than 3 random seeds, but why? In this talk, I will be revisiting some rudimentary statistics and their impact on RL algorithm analysis. I will argue that the dogma of "more seeds!" is necessary but not sufficient, illustrating a common failure mode where more data can harm conclusions.

**Vincent Liu (June 29, 2022)**

*From Survey Sampling to Non-stationary Off-Policy Policy Evaluation*

We consider off-policy policy evaluation (OPE) for contextual bandits in the non-stationary setting. We first introduce and draw the connection between OPE and a related field called survey sampling, which considers estimating unknown parameters of a population. Inspired by survey sampling, we introduce a variant of the popular doubly robust (DR) estimator, called the regression-assisted DR estimator, that unifies several existing OPE methods and improves on them with the use of auxiliary information and a regression approach. We prove several asymptotic properties of the estimator and empirically show that the estimator provides tight and valid interval estimates with finite data in non-stationary environments.
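
For readers unfamiliar with the baseline, the standard doubly robust estimator for a stationary contextual bandit can be sketched as below. This is the textbook DR estimator, not the regression-assisted variant from the talk; the function name `dr_estimate`, the logged-data layout, and the callable signatures are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

# One logged interaction (hypothetical layout):
# (context, action taken, observed reward, behaviour-policy probability of that action)
LoggedStep = Tuple[int, int, float, float]


def dr_estimate(
    logs: Sequence[LoggedStep],
    target_policy: Callable[[int, int], float],  # pi(a | x)
    reward_model: Callable[[int, int], float],   # r_hat(x, a), any regression estimate
    actions: Sequence[int],
) -> float:
    """Standard doubly robust value estimate for a contextual bandit."""
    total = 0.0
    for x, a, r, mu in logs:
        # Model-based term: modelled reward averaged under the target policy.
        direct = sum(target_policy(x, b) * reward_model(x, b) for b in actions)
        # Importance-weighted correction for the model's error on the logged action.
        correction = target_policy(x, a) / mu * (r - reward_model(x, a))
        total += direct + correction
    return total / len(logs)
```

With a reward model that is identically zero, the expression reduces to plain importance sampling; with a perfect reward model, the correction term vanishes.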

**Yuxin Liu (June 30, 2022)**

*Language Model gives a reward function*

The conventional training of language models is known to have exposure bias issues, and Reinforcement Learning is a popular way to alleviate them. However, the reward function is not clear for text generation. In this talk, we show that the language model training implicitly gives a reward function for the task it is trained on. The reward function derived from a language model assigns each token a real-valued reward in sequence generation. By applying this reward function, we can further improve the language model using RL, potentially outperforming other methods.

**Yi Wan (July 5, 2022)**

*Toward Discovering Options that Achieve Faster Planning*

I will introduce a new objective for option discovery that emphasizes the computational advantage of using options in planning. For a given set of episodic tasks and a given number of options, the objective prefers options that can be used to achieve a high return by composing few options. By composing few options, fast planning can be achieved. When faced with new tasks similar to the given ones, the discovered options are also expected to accelerate planning. An algorithm can be designed to maximize the objective. I will show, in the four-room domain, some empirical results of the algorithm.

**Haseeb Shah (July 7, 2022)**

*Online Feature Decorrelation*

A significant proportion of the representations learned by the current generate & test methods consist of highly redundant features. In this talk, I will demonstrate how the feature ranking criteria utilized by these methods are highly ineffective in addressing this problem, and present two approaches for decorrelating features in an online setting. Empirical evidence suggests that these decorrelators are able to eliminate redundant features as well as produce a statistically significant improvement in performance when used to learn representations in the low-capacity function approximation setting.

**Prabhat Nagarajan (July 12, 2022)**

*On Directing Behavior to Learn Collections of Subtasks*

This talk explores the problem of learning a collection of subtasks in parallel from a single stream of experience. The motivating hypothesis is that successful goal-achieving continual learning agents in complex environments are ones that can perform tasks and make predictions within their environment. The problem of gathering experience for learning multiple subtasks can be formalized as a large Markov decision process. We simplify this problem setting into a tractable non-stationary reinforcement learning problem. In this setting, we have a behavior learner which learns to act in the environment and acquire experiences. We then have several subtask learners use these acquired experiences to update their parameters off-policy in parallel. The behavior learner is rewarded with a surrogate measure of subtask learning progress: weight change. We also propose a new method, non-stationary replay, as a potential improvement in this non-stationary setting.

**Matthew Schlegel (July 13, 2022)**

*Predictions Predicting Predictions*

Predicting the sensorimotor stream has consistently been a key component for building general learning agents. Whether through predicting a reward signal to select the best action or learning a predictive world model with auxiliary tasks, prediction making is at the core of reinforcement learning. One of the main research directions in predictive architectures is in the automatic construction of learning objectives and targets. The agent can consider any real-valued signal as a target when deciding what to learn, including the current set of internal predictions. A prediction whose learning target is another prediction is known as a composition. Arbitrarily deep compositions can lead to learning objectives that are unstable or not suitable for function approximators. In this talk, we will discuss the work presented in RLDM about predictions predicting predictions which predict predictions.

**Fengdi Che (July 14, 2022)**

*Averaged Discount Factor Correction in the State Visitation Distribution*

The policy gradient theorem gives a convenient form of the policy gradient in terms of three factors: an action value, a gradient of the action likelihood, and a state distribution involving discounting called the discounted stationary distribution. But commonly used on-policy policy gradient estimators based on the policy gradient theorem account for the undiscounted state visitation distribution or the so-called state stationary distribution. Although these estimators work well on many tasks in practice, ignoring the discount factor in the state distribution may cause convergence to a sub-optimal policy. An existing solution corrects this discrepancy by using powers of the discount factor in the gradient estimate. Yet this solution is not widely adopted and does not work well in cases where the later states are similar to earlier states. We introduce a new way of accounting for the discounted stationary distribution that can be plugged in easily to many existing estimators to make them technically correct and also lower in variance than the existing correction. We empirically show on the CartPole task that the inclusion of our corrective estimator avoids the performance degradation caused by the existing correction.
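
In symbols, the distinction reads roughly as follows. The notation ($d_\gamma^{\pi}$, $q_\pi$, $G_t$) is an assumption following common textbook usage and is not necessarily the talk's own. The policy gradient theorem weights states by a discounted distribution:

$$
\nabla_\theta J(\theta) \;\propto\; \sum_{s} d_\gamma^{\pi}(s) \sum_{a} \pi(a \mid s; \theta)\, q_\pi(s, a)\, \nabla_\theta \log \pi(a \mid s; \theta),
\qquad
d_\gamma^{\pi}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr(S_t = s \mid \pi).
$$

A commonly used sample estimate drops the discounting on the state distribution, while the existing correction restores it with powers of the discount factor:

$$
\hat{g}_{\text{common}} = \sum_{t=0}^{T-1} G_t\, \nabla_\theta \log \pi(A_t \mid S_t; \theta),
\qquad
\hat{g}_{\text{corrected}} = \sum_{t=0}^{T-1} \gamma^{t}\, G_t\, \nabla_\theta \log \pi(A_t \mid S_t; \theta).
$$

The factor $\gamma^{t}$ shrinks the contribution of late time steps toward zero, which is one way to see why the corrected estimate can learn poorly about states that are mainly visited late in an episode.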

**Muhammad Kamran Janjua (July 26, 2022)**

*On Marriage of Offline & Online Reinforcement Learning*

Offline-Online Reinforcement Learning is a relatively new paradigm where an offline learned agent is allowed to adapt online. The goal is to allow faster convergence with minimal updates required. This setting is a natural next step to offline learning where a fixed policy does not get to incorporate feedback in order to adjust itself during deployment. In this talk, I will try to motivate that this setting is important in terms of online learning goals since an offline-online RL agent digests abundant offline data and then also gets to interact with the world in hopes of fine-tuning itself to better adapt to the environment at hand. I will also highlight that one important approach to this is representation learning where the goal is to learn feature representations that facilitate minimal updates online for faster learning.

**Mathis Federico (July 27, 2022)**

*Hierarchical partial behaviour explanations, a framework to transfer knowledge back to humans?*

In the RL community, much effort is made to transfer knowledge from humans to agents, aiming at speeding up their learning, but much less is done the other way around. However, agents are known to be better than humans at some tasks like playing Chess, Go or most reflex-based Atari games. As agents keep getting better, one might ask the question: How can we learn from those superhuman agents? This simple question leads to more and more questions (pls help): How do humans even explain things? Could we formalize it? Could we quantify the quality, the importance, the priority of an explanation? Does this depend on the agent, the environment? This very exploratory work ponders those questions and makes a few small steps by defining a framework for behaviour explanations as hierarchical and partial graphs, allowing explanation complexity to be quantified in the sense of Kolmogorov complexity. This work experiments on a new environment (Crafting) abstracted from the popular game Minecraft, and we see an interesting relation appear between time to RL agent success and explanation complexity.

**Michael Przystupa (July 28, 2022)**

In this talk, we will discuss our recent research on latent action models for robotic control tasks, where one desires a mapping that allows control with a low dimensional input space. Previous works have focused on conditional autoencoders (CAE) as the de-facto latent action model choice. These models compress action representations into a nonlinear space and compensate for missing information by providing state information to the decoder model. CAEs have been successful in both robotic teleoperation and reinforcement learning. However, in our experience, CAEs often produce unintuitive and difficult-to-control interfaces. Enforcing useful properties requires explicit constraints in the loss function while training CAE latent action models. In our research, we propose a hierarchical representation that converts the problem of learning low-dimensional actions into predicting linear subspaces given relevant task-specific information. This inductive bias exploits the natural symmetries in robotic control problems where inverse actions are often available. We call these models state conditioned linear (SCL) maps. Our system guarantees valuable properties for teleoperation systems with this decomposition between action prediction and context conditioning. Experimentally, we find that for nonlinear tasks where actions do not exist in a single linear subspace, SCL maps are more effective than alternative latent action models.

**David Tao (August 2, 2022)**

*Agent-State Construction with Auxiliary Inputs*

In most realistic sequential decision-making tasks, the decision-making agent is not able to model the full complexity of the world. In reinforcement learning, the environment is often much larger and more complex than the agent, a setting also known as partial observability. In such settings, the agent must leverage more than just the current sensory inputs; it must construct an agent state that summarizes the agent’s previous interactions with the world. Currently, the most common approach to tackle such a problem is to learn the agent-state function with a recurrent network. This is done with the agent's sensory stream as input, which is often augmented with transformations of the agent's observation. These augmentations are done in multiple ways, from simple approaches like concatenating observations to more complex ones such as uncertainty estimates or predictive representations. Nevertheless, although ubiquitous in the field, these additional inputs, which we term auxiliary inputs, are rarely emphasized, and it is not clear what their role or impact is. In this paper we formalize a framework for agent-state construction with auxiliary inputs and we present examples of auxiliary inputs that capture the past, present, and the future of the agent-environment interaction. We show that auxiliary inputs allow an agent to discriminate between observations that would otherwise be aliased, leading to more expressive features that smoothly interpolate between different states. We empirically test this agent-state construction method with different function approximators, using different instantiations of these auxiliary inputs across a variety of tasks. Our approach is complementary to state-of-the-art methods such as recurrent neural networks, and acts as a heuristic that facilitates longer temporal credit assignment, reducing the number of time steps needed when performing truncated backpropagation through time and leading to better performance.

**Kenny Young (August 4, 2022)**

*Towards a Better Understanding of the Benefit of Learning a Parametric Model for Reinforcement Learning*

Model-based reinforcement learning is widely believed to have the potential to drastically improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay can be thought of as a particularly simple kind of model, which effectively samples from an empirical MDP consisting of only observed transitions. This simple strategy has proved extremely effective at improving the stability and efficiency of deep reinforcement learning. One may imagine a learned parametric model could further improve on this by generalizing from real experience to augment the dataset with additional plausible experience that does not appear explicitly. However, given that learned value-functions can also generalize, it is not immediately clear why we should expect model generalization to be inherently better than value-function generalization. Here I present a simple result which motivates why we might expect model generalization to be more useful than value-function generalization. Roughly, this result amounts to showing that when using a dataset to narrow down possible value functions, we can generally narrow it down more if we first use the data to narrow down the set of possible models, and then consider value functions consistent with those models, than if we only demand the value function obeys the Bellman optimality equation with respect to the observed transitions.
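
The view of experience replay as a simple model can be made concrete with a small sketch. The class name `EmpiricalModel` and its methods are hypothetical; the sketch only illustrates that sampling stored transitions amounts to sampling from an empirical MDP, and that such a model says nothing about state-action pairs it has never seen.

```python
import random
from typing import Dict, Hashable, List, Tuple

Outcome = Tuple[float, Hashable]  # (reward, next state)


class EmpiricalModel:
    """Replay buffer viewed as a model: it can only replay observed transitions."""

    def __init__(self, seed: int = 0) -> None:
        self.outcomes: Dict[Tuple[Hashable, Hashable], List[Outcome]] = {}
        self.rng = random.Random(seed)

    def add(self, state: Hashable, action: Hashable, reward: float, next_state: Hashable) -> None:
        # Store the observed outcome under its state-action pair.
        self.outcomes.setdefault((state, action), []).append((reward, next_state))

    def sample(self, state: Hashable, action: Hashable) -> Outcome:
        # Uniform sampling over stored outcomes reproduces the empirical
        # transition and reward frequencies for this state-action pair.
        # An unseen pair raises KeyError: the model does not generalize.
        return self.rng.choice(self.outcomes[(state, action)])
```

A learned parametric model would replace the `KeyError` case with a prediction, which is the kind of generalization the talk compares against value-function generalization.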

**Erfan Miahi (August 9, 2022)**

*Investigating the Properties of Deep Reinforcement Learning Representations that Do and Do Not Generalize Well*

In this work, we investigate the connection between the properties and the generalization performance of representations learned by deep reinforcement learning algorithms. Much of the earlier work on representation learning for reinforcement learning focused on designing fixed-basis architectures to achieve properties thought to be desirable, such as orthogonality and sparsity. In contrast, the idea behind deep reinforcement learning methods is that the agent designer should not encode representational properties, but rather that the data stream should determine the properties of the representation---good representations emerge under appropriate training schemes. We bring these two perspectives together, empirically investigating the properties of representations that are good at generalization in reinforcement learning. This analysis allows us to provide novel hypotheses regarding the impact of auxiliary tasks in end-to-end training of deep reinforcement learning methods. We introduce and measure six representational properties over more than 28 thousand agent-task settings. We consider DQN agents with convolutional networks in a pixel-based navigation environment. We develop a method to better understand why some representations improve generalization, through a systematic approach varying task similarity and measuring and correlating representation properties with generalization performance. Using this insight, we design two novel auxiliary losses and show that they generalize as well as our best baselines.

**Samuele Tosatto (August 10, 2022)**

*A Gradient Critic for Policy Gradient Estimation*

The policy gradient theorem prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient. In practice, most algorithms based on this theorem break this assumption, introducing a distribution shift that can cause convergence to poor solutions. In this talk, we argue that classic policy gradients are Monte-Carlo estimators; therefore, they suffer from high variance and become problematic when samples are off-policy. We propose a novel gradient estimator based on a gradient Bellman equation. This Bellman equation allows redefining the policy gradient in recursive terms and approximating it using temporal-difference techniques. Our estimator, called gradient critic, can be efficiently used to improve the policy. We will discuss limitations and possible future development of our method.

**Mehran Taghian Jazi (August 11, 2022)**

**Manan Tomar (August 16, 2022)**

*Incomplete Ideas on Minimal Representations, Neural Assemblies and Sparsity*

This talk will comprise two parts. In part one, I will discuss a concrete idea, that of learning minimal representations, i.e. ones that create an information bottleneck. I will predominantly discuss results for vision domains but I hope to extend these to the RL setting. In part two, I will discuss related ideas that have intrigued me in terms of their application for RL but are still incomplete or half-baked. These involve learning assemblies of neurons, sparse activations, and using random connections. I hope this can spark some interesting discussion and lead to more practical ideas, ones we can quickly test on small domains.

**Gautham Vasan (August 17, 2022)**

*Reset as an action*

Many deep reinforcement learning approaches rely on task-specific prior knowledge to carefully engineer a guiding reward (also known as a dense reward) signal. This can often bias the solution that the agent can find. For many guiding reward tasks, there is a related minimum-time task (i.e., a reward of -1 for each timestep) that is much easier to specify and still captures the desired behavior of agents. In addition, when the minimum-time formulation is used, the agent can discover novel and potentially superior solutions. However, the minimum-time formulation is thought to be hard to solve. This is partly due to poor exploration using the initial random policy, which results in the agent not reaching the goal quickly and often enough. It is common to use a timeout, after which the environment is reset, to diversify the agent’s experience. We identify the timeout as a solution parameter: when tuned appropriately, it can help the agent reach the goal more often, and we may even solve complex minimum-time tasks. However, the choice of hyper-parameters of a learning algorithm and the overall learning performance are not invariant to the timeout. In this work, we propose a novel approach called “reset as an action”, where an agent learns when to invoke reset instead of relying on a fixed timeout. Our approach uses Soft Actor Critic to learn both a control policy and when to invoke reset simultaneously, thus getting rid of one tunable parameter, the timeout. We evaluate our proposed approach on a range of simulated benchmark tasks and challenging vision-based real-world tasks involving the UR5 robot arm and two mobile robots: iRobot Create2 and Anki Vector.
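
As a rough illustration of the interface only, the toy sketch below exposes reset as one extra discrete action on top of a minimum-time task. Everything here is an assumption made for illustration: the `ChainEnv` toy task, the `ResetAsAction` wrapper, and in particular the cost charged for invoking reset. It does not reproduce the talk's Soft Actor Critic setup or its robot tasks.

```python
import random
from typing import Tuple


class ChainEnv:
    """Toy 1-D chain with a minimum-time reward: -1 per step until the right-most cell."""

    def __init__(self, length: int = 5, seed: int = 0) -> None:
        self.length = length
        self.rng = random.Random(seed)
        self.pos = 0

    def reset(self) -> int:
        # Random start state, never on the goal cell.
        self.pos = self.rng.randrange(self.length - 1)
        return self.pos

    def step(self, action: int) -> Tuple[int, float, bool]:
        # action 1 moves right, any other action moves left (clipped to the chain).
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        return self.pos, -1.0, done


class ResetAsAction:
    """Adds one extra action that resets the wrapped environment (hypothetical wrapper)."""

    def __init__(self, env: ChainEnv, n_actions: int, reset_cost: float = -1.0) -> None:
        self.env = env
        self.reset_action = n_actions  # index of the added action
        self.reset_cost = reset_cost   # assumed cost; the talk does not specify one here

    def reset(self) -> int:
        return self.env.reset()

    def step(self, action: int) -> Tuple[int, float, bool]:
        if action == self.reset_action:
            # The agent chose to reset: draw a new start state, pay the assumed cost,
            # and continue without terminating.
            return self.env.reset(), self.reset_cost, False
        return self.env.step(action)
```

With this interface there is no fixed timeout to tune: whether and when to abandon the current attempt becomes part of what the policy learns.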

**Shahin Atakishiyev (August 23, 2022)**

*Development of explainable reinforcement learning approaches for safe autonomous driving*

There has been growing interest in the development of autonomous driving systems over the last decade due to empirical successes of deep learning (DL) and reinforcement learning (RL) approaches. However, recent accident reports and safety concerns prevent the widespread deployment and commercialization of autonomous vehicles. As AI methods, particularly RL techniques, power sequential decisions of autonomous cars, regulatory organizations analyze the intelligent driving system of a vehicle to understand actual causes of accidents. Hence, there is an expectation from stakeholders and regulatory bodies that AI-based operations of autonomous vehicles, especially in critical traffic scenarios, should be explainable on top of being acceptably safe. In this context, we are developing explainable reinforcement learning (XRL) approaches that enable self-driving cars to make safe and explainable real-time decisions.

**Shivam Garg (August 24, 2022)**

*An Ablation Study of Optimization Techniques and Surrogate Objectives for Policy Gradient Methods*

In this talk, we will empirically explore how different optimization techniques (constrained optimization vs regularization) affect the performance of an agent using different policy gradient (PG) surrogate objective functions in the simplest setting. We will consider three different surrogate objectives (corresponding to the PG algorithms TRPO, MDPO, and a new variant sMDPO), an agent with a tabular softmax policy, and a tabular environment.

**Rich Sutton (August 25, 2022)**

*The Alberta Plan for AI Research*

I will present a strategic research plan based on the premise that a genuine understanding of intelligence is imminent and, when it is achieved, will be the greatest scientific prize in human history. To contribute to this achievement and share in its glory will require laser-like focus on its essential challenges; identifying those, however provisionally, is the objective of the Alberta Plan. The overall setting is the familiar one common to many fields (reinforcement learning, psychology, control theory, economics, neuroscience, and operations research): a computationally-limited agent interacts with a vastly more complex environment to maximize reward. The agent’s machinery is divided into four parts: 1) that which maintains the agent’s situational state (perception), 2) that which maps state to action (policy), 3) that which maps state to expected future reward (value function), and 4) that which maps imagined states and actions to next states (transition model) and enables planning. The Alberta Plan extends this common view to include feature-based subtasks and temporally extended options to solve them; the policy and the value function each become multiple, one for each of the subtasks and the main task. The setting is then potentially complete and the focus shifts to finding the right abstractions, in state (features) and time (options), and to planning efficiency. The Alberta Plan incorporates continual learning and meta-learning into all of its 12 steps, and expends no effort trying to capture domain knowledge.

*The 2022 tea time talks were coordinated by Sheila Schoepp (sschoepp AT ualberta DOT ca).*