Rich Sutton (May 28, 2018)
The Next Big Step in AI: Planning with a Learned Model

Martha White (May 30, 2018)
Improving Regression Performance with Distributional Losses
Motivated by success in distributional RL and in label augmentation in classification, we investigate distributional losses in regression. We propose a new family of distributional losses for regression, which are efficient to compute. We investigate empirically if such losses provide performance improvements and why, potentially shedding some light on the successes in distributional RL and in classification.

Kris De Asis (May 31, 2018)
Predicting Periodicity with Temporal Difference Learning
In reinforcement learning settings, the discount rate is often used to specify the horizon of interest in a cumulative reward sequence. It determines how far into the future a temporal difference learning agent predicts. In this talk, we imagine alternative uses of the discount rate with insight from digital signal processing.

Zach Holland (June 4, 2018)
The Effect of Planning Shape on Dyna-style Planning in High-dimensional State Spaces
Dyna is an architecture for reinforcement learning agents that interleaves planning, acting, and learning in an online setting. Dyna aims to make fuller use of limited experience to achieve better performance with fewer environmental interactions. In Dyna, the environment model is typically used to generate one-step transitions from selected start states. We applied one-step Dyna to several games from the Arcade Learning Environment and found that the model-based updates offered little benefit, even with a perfect model. However, when the model was used to generate longer trajectories of simulated experience, performance improved dramatically. This observation also holds when using a model that is learned from experience; even though the learned model is flawed, it can still be used to accelerate learning.

Adam White (June 7, 2018)
Organizing Experience: A Deeper Look at Replay Mechanisms for Sample-based Planning in Continuous State Domains

Wesley Chung (June 14, 2018)
Rejection sampling for off-policy learning
Off-policy learning, the ability to learn about policies other than the one currently in use, is an important but difficult problem in reinforcement learning. A common technique for this task is to use importance sampling. Though simple, it can lead to undesirable updates with high variance. This talk will present an alternative, rejection sampling, and explore its merits.
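
The contrast between the two techniques can be sketched in a deliberately simplified setting. The example below is an illustration only, not the method presented in the talk: it assumes a two-armed bandit with hypothetical policies `behavior` and `target`, and estimates the target policy's expected reward from data drawn under the behavior policy. Importance sampling reweights every sample by the ratio of target to behavior probability, while rejection sampling keeps each sample with probability proportional to that ratio and then averages the kept samples without weights.

```python
import random


def estimate_target_value(samples, target, behavior, rng):
    """Return (importance-sampling estimate, rejection-sampling estimate, accepted count)."""
    # Upper bound on target(a) / behavior(a); rejection sampling needs it
    # so that every acceptance probability is at most 1.
    max_ratio = max(target[a] / behavior[a] for a in behavior)
    weighted_total = 0.0
    accepted = []
    for action, reward in samples:
        ratio = target[action] / behavior[action]
        # Importance sampling: keep every sample, reweighted by the ratio.
        weighted_total += ratio * reward
        # Rejection sampling: keep the sample with probability ratio / max_ratio.
        if rng.random() < ratio / max_ratio:
            accepted.append(reward)
    is_estimate = weighted_total / len(samples)
    rs_estimate = sum(accepted) / len(accepted) if accepted else float("nan")
    return is_estimate, rs_estimate, len(accepted)


rng = random.Random(0)
behavior = {"left": 0.5, "right": 0.5}     # hypothetical data-collecting policy
target = {"left": 0.1, "right": 0.9}       # hypothetical policy to evaluate
mean_reward = {"left": 0.0, "right": 1.0}  # assumed reward means per action

samples = []
for _ in range(20000):
    action = rng.choices(list(behavior), weights=list(behavior.values()))[0]
    samples.append((action, mean_reward[action] + rng.gauss(0.0, 0.1)))

true_value = sum(target[a] * mean_reward[a] for a in target)
is_estimate, rs_estimate, n_accepted = estimate_target_value(samples, target, behavior, rng)
print(true_value, is_estimate, rs_estimate, n_accepted)
```

In this sketch, rejection sampling discards part of the data but every retained sample carries unit weight, whereas importance sampling uses all of the data at the cost of weights that can be large when the two policies differ.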

Yuxi Li (June 18, 2018)
(Deep) Reinforcement Learning: Challenges and Opportunities
In this talk, I will review several recent achievements, discuss several issues, and propose several potential research projects as opportunities, for deep reinforcement learning. The talk will be an open discussion.

Chen Ma (June 20, 2018)
Universal Successor Representations for Transfer Reinforcement Learning

Ajin Joseph (June 21, 2018)
Model based search methods and their application to RL
The objective of the talk is to introduce a particular class of zero-order optimization methods called model-based search methods, which are gradient-free techniques for generating high-quality solutions to (deterministic or stochastic) optimization problems. These methods have the distinctive ability to operate in a black-box setting, where the analytic closed-form expression of the objective function is presumed unavailable, and hence they do not depend on the structural properties of the objective function. In this talk, I explore various algorithms of this class, their efficient extensions, and their application to reinforcement learning.

Craig Sherstan (June 25, 2018)
Generalizing Value Estimation over Timescale
General value functions (GVFs) are an approach to representing models of an agent's world as a collection of predictive questions. A GVF is defined by: a policy, a prediction target, and a timescale. Traditionally predictions for a given timescale must be specified by the engineer and each timescale learned independently. Here we present γ-nets, a method for generalizing value function estimation over timescale, allowing a given GVF to be trained and queried for any fixed timescale. The key to our approach is to use timescale as one of the network inputs. The prediction target for any fixed timescale is then available at every timestep and we are free to train on any number of timescales. We present preliminary results on simple test signals.

Nidhi Hegde (June 27, 2018)
Learning in Distributed systems
The talk will present some work by researchers in economics on distributed learning. The goal is to give a different perspective to "learning by trial and error".

Muhammad Zaheer (June 28, 2018)
Revisiting tabular reinforcement learning for data-efficient control in complex high-dimensional environments
In recent years, significant efforts have been made to augment reinforcement learning with the representation power of neural networks. While this synthesis has enabled RL agents to rival human-level performance in rich visual benchmarks, it still requires millions of interactions with the environment. Much of this inefficiency stems from the limited ability of neural networks to assimilate and adapt quickly to rewarding experiences. In this talk, we will review some of the recent work which combines tabular reinforcement learning with the expressiveness of neural networks for data-efficient control. We will also discuss the implications of the resulting design choices from the perspective of model-based reinforcement learning.

Kenny Young (July 4, 2018)
Generalized Adversarial Value Functions: An idea for discovery of “interesting” world knowledge independent of any reward signal
Generalized Value Functions provide a framework for formulating questions about the world such that the answers can be learned with conventional temporal difference learning methods. The question remains of how to come up with questions that are interesting or useful to a reinforcement learning agent. In this talk I will discuss a possible approach to this by analogy to Generative Adversarial Networks. While GANs provide one possible answer to the question “what does it mean to produce images that appear similar to an underlying training set?”, GAVFs are suggested to provide a possible answer to the question “what does it mean for a question about the world to be interesting?”. Both methods work by training a predictive agent against an adversary which in turn is trained to generate instances which fool the predictor.

Han Wang (July 5, 2018)
Reduce the compounding error in multi-step prediction using Data As Demonstrator
In multi-step prediction, a learned simulator is applied iteratively. When the prediction for time step t is fed back into the simulator to make a prediction for time step t+1, errors accumulate, so the predictions drift gradually further from reality. While building a simulator of a water treatment system, we applied the Data As Demonstrator (DAD) algorithm to address this problem. The algorithm was originally presented in the paper "Improving Multi-Step Prediction of Learned Time Series Models" by A. Venkatraman, M. Hebert, and J. A. Bagnell (2015). I will introduce this algorithm and show how it improves the performance of the simulator in the water treatment project.

Somjit Nath (July 9, 2018)
Understanding Back-Propagation in Recurrent Neural Networks
The talk will cover various algorithms for training RNNs and how they can be improved and made faster.

Taher Jafferjee (July 11, 2018)
Challenges in Model-based Reinforcement Learning with Function Approximators… and our Solutions to Some of These Challenges
Modern deep Reinforcement Learning (RL) methods suffer from poor sample efficiency. One approach to improving sample efficiency of these methods is model-based RL, where environment models are used to speed up learning.
In this talk, I will describe some challenges we have encountered in developing a Prioritized Sweeping style algorithm for the function approximator setting. Specifically, I will illustrate how target networks impede the propagation of value information, and discuss the effects of random network initialisation. Finally, I will discuss our solutions to some of these obstacles and present some initial results.

Niko Yasui (July 12, 2018)
Problem drift in reinforcement learning control
Reinforcement learning (RL) problems related to action selection, or control, have gradually changed since the late 1950s. I discuss how Value Iteration, a precursor to deep Q-networks (DQN), has inspired algorithms that each solve one of these slightly different control problems. I argue that Value Iteration was developed for a problem so different from the current RL control setting that the performance of methods related to Value Iteration is inherently limited when those methods are applied to the current problem setting.

Tian Tian (July 16, 2018)
Exploring Expected Sarsa with Changing Target Policy

J. Fernando Hernandez (July 18, 2018)
The Deep Q(σ) Network: A Comparison of Different Deep Reinforcement Learning Algorithms
The original DQN algorithm has become ubiquitous in Deep Reinforcement Learning research. However, despite all the follow-up research that DQN has inspired, little has been done to adapt other algorithms, such as Sarsa or Expected Sarsa, to be used with this architecture. In this talk I will present a way to combine the action-value algorithm n-step Q(σ) with DQN. The resulting architecture has been named Deep Q(σ) Network or DQ(σ)N. Since n-step Q(σ) is capable of representing other algorithms such as Sarsa, Expected Sarsa, and more, the DQ(σ)N architecture provides a flexible framework to study the performance of all these algorithms when combined with the DQN architecture. I will present several experiments in the Mountain Car environment that compare the performance of DQ(σ)N for different parameter combinations. The results show that DQ(σ)N is a promising alternative to DQN.

Bo Liu (July 19, 2018)
A Block Coordinate Ascent Algorithm for Mean-Variance Optimization
Risk management in dynamic decision problems is a primary concern in many fields, including financial investment, autonomous driving, and healthcare. The mean-variance function is one of the most widely used objective functions in risk management due to its simplicity and interpretability. Existing algorithms for mean-variance optimization are based on multi-time-scale stochastic approximation, whose learning rate schedules are often hard to tune, and come with only asymptotic convergence proofs. In this talk, we develop a model-free policy search framework for mean-variance optimization with finite-sample error bound analysis (to local optima). Our starting point is a reformulation of the original mean-variance function with its Fenchel dual, from which we propose a stochastic block coordinate ascent policy search algorithm. Both the asymptotic convergence guarantee of the last iteration's solution and the convergence rate of the randomly picked solution are provided.

Sungsu Lim (July 23, 2018)
Policy Gradient Methods: Comparison of Policy Gradient Estimator methods
In continuous control problems, policy gradient methods have been very popular, and many new policy gradient methods have appeared in recent years. They employ diverse optimization techniques, but in essence they all require computing the gradient of the performance objective with respect to the policy parameters. It is important to understand the different types of gradient estimators, and in this talk we explain and compare two of them: log-likelihood estimators and pathwise derivative estimators.
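
The two estimator families can be illustrated on a toy problem. The sketch below is an assumption-laden example rather than material from the talk: it takes a one-dimensional Gaussian "policy" x ~ N(mu, sigma^2) and a hypothetical objective f(x) = x^2, for which the exact gradient of E[f(x)] with respect to mu is 2 * mu. The log-likelihood (score function) estimator multiplies f(x) by the gradient of the log density, while the pathwise estimator writes x = mu + sigma * eps and differentiates through f.

```python
import random


def objective(x):
    # Hypothetical objective chosen so the exact gradient is known.
    return x * x


def objective_derivative(x):
    return 2.0 * x


def gradient_estimates(mu, sigma, n_samples, rng):
    """Monte Carlo estimates of d/dmu E[f(x)] for x ~ N(mu, sigma^2)."""
    score_total = 0.0
    pathwise_total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        x = mu + sigma * eps
        # Log-likelihood estimator: f(x) * d/dmu log N(x; mu, sigma^2).
        score_total += objective(x) * (x - mu) / sigma ** 2
        # Pathwise estimator: d/dmu f(mu + sigma * eps) = f'(x).
        pathwise_total += objective_derivative(x)
    return score_total / n_samples, pathwise_total / n_samples


rng = random.Random(0)
mu, sigma = 1.5, 1.0
exact_grad = 2.0 * mu
score_grad, pathwise_grad = gradient_estimates(mu, sigma, 100_000, rng)
print(exact_grad, score_grad, pathwise_grad)
```

Both estimators are unbiased here. The log-likelihood form only needs samples of f(x), whereas the pathwise form requires f to be differentiable and the sample to be expressible as a differentiable function of the parameters.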

Dylan Brenneis (July 25, 2018)
Introduction to an Automatically Levelling Prosthetic Wrist
In a flavour somewhat different but tangentially related to the field of machine learning, the problem of prosthesis control poses many opportunities for creative solutions. In particular, wrist movement is woefully lacking in commercial prosthetic systems due in large part to the difficulty in controlling the many degrees of freedom. In this presentation, I propose a novel wrist control method to allow more natural wrist movement, explore how it may be best implemented, and consider some applications for ML to be used in concert with this method.

Roshan Shariff (July 26, 2018)
Predicting Rewards at Every Time Scale
In reinforcement learning, future rewards are often discounted: we prefer rewards we receive immediately rather than those far in the future. The rate of discounting imposes a "time scale" on our reward valuation and is incorporated into the learned value functions. In this talk, I discuss how learning value functions with several different discount factors allows us to reason about the detailed temporal structure of future rewards.
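
As a small illustration of why several discount factors carry temporal information, the sketch below computes the discounted return of one hypothetical reward sequence under a few assumed values of gamma. It is not the method from the talk, only a worked example of the standard discounted sum.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated backwards from the last reward."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical sequence: a single unit reward arriving three steps ahead.
rewards = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
gammas = [0.0, 0.5, 0.9]  # assumed discount factors
returns = {gamma: discounted_return(rewards, gamma) for gamma in gammas}

for gamma, value in returns.items():
    print(gamma, value)
```

A unit reward arriving k steps ahead contributes gamma**k to the return, so a single value does not reveal when the reward arrives, but the way the value changes across discount factors does.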

Amir Samani (July 30, 2018)
Temporal-difference networks
Temporal-difference (TD) networks are a knowledge representation framework that allows the agent to relate knowledge to its own experience. In this talk, I discuss how TD networks separate the problem of prediction into questions and answers, followed by an explanation of their TD and extensive semantics.

Eric Graves (August 1, 2018)
Revisiting the Policy Gradient Theorem for Episodic Problems
In reinforcement learning control problems, the use of function approximation comes at a cost. The policy improvement theorem no longer holds, and changing the policy to improve the value of a state may not actually improve the performance of the policy due to generalization between states in the function approximator. However, the policy gradient theorem offers a solution in the form of an exact expression for the gradient of the performance objective with respect to the policy parameters. In this talk, I review the policy gradient theorem for the episodic setting, and highlight several interesting insights.

Dustin Morrill (August 2, 2018)
AI Safety Through Robust Planning
When it is infeasible, expensive, or dangerous to execute an experimental policy in a decision problem, how do we learn a competent policy offline from a model with limited fidelity? An optimal policy under such a model will overfit to the model’s imperfections, creating the potential for catastrophic mistakes when that policy is applied to the original task. One approach to this problem is to find policies that respect model uncertainty and risk tolerance. Chen and Bowling (2008) previously showed that a broad class of robustness measures, which includes Conditional Value at Risk (CVaR), can often be maximized efficiently by solving zero-sum imperfect information games. By utilizing game solving innovations developed for solving human-scale imperfect information games like poker, we can obtain robust policies from uncertain models that avoid potentially hazardous behavior in the original task. Results from a toy contextual bandit problem illustrate the benefits and flexibility of this approach.
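
For readers unfamiliar with Conditional Value at Risk, the following sketch shows one simple empirical estimate: the mean of the worst alpha fraction of sampled returns. It is only an illustration of the robustness measure, not the game-solving approach described in the talk, and the two sets of returns (`risky_returns`, `safe_returns`) are hypothetical.

```python
import math


def empirical_cvar(returns, alpha):
    """Mean of the worst ceil(alpha * n) returns (lower tail), for 0 < alpha <= 1."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)
    tail_size = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:tail_size]
    return sum(tail) / tail_size


# Hypothetical returns: one policy has a higher mean but a rare catastrophic outcome.
risky_returns = [10.0] * 9 + [-50.0]
safe_returns = [3.0] * 10

risky_mean = sum(risky_returns) / len(risky_returns)
safe_mean = sum(safe_returns) / len(safe_returns)
risky_cvar = empirical_cvar(risky_returns, 0.1)
safe_cvar = empirical_cvar(safe_returns, 0.1)
print(risky_mean, safe_mean, risky_cvar, safe_cvar)
```

With alpha = 1 the measure reduces to the ordinary mean, and smaller values of alpha place more emphasis on the worst outcomes, which is how a risk tolerance can be expressed.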

Alex Kearney (August 8, 2018)
Robot Knowledge: Strengths and Pitfalls
Predictive knowledge describes a collection of machine intelligence proposals which aim to create a theory of all world knowledge grounded in sensorimotor predictions. In this Tea Time Talk, we’ll explore what predictive knowledge is and why we should care about the underlying assumptions made by these proposals. Along the way we will discuss some of the unusual commitments of predictive knowledge, fleshing out how these both strengthen and weaken our attempts at building knowledgeable machines. Come for the cookies; stay for the epistemology.

Yi Wan (August 9, 2018)
Does my RL algorithm converge?
In this talk, I will introduce a graphical way to understand the convergence properties of some RL algorithms.

Matthew Schlegel (August 13, 2018)
Importance Weighted Experience Replay

Erik Talvitie (August 15, 2018)
Toward Object-Oriented Dynamics Models in Atari
The ability to reason about objects and their relationships seems like an important part of the ability to make predictions about the world in novel situations. That generalization may, in turn, be key for model-based reinforcement learning. In this talk I'll discuss some of the recent literature on learning object-based models, highlighting some general principles that emerge and some open problems. In particular I'll discuss some of the challenges posed by Atari games, where object identities and types are not known a priori, objects can appear and disappear, and interactions may be complex. I'll also present some very preliminary work attempting to address some of these challenges.

Vincent Liu (August 16, 2018)
Sparsity for control
Reinforcement learning is known to be unstable when a neural network is used as a function approximator [1]. Doing several updates on similar states may overwrite what the agent has learned in other parts of the environment. This is called the ‘catastrophic forgetting’ problem. In this talk, I will give some intuitions and preliminary work on why sparsity might be robust to forgetting and stabilize the training of control agents.

Kim Solez and Ishita Moghe (August 20, 2018)
Nonfictional Models of the World for Training Sentient AI and Truthful Promotion of Pathology
I am presenting the attached poster at the Digital Pathology and Artificial Intelligence meeting in NYC June 26-27, and Ishita Moghe and I are interested in talking with the tea-time group about fictional models of the world created for training of AI as an extension of corporate marketing messages, and about the models of the world that machines themselves create, which may seem sufficient unto themselves and not in need of further human input. A spectacular new event in mid-2018 is the appearance of contradictory publications by the same author at the same time, one with corporate input and one without, that speculate on whether or not physicians will be replaced by machines. Here are two videos with the two of us explaining the issues further: The worlds the machines themselves create are described in this article from yesterday's Guardian: "Google Translate was known for its humorous errors, but in 2016, the system started using a neural network developed by Google Brain, and its abilities improved exponentially. Rather than simply cross-referencing heaps of texts, the network builds its own model of the world, and the result is not a set of two-dimensional connections between words, but a map of the entire territory. In this new architecture, words are encoded by their distance from one another in a mesh of meaning – a mesh only a computer could comprehend."

Dornoosh Zonoobi (August 22, 2018)
Toward the tricorder: deep learning analysis of ultrasound images
One of the most fascinating and futuristic devices in Star Trek was the tricorder, a multifunction hand-held device used by doctors to help diagnose diseases and collect bodily information about a patient. This single device appeared to have all the answers to the mysteries of the human body. Now that AI and cloud computing are making medical imaging more accurate and intuitive, we are closer to this device than you might expect.

Sina Ghiassian (August 23, 2018)
Can coarse coding solve the catastrophic interference problem?
Reinforcement learning systems must use function approximation to solve complicated problems. Neural nets provide an effective architecture for nonlinear function approximation and have been used since the early days of reinforcement learning. Neural networks, however, suffer from the catastrophic interference problem and cannot learn online in a fully incremental fashion. Experience replay buffers have been used to work around the interference problem but the search for methods that can learn in a fully incremental manner continues. This talk introduces a new method, a simple combination of coarse coding and neural networks, that might be useful in solving the interference problem. Our method is capable of learning fast, in a fully incremental fashion.

Marlos Machado (August 27, 2018)
Count-Based Exploration with the Successor Representation
The problem of exploration in reinforcement learning is well-understood in the tabular case and many sample-efficient algorithms are known. Nevertheless, it is often unclear how the algorithms in the tabular setting can be extended to tasks with large state-spaces where generalization is required. Recent promising developments generally depend on problem-specific density models or handcrafted features. In this paper we introduce a simple approach for exploration that allows us to develop theoretically justified algorithms in the tabular case but that also gives us intuitions for new algorithms applicable to settings where function approximation is required. Our approach and its underlying theory are based on the substochastic successor representation, a concept we develop here. While the traditional successor representation is a representation that defines state generalization by the similarity of successor states, the substochastic successor representation is also able to implicitly count the number of times each state (or feature) has been observed. This extension connects two until now disjoint areas of research. We show in traditional tabular domains (RiverSwim and SixArms) that our algorithm empirically performs as well as other sample-efficient algorithms. We then describe a deep reinforcement learning algorithm inspired by these ideas and show that it matches the performance of recent pseudo-count-based methods in hard exploration Atari 2600 games.

Ehsan Imani (August 29, 2018)
Off-Policy Deterministic Policy Gradient

Shangtong Zhang (August 30, 2018)
Exploration with the Distributional Perspective of Reinforcement Learning
In this presentation, I will talk about how to improve exploration based on the distributional perspective of reinforcement learning. We propose two methods to make use of the distribution of the state-action value function. We use simple MDPs to explain the improved exploration and verify the scalability of our methods in both challenging video games (e.g., 49 Atari games) and physical robot simulators (e.g., 12 Roboschool tasks).

Eugene Chen (September 6, 2018)
Dancing, Colourful MDPs for Research & Fun

The 2018 tea time talks were coordinated by Niko Yasui (yasui AT ualberta DOT ca).