greedy_gq module

Module containing the GreedyGQ algorithm.

Also supports prioritized and uniform experience replay.

Authors:
Shibhansh Dohare, Niko Yasui.
class greedy_gq.GreedyGQ(action_space, finished_episode, num_features, alpha, beta, lmbda, decay=False, **kwargs)[source]

An implementation of the GreedyGQ learning algorithm.

The GreedyGQ implementation is based on https://era.library.ualberta.ca/files/8s45q967t/Hamid_Maei_PhDThesis.pdf and the prioritized experience replay on https://arxiv.org/pdf/1511.05952.pdf. Some parameters are not updated when we are replaying experience (either uniform or prioritized).
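For orientation, below is a minimal sketch of a GreedyGQ-style update without eligibility traces (the lambda = 0 case), written against the parameter names used by update below. It illustrates the gradient-TD update from the references, not this class's actual code; the real implementation also maintains eligibility traces governed by lmbda:

    import numpy as np

    def greedy_gq_step(theta, w, phi, phi_bar, cumulant, gamma, rho, alpha, beta):
        """One gradient-TD (GQ) update with lambda = 0; a sketch, not this module's code.

        phi     -- state-action features at time t
        phi_bar -- features of the greedy action in the state at time t+1
        """
        # TD error toward the greedy (target-policy) value in the next state.
        delta = cumulant + gamma * theta.dot(phi_bar) - theta.dot(phi)

        # With lambda = 0 the eligibility trace is just the importance-weighted features.
        e = rho * phi

        # Primary weights: the second term is the gradient correction that keeps
        # the update stable under off-policy sampling.
        theta = theta + alpha * (delta * e - gamma * w.dot(e) * phi_bar)

        # Secondary weights track the expected TD error given the features.
        w = w + beta * (delta * e - w.dot(phi) * phi)

        return theta, w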

action_space

numpy array of action – Numpy array containing all actions available to any agent.

finished_episode

fun – Function that evaluates whether an episode has finished.

num_features

int – The number of features in the state-action representation.

alpha

float – Primary learning rate.

beta

float – Secondary learning rate.

lmbda

float – Trace decay rate.

Note

A copy of phi is created during the construction process.
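A hypothetical construction call, with placeholder values for the environment-specific arguments, could look like this:

    import numpy as np
    from greedy_gq import GreedyGQ

    # Placeholder action space and episode-termination check, purely illustrative.
    action_space = np.array([0, 1, 2])
    finished_episode = lambda observation: False

    learner = GreedyGQ(action_space=action_space,
                       finished_episode=finished_episode,
                       num_features=1024,
                       alpha=0.1,
                       beta=0.01,
                       lmbda=0.9)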

predict(phi, action)[source]

Builds the state-action representation and multiplies it by theta.

Parameters:
  • phi (numpy array of bool) – Boolean feature vector.
  • action (action) – Action that was taken.
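Continuing the construction example above, a prediction for a state-action pair might be requested as follows; treating the return value as the approximate action value is an assumption, since the docstring only says the state-action representation is multiplied by theta:

    # Boolean feature vector for the current state (illustrative indices).
    phi = np.zeros(learner.num_features, dtype=bool)
    phi[[3, 17, 42]] = True

    # Value estimate for taking action 1 in this state (assumed return value).
    q_value = learner.predict(phi, 1)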
td_error_prioritized_experience_replay(*args, **kwargs)[source]

Replays the worst experiences from memory.

self.worst_experiences stores the last 100 experiences. The num_updates_to_make experiences with the highest TD error are chosen for replay.
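A minimal sketch of the selection step described above, assuming each stored entry pairs a TD error with the experience it came from (the real storage format may differ):

    def choose_prioritized(worst_experiences, num_updates_to_make):
        """Return the stored experiences with the largest absolute TD error."""
        recent = worst_experiences[-100:]  # keep only the last 100 experiences
        ranked = sorted(recent, key=lambda entry: abs(entry[0]), reverse=True)
        return ranked[:num_updates_to_make]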

uniform_experience_replay(*args, **kwargs)[source]

Replays experiences from saved memory.

Replays num_updates_to_make experiences sampled uniformly from self.worst_experiences, which stores the most recent 100 experiences.
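By contrast, the uniform variant can be sketched as drawing from the same buffer without weighting by TD error; again the storage format is an assumption:

    import random

    def choose_uniform(worst_experiences, num_updates_to_make):
        """Return experiences sampled uniformly at random from the buffer."""
        recent = worst_experiences[-100:]  # most recent 100 experiences
        return random.sample(recent, min(num_updates_to_make, len(recent)))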

update(phi, last_action, phi_prime, cumulant, gamma, rho, replaying_experience=False, **kwargs)[source]

Updates the parameters (weights) of the greedy_gq learner.

Some parameters are not updated when replaying experience (either uniform or prioritized).

Parameters:
  • phi (numpy array of bool) – State at time t.
  • last_action (action) – Action at time t.
  • phi_prime (numpy array of bool) – State at time t+1.
  • cumulant (float) – Cumulant at time t.
  • gamma (float) – Discounting factor at time t+1.
  • rho (float) – Off-policy importance sampling ratio at time t.
  • replaying_experience (bool) – True if replaying an experience, false if gathering a new experience from the environment.
Returns:

Representation for the state-action pair at time t. Only used to calculate RUPEE.

Return type:

self.action_phi (numpy array of bool)
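Continuing the construction example above, a single on-line learning step might look like the following; the transition data here is dummy data standing in for what an agent would obtain from its environment and behaviour policy:

    import numpy as np

    # Dummy transition (s_t, a_t, s_{t+1}) for illustration.
    phi = np.zeros(learner.num_features, dtype=bool)
    phi[[0, 7]] = True
    phi_prime = np.zeros(learner.num_features, dtype=bool)
    phi_prime[[2, 9]] = True

    action_phi = learner.update(phi=phi,
                                last_action=1,
                                phi_prime=phi_prime,
                                cumulant=1.0,
                                gamma=0.99,
                                rho=1.0,  # on-policy behaviour, so the ratio is 1
                                replaying_experience=False)
    # action_phi is the state-action representation at time t, used for RUPEE.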