greedy_gq module¶
Module containing the GreedyGQ algorithm. Also supports prioritized and uniform experience replay.
Authors: Shibhansh Dohare, Niko Yasui.
class greedy_gq.GreedyGQ(action_space, finished_episode, num_features, alpha, beta, lmbda, decay=False, **kwargs)[source]¶

An implementation of the GreedyGQ learning algorithm.

The GreedyGQ implementation is based on https://era.library.ualberta.ca/files/8s45q967t/Hamid_Maei_PhDThesis.pdf and the prioritized experience replay on https://arxiv.org/pdf/1511.05952.pdf. Some parameters are not updated while experience is being replayed (either uniform or prioritized). A construction sketch follows the attribute list below.
action_space¶
numpy array of action – Numpy array containing all actions available to any agent.

finished_episode¶
fun – Function that evaluates whether an episode has finished.

num_features¶
int – The number of features in the state-action representation.

alpha¶
float – Primary learning rate.

beta¶
float – Secondary learning rate.

lmbda¶
float – Trace decay rate.

Note¶
A copy of phi is created during the construction process.
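Given the signature and attributes above, a minimal construction sketch might look like the following. Everything here other than the constructor signature itself (the two-action space, the toy finished_episode test, and the feature count) is an illustrative assumption, not part of this module::

    import numpy as np

    import greedy_gq

    actions = np.array([0, 1])  # assumed two-action space

    def finished_episode(observation):
        # Hypothetical terminal test; a real agent would inspect its state.
        return False

    learner = greedy_gq.GreedyGQ(action_space=actions,
                                 finished_episode=finished_episode,
                                 num_features=8,   # assumed feature count
                                 alpha=0.1,
                                 beta=0.01,
                                 lmbda=0.9)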
predict(phi, action)[source]¶

Builds the state-action representation and multiplies it by theta.

Parameters:
- phi (numpy array of bool) – Boolean feature vector.
- action (action) – Action that was taken.
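Continuing the construction sketch above, a call might look like this; the feature vector contents and the action value are assumptions::

    phi = np.zeros(8, dtype=bool)  # assumed boolean feature vector
    phi[2] = True

    # Estimated value of taking action 0 from the state encoded by phi.
    q_value = learner.predict(phi, 0)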
td_error_prioritized_experience_replay(*args, **kwargs)[source]¶

Replays the worst experiences from memory. self.worst_experiences stores the last 100 experiences; the num_updates_to_make experiences with the highest TD error are chosen for replay.
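The selection rule described above can be sketched in isolation. This is not the module's code: the buffer layout (a deque of (experience, td_error) pairs) and the standalone helper are assumptions based only on the description::

    from collections import deque

    # Assumed buffer layout: (experience, td_error) pairs, newest 100 kept.
    worst_experiences = deque(maxlen=100)

    def select_prioritized(buffer, num_updates_to_make):
        # Rank by absolute TD error and keep the worst offenders.
        ranked = sorted(buffer, key=lambda pair: abs(pair[1]), reverse=True)
        return [experience for experience, _ in ranked[:num_updates_to_make]]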
uniform_experience_replay(*args, **kwargs)[source]¶

Replays num_updates_to_make experiences sampled from saved memory. self.worst_experiences stores the most recent 100 experiences.
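For contrast, a uniform selection sketch under the same assumed buffer layout (again, not the module's code)::

    import random

    def select_uniform(buffer, num_updates_to_make):
        # Sample uniformly at random, without replacement.
        k = min(num_updates_to_make, len(buffer))
        return [experience for experience, _ in random.sample(list(buffer), k)]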
update(phi, last_action, phi_prime, cumulant, gamma, rho, replaying_experience=False, **kwargs)[source]¶

Updates the parameters (weights) of the GreedyGQ learner. Some parameters are not updated while experience is being replayed (either uniform or prioritized).

Parameters:
- phi (numpy array of bool) – State at time t.
- last_action (action) – Action at time t.
- phi_prime (numpy array of bool) – State at time t+1.
- cumulant (float) – Cumulant at time t.
- gamma (float) – Discount factor at time t+1.
- rho (float) – Off-policy importance sampling ratio at time t.
- replaying_experience (bool) – True if replaying an experience, False if gathering a new experience from the environment.

Returns: Representation for the state-action pair at time t. Only used to calculate RUPEE.

Return type: self.action_phi (numpy array of bool)
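A sketch of one on-policy update step, continuing the example above; the transition values, the cumulant, the discount, and the importance ratio of 1.0 are all assumptions for illustration::

    phi = np.zeros(8, dtype=bool)
    phi[1] = True        # assumed state at time t
    phi_prime = np.zeros(8, dtype=bool)
    phi_prime[3] = True  # assumed state at time t+1

    action_phi = learner.update(phi=phi,
                                last_action=0,    # assumed action at time t
                                phi_prime=phi_prime,
                                cumulant=1.0,     # assumed cumulant (reward)
                                gamma=0.99,       # assumed discount
                                rho=1.0,          # on-policy: ratio of 1
                                replaying_experience=False)
    # The returned state-action representation is only needed for RUPEE.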