greedy_gq module¶
Module containing the GreedyGQ algorithm. Also supports prioritized and uniform experience replay.
Authors: Shibhansh Dohare, Niko Yasui.
class greedy_gq.GreedyGQ(action_space, finished_episode, num_features, alpha, beta, lmbda, decay=False, **kwargs)[source]¶

An implementation of the GreedyGQ learning algorithm.

The GreedyGQ implementation is based on https://era.library.ualberta.ca/files/8s45q967t/Hamid_Maei_PhDThesis.pdf and the prioritized experience replay on https://arxiv.org/pdf/1511.05952.pdf. Some parameters are not updated while experience is being replayed (either uniform or prioritized). A construction sketch follows the attribute list below.
action_space¶
numpy array of action – Numpy array containing all actions available to any agent.

finished_episode¶
fun – Function that evaluates whether an episode has finished.

num_features¶
int – The number of features in the state-action representation.

alpha¶
float – Primary learning rate.

beta¶
float – Secondary learning rate.

lmbda¶
float – Trace decay rate.

Note¶
A copy of phi is created during the construction process.
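Given the signature and attributes above, a minimal construction sketch might look like the following. Everything here other than the constructor signature itself (the two-action space, the toy finished_episode test, and the feature count) is an illustrative assumption, not part of this module::

    import numpy as np

    import greedy_gq

    actions = np.array([0, 1])  # assumed two-action space

    def finished_episode(observation):
        # Hypothetical terminal test; a real agent would inspect its state.
        return False

    learner = greedy_gq.GreedyGQ(action_space=actions,
                                 finished_episode=finished_episode,
                                 num_features=8,   # assumed feature count
                                 alpha=0.1,
                                 beta=0.01,
                                 lmbda=0.9)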
predict(phi, action)[source]¶

Builds the state-action representation and multiplies it by theta.

Parameters:
- phi (numpy array of bool) – Boolean feature vector.
- action (action) – Action that was taken.
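Continuing the construction sketch above, a call might look like this; the feature vector contents and the action value are assumptions::

    phi = np.zeros(8, dtype=bool)  # assumed boolean feature vector
    phi[2] = True

    # Estimated value of taking action 0 from the state encoded by phi.
    q_value = learner.predict(phi, 0)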
td_error_prioritized_experience_replay(*args, **kwargs)[source]¶

Replays the worst experiences from memory. self.worst_experiences stores the last 100 experiences; the num_updates_to_make experiences with the highest TD error are chosen for replay.
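The selection rule described above can be sketched in isolation. This is not the module's code: the buffer layout (a deque of (experience, td_error) pairs) and the standalone helper are assumptions based only on the description::

    from collections import deque

    # Assumed buffer layout: (experience, td_error) pairs, newest 100 kept.
    worst_experiences = deque(maxlen=100)

    def select_prioritized(buffer, num_updates_to_make):
        # Rank by absolute TD error and keep the worst offenders.
        ranked = sorted(buffer, key=lambda pair: abs(pair[1]), reverse=True)
        return [experience for experience, _ in ranked[:num_updates_to_make]]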
uniform_experience_replay(*args, **kwargs)[source]¶

Replays num_updates_to_make experiences sampled from saved memory. self.worst_experiences stores the most recent 100 experiences.
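For contrast, a uniform selection sketch under the same assumed buffer layout (again, not the module's code)::

    import random

    def select_uniform(buffer, num_updates_to_make):
        # Sample uniformly at random, without replacement.
        k = min(num_updates_to_make, len(buffer))
        return [experience for experience, _ in random.sample(list(buffer), k)]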
update(phi, last_action, phi_prime, cumulant, gamma, rho, replaying_experience=False, **kwargs)[source]¶

Updates the parameters (weights) of the GreedyGQ learner. Some parameters are not updated while experience is being replayed (either uniform or prioritized).

Parameters:
- phi (numpy array of bool) – State at time t.
- last_action (action) – Action at time t.
- phi_prime (numpy array of bool) – State at time t+1.
- cumulant (float) – Cumulant at time t.
- gamma (float) – Discount factor at time t+1.
- rho (float) – Off-policy importance sampling ratio at time t.
- replaying_experience (bool) – True if replaying an experience, False if gathering a new experience from the environment.

Returns: Representation for the state-action pair at time t. Only used to calculate RUPEE.

Return type: self.action_phi (numpy array of bool)
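A sketch of one on-policy update step, continuing the example above; the transition values, the cumulant, the discount, and the importance ratio of 1.0 are all assumptions for illustration::

    phi = np.zeros(8, dtype=bool)
    phi[1] = True        # assumed state at time t
    phi_prime = np.zeros(8, dtype=bool)
    phi_prime[3] = True  # assumed state at time t+1

    action_phi = learner.update(phi=phi,
                                last_action=0,    # assumed action at time t
                                phi_prime=phi_prime,
                                cumulant=1.0,     # assumed cumulant (reward)
                                gamma=0.99,       # assumed discount
                                rho=1.0,          # on-policy: ratio of 1
                                replaying_experience=False)
    # The returned state-action representation is only needed for RUPEE.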