My agent keeps taking random actions, so the algorithm isn't training properly. How can I make sure it takes the best action, which is stored in the line "next_action, ArgMax = custom_argmax(Q_value)"? The function custom_argmax finds the maximum Q-value over the actions available in a given state.
import random
import numpy as np

max_episodes = 10
max_steps_per_episode = 1000
discount_rate = 0.99
exploration_rate = 0.5
max_exploration_rate = 1
min_exploration_rate = 0.1
learning_rate = 0.01
explore_decay_rate = 0.2

errors = []

def play_single_game(max_steps_per_episode, render):
    global errors
    state = env.reset()
    action = env.action_space.sample()
    for step in range(max_steps_per_episode - 1):
        if render:
            env.render()
        # take the current action and observe the outcome
        new_state, reward, done, info = env.step(action)
        next_state = new_state
        old_weights = weights.theta.copy()
        if done:
            # terminal update: no bootstrapped next-state value
            weights.theta += learning_rate * (reward - weights_multiplied_by_features(state, action)) * feature_space(state, action)
            break
        else:
            Q_value = associated_Q_value(next_state)
            exploration_rate_threshold = random.uniform(0, 1)
            next_action, ArgMax = custom_argmax(Q_value)  # greedy (best) action and its Q-value
            if exploration_rate_threshold < exploration_rate:  # explore: take a random action
                next_action = random.randint(0, len(LEGAL_MOVES) - 1)
            # update Q(s, a) as we experience the episode
            weights.theta += learning_rate * (reward + discount_rate * ArgMax - weights_multiplied_by_features(state, action)) * feature_space(state, action)
            # the next state/action become the current state/action
            state = next_state
            action = next_action
        change_in_weights = np.abs(weights.theta - old_weights).sum()
        errors.append(change_in_weights)
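For reference, here is a minimal sketch of what a custom_argmax along these lines could look like, assuming Q_value is a 1-D array of Q-values indexed by action (the question does not show its actual implementation, so this is an assumption):

    import numpy as np

    def custom_argmax(q_values):
        # Return (best_action, max_q): the index of the largest Q-value
        # and the Q-value itself, for a 1-D array indexed by action.
        best_action = int(np.argmax(q_values))
        return best_action, q_values[best_action]

    # Example: action 1 has the highest Q-value
    best_action, max_q = custom_argmax(np.array([0.1, 0.7, 0.3]))
    # best_action == 1, max_q == 0.7

Note that np.argmax breaks ties by returning the first maximal index; if that biases early training, a random tie-break among maximal actions is a common variant.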
You are doing ε-greedy exploration. You have set exploration_rate = 0.5, so your agent will take a random action 50% of the time. That is probably too high, but it does not mean your agent isn't learning.

If you want to evaluate the agent properly, you must run episodes with exploration disabled. You can't simply disable random actions during training, because then the agent may never try different actions; this is the exploration/exploitation trade-off. However, you can slowly dial down exploration as the agent learns, e.g. with exploration_rate *= 0.999 or similar in your loop.
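The decay suggested above can be sketched like this, with the multiplicative decay clamped at the min_exploration_rate floor from the question's hyperparameters (the call to play_single_game is assumed, not shown):

    exploration_rate = 0.5
    min_exploration_rate = 0.1

    for episode in range(10):
        # ... run play_single_game(max_steps_per_episode, render=False) here ...
        exploration_rate = max(min_exploration_rate, exploration_rate * 0.999)

For evaluation runs, set exploration_rate to 0 so the agent always takes the greedy action from custom_argmax.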