I am working through this tutorial and came across the following code:
# Calculate score to determine when the environment has been solved
scores.append(time)
mean_score = np.mean(scores[-100:])
if episode % 50 == 0:
    print('Episode {}\tAverage length (last 100 episodes): {:.2f}'.format(
        episode, mean_score))
if mean_score > env.spec.reward_threshold:
    print("Solved after {} episodes! Running average is now {}. Last episode ran to {} time steps."
          .format(episode, mean_score, time))
    break
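However, this does not make much sense to me. How do you define when an "RL environment has been solved"? I am not even sure what that means. In classification, I suppose it would make sense to define it as the point where the loss is zero. In regression, perhaps when the total L2 loss drops below some value? Here it might make sense to define it as the point where the expected return (discounted reward) exceeds some value.
But here they seem to be counting the number of time steps? That does not make any sense to me.
Note that the original tutorial has this: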
def main(episodes):
    running_reward = 10
    for episode in range(episodes):
        state = env.reset()  # Reset environment and record the starting state
        done = False
        for time in range(1000):
            action = select_action(state)
            # Step through environment using chosen action
            state, reward, done, _ = env.step(action.data[0])
            # Save reward
            policy.reward_episode.append(reward)
            if done:
                break
        # Used to determine when the environment is solved.
        running_reward = (running_reward * 0.99) + (time * 0.01)
        update_policy()
        if episode % 50 == 0:
            print('Episode {}\tLast length: {:5d}\tAverage length: {:.2f}'.format(episode, time, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and the last episode runs to {} time steps!".format(running_reward, time))
            break
I am not sure whether this makes any more sense...
Is this just a quirk of this particular environment/task? In general, how does a task end?
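In the case of CartPole, the time used equals the reward of the episode. The longer you balance the pole, the higher the score, up to some maximum time value at which the episode is stopped.
So the environment is considered solved once the running average over the most recent episodes is close enough to that maximum time.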
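To make this concrete, here is a minimal sketch of the idea in plain Python, without Gym. The helper names `run_episode` and `first_solved_episode`, the synthetic list of episode lengths, and the threshold of 195.0 are all assumptions made for illustration (195.0 is, as far as I know, the value registered for CartPole-v0; other versions differ). It only shows why counting time steps works when every step pays a constant reward, and how a windowed mean like `np.mean(scores[-100:])` can act as a "solved" test.

```python
from collections import deque


def run_episode(steps_survived, reward_per_step=1.0):
    """Return (episode_length, total_reward) for a toy episode.

    Assumption: the environment pays a constant reward on every step
    the pole stays up, as CartPole does (+1 per step).
    """
    total_reward = 0.0
    for _ in range(steps_survived):
        total_reward += reward_per_step
    return steps_survived, total_reward


def first_solved_episode(episode_lengths, reward_threshold, window=100):
    """Index of the first episode at which the mean length of the last
    `window` episodes exceeds `reward_threshold`, or None if it never does."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` lengths
    for episode, length in enumerate(episode_lengths):
        recent.append(length)
        if sum(recent) / len(recent) > reward_threshold:
            return episode
    return None


# With +1 reward per step, the undiscounted return equals the episode length.
length, total_reward = run_episode(200)

# Hypothetical training curve: 150 short episodes, then 200 episodes at the cap.
lengths = [20] * 150 + [200] * 200
solved_at = first_solved_episode(lengths, reward_threshold=195.0)
print(length, total_reward, solved_at)
```

With these made-up numbers the check does not fire on the first good episode. It fires only after enough long episodes have pushed the short ones out of the 100-episode window, which is why a running average is used rather than the length of a single episode.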