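I am learning OpenAI's gym (0.25.1) with Python 3.10, and the environment is set to 'FrozenLake-v1' (code below).
According to the documentation, calling env.step() should return a tuple of 4 values (observation, reward, done, info). However, when I run my code accordingly, I get a ValueError.
Problematic code: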
observation, reward, done, info = env.step(new_action)
Error:
3 new_action = env.action_space.sample()
----> 5 observation, reward, done, info = env.step(new_action)
7 # here's a look at what we get back
8 print(f"observation: {observation}, reward: {reward}, done: {done}, info: {info}")
ValueError: too many values to unpack (expected 4)
Adding one more variable fixes the error:
a, b, c, d, e = env.step(new_action)
print(a, b, c, d, e)
Output:
5 0 True True {'prob': 1.0}
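My interpretation: 5 should be the observation, 0 is the reward, {'prob': 1.0} is the info, and one True is done. So what does the other boolean represent?
Thank you for your help!
Complete code: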
import gym
env = gym.make('FrozenLake-v1', new_step_api=True, render_mode='ansi') # build environment
current_obs = env.reset() # start new episode
for e in env.render():
    print(e)
new_action = env.action_space.sample() # random action
observation, reward, done, info = env.step(new_action) # perform action, ValueError!
for e in env.render():
    print(e)
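You may want to create the environment with the new API: old code is only supported through a temporary wrapper, and at some point it may no longer be backward compatible. Switching to the new API has a small impact on your code (a single line, but do not simply write done = truncated).
Let's take a quick look at the change.
To use the new API, add the new_step_api=True option (note: with the latest API the new_step_api option is not needed), for example: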
env = gym.make('MountainCar-v0', new_step_api=True)
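This makes env.step() return five items instead of four. What is the extra one?
It was introduced to remove the ambiguity in the done signal. In the old API, done=True did not distinguish between the environment terminating and the episode being truncated. Previously, the TimeLimit wrapper worked around this by setting info['TimeLimit.truncated'] when a time limit was hit. None of that is needed any more; env.step() now returns:
obs, reward, terminated, truncated, info = env.step(action)
How this affects your code: if your game has some kind of max_steps or timeout, you should read the truncated variable in addition to the terminated variable to see whether the episode has ended. Depending on the kind of reward you get, you may need to adjust things slightly. The simplest option is to write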
done = truncated or terminated
and then keep reusing your old code.
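A minimal sketch of that adaptation, assuming you want one code path that tolerates both return shapes. The helper name unpack_step is hypothetical (it is not part of gym); for four-value results it relies on the info['TimeLimit.truncated'] key mentioned above.

```python
def unpack_step(result):
    """Normalize a step() result to (obs, reward, terminated, truncated, info).

    Hypothetical helper for illustration; it is not part of gym.
    """
    if len(result) == 5:
        # New API: the two flags are already separate.
        obs, reward, terminated, truncated, info = result
    elif len(result) == 4:
        # Old API: a single done flag; the TimeLimit wrapper marked
        # truncation in info['TimeLimit.truncated'].
        obs, reward, done, info = result
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = bool(done) and not truncated
    else:
        raise ValueError(f"expected 4 or 5 values from step(), got {len(result)}")
    return obs, reward, bool(terminated), bool(truncated), info


# The five values printed in the question.
obs, reward, terminated, truncated, info = unpack_step((5, 0, True, True, {"prob": 1.0}))
done = terminated or truncated  # the episode is over if either flag is set
print(obs, reward, done, info)
```

Keeping terminated and truncated separate until the last moment means the old done-based code keeps working, while the reason the episode ended is still available if you need it later.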
From the docstring in the code:
Returns:
    observation (object): this will be an element of the environment's :attr:`observation_space`. This may, for instance, be a numpy array containing the positions and velocities of certain objects.
    reward (float): The amount of reward returned as a result of taking the action.
    terminated (bool): whether a `terminal state` (as defined under the MDP of the task) is reached. In this case further step() calls could return undefined results.
    truncated (bool): whether a truncation condition outside the scope of the MDP is satisfied. Typically a timelimit, but could also be used to indicate agent physically going out of bounds. Can be used to end the episode prematurely before a `terminal state` is reached.
    info (dictionary): `info` contains auxiliary diagnostic information (helpful for debugging, learning, and logging). This might, for instance, contain: metrics that describe the agent's performance state, variables that are hidden from observations, or individual reward terms that are combined to produce the total reward. It also can contain information that distinguishes truncation and termination, however this is deprecated in favour of returning two booleans, and will be removed in a future version.
    (deprecated)
    done (bool): A boolean value for if the episode has ended, in which case further :meth:`step` calls will return undefined results. A done signal may be emitted for different reasons: Maybe the task underlying the environment was solved successfully, a certain timelimit was exceeded, or the physics simulation has entered an invalid state.
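To make the two flags concrete, here is a toy environment written in plain Python so it runs without gym. Everything in it (the class name CorridorEnv, the length and max_steps parameters, the run helper) is made up for illustration; it only mimics the five-value return described in the docstring. As a simplification, this sketch never sets both flags at once.

```python
class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, the goal is the last cell."""

    def __init__(self, length=4, max_steps=5):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # action: 1 = move right, anything else = move left (clamped to the corridor)
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        # Terminal state of the task itself: the goal cell was reached.
        terminated = self.position == self.length - 1
        # Condition outside the task: the step budget ran out first.
        truncated = (not terminated) and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {"steps": self.steps}


def run(env, policy):
    """Run one episode and report which flag ended it."""
    obs = env.reset()
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        if terminated or truncated:
            return "terminated" if terminated else "truncated"


print(run(CorridorEnv(), lambda obs: 1))  # always move right: reaches the goal
print(run(CorridorEnv(), lambda obs: 0))  # always move left: runs out of steps
```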
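The first boolean appears to be the terminated value, i.e. "whether a terminal state (as defined under the MDP of the task) is reached. In this case further step() calls could return undefined results."
The second appears to be the truncated value, i.e. whether the episode was cut short, for example by a time limit or by your agent going out of bounds. From the docstring:
"whether a truncation condition outside the scope of the MDP is satisfied. Typically a timelimit, but could also be used to indicate agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached."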
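One place where the distinction matters is value-based learning. The sketch below assumes a one-step TD target with a discount factor gamma; the function name td_target and the numbers are illustrative and not taken from gym. After termination there is no future return to bootstrap from, whereas after truncation the episode was merely cut off, so the estimated value of the next state still counts.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step TD target; bootstrapping is masked only on true termination."""
    bootstrap_mask = 0.0 if terminated else 1.0
    return reward + gamma * next_value * bootstrap_mask


# Cut off by a time limit (terminated=False): keep the bootstrap term.
print(td_target(reward=1.0, next_value=2.0, terminated=False))
# Reached a terminal state: the target is just the reward.
print(td_target(reward=1.0, next_value=2.0, terminated=True))
```

This is why collapsing both flags into a single done is fine for deciding when to call reset(), but can bias the learning target if done is also used as the bootstrap mask.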