OpenAI Gym's env.step(): what are the values?


I'm learning OpenAI's Gym (0.25.1) using Python 3.10, with the Gym environment set to 'FrozenLake-v1' (code below).

According to the documentation, calling env.step() should return a tuple of 4 values (observation, reward, done, info). However, when I run my code accordingly, I get a ValueError:

Problematic code:

observation, reward, done, info = env.step(new_action)

Error:

      3 new_action = env.action_space.sample()
----> 5 observation, reward, done, info = env.step(new_action)
      7 # here's a look at what we get back
      8 print(f"observation: {observation}, reward: {reward}, done: {done}, info: {info}")

ValueError: too many values to unpack (expected 4)

Adding one more variable fixes the error:

a, b, c, d, e = env.step(new_action)
print(a, b, c, d, e)

Output:

5 0 True True {'prob': 1.0}

My interpretation:

  • 5 should be the observation
  • 0 is the reward
  • prob: 1.0 is the info
  • one of the True values is done

So what does the remaining boolean represent?

Thanks for your help!


Full code:

import gym

env = gym.make('FrozenLake-v1', new_step_api=True, render_mode='ansi') # build environment

current_obs = env.reset() # start new episode

for e in env.render():
    print(e)
    
new_action = env.action_space.sample() # random action

observation, reward, done, info = env.step(new_action) # perform action, ValueError!

for e in env.render():
    print(e)
2 answers
12 votes

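You may want to consider creating your environment with the new API: support for old code is only provided through a temporary wrapper, and one day it may no longer be backward compatible. Switching to the new API has a small effect on your code (one line; don't simply do: done = truncated).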

Let's take a quick look at the changes.

To use the new API, add the new_step_api=True option (note: in the latest versions the new API is the default and the new_step_api option is no longer needed), e.g.

env = gym.make('MountainCar-v0', new_step_api=True)

This makes env.step() return five items instead of four. What is the extra one?

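  • In the old API, done was returned as True if the episode ended in any way.
  • In the new API, done is split into 2 parts:
      • terminated=True if the environment terminated (e.g. because the task was completed or failed).
      • truncated=True if the episode was truncated because of a time limit or some other reason that is not defined as part of the task MDP.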
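To make the two flags concrete, here is a minimal sketch using a hypothetical toy environment (CorridorEnv and run are made up for illustration and are not part of Gym) whose step() returns the same five values. Reaching the goal is a task-defined ending, so it sets terminated; running out of steps is an external cutoff, so it sets truncated.

```python
from typing import Any, Dict, Tuple


class CorridorEnv:
    """Hypothetical 1-D corridor, used only to illustrate the 5-value step().

    The agent starts at position 0 and moves right (action 1) or left
    (action 0). Reaching `goal` is a terminal state of the task (terminated);
    using up `max_steps` is a time limit outside the task (truncated).
    """

    def __init__(self, goal: int = 3, max_steps: int = 5) -> None:
        self.goal = goal
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self) -> Tuple[int, Dict[str, Any]]:
        self.pos = 0
        self.steps = 0
        return self.pos, {}

    def step(self, action: int) -> Tuple[int, float, bool, bool, Dict[str, Any]]:
        # Move one cell; the left wall at position 0 cannot be crossed.
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        self.steps += 1
        terminated = self.pos == self.goal          # task-defined end state
        truncated = self.steps >= self.max_steps    # external time limit
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {"steps": self.steps}


def run(env: CorridorEnv, action: int) -> Tuple[bool, bool, int]:
    """Repeat one action until the episode ends; report why it ended."""
    env.reset()
    while True:
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            return terminated, truncated, info["steps"]


print(run(CorridorEnv(), action=1))  # -> (True, False, 3): goal reached
print(run(CorridorEnv(), action=0))  # -> (False, True, 5): step limit hit
```

In this sketch the two flags are computed independently, so both could be True on the same step if the goal were reached exactly at the step limit.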

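This was done to remove the ambiguity in the done signal. In the old API, done=True did not distinguish between the environment terminating and the episode being truncated. Previously, this was worked around by having the TimeLimit wrapper set info['TimeLimit.truncated'] when the time limit was reached. None of that is needed anymore; env.step() now returns: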

obs, reward, terminated, truncated, info = env.step(action)
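If code has to run against both the old 4-value and the new 5-value step(), a small normalizing helper can bridge them. This is a hand-written sketch (to_new_step_api is a hypothetical name, not a Gym function); for old-style results it assumes the only source of truncation information is the info['TimeLimit.truncated'] flag described above.

```python
from typing import Any, Dict, Tuple


def to_new_step_api(result: tuple) -> Tuple[Any, float, bool, bool, Dict[str, Any]]:
    """Normalize a step() result to (obs, reward, terminated, truncated, info).

    A 5-tuple is assumed to already follow the new API and is returned as-is.
    A 4-tuple is treated as the old (obs, reward, done, info) form; truncation
    is recovered from the 'TimeLimit.truncated' key if it is present.
    """
    if len(result) == 5:
        return tuple(result)
    if len(result) == 4:
        obs, reward, done, info = result
        truncated = bool(info.get("TimeLimit.truncated", False))
        # A time-limit cutoff is not a real terminal state.
        terminated = bool(done) and not truncated
        return obs, reward, terminated, truncated, info
    raise ValueError(f"expected 4 or 5 values from step(), got {len(result)}")
```

With it, obs, reward, terminated, truncated, info = to_new_step_api(env.step(action)) unpacks cleanly under either API.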

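How this affects your code: if your game has some kind of max_steps or timeout, you should read the truncated variable in addition to the terminated variable to see whether the game has ended. Depending on the kind of reward you get, you may need to adjust things slightly. The simplest option is to do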

done = truncated or terminated 

and then keep reusing your old code.
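A common reason the split matters beyond ending the loop is the one-step TD target used in value-based learning: only a real terminal state should stop bootstrapping from the next state's value, while a time-limit cutoff should not. The sketch below illustrates that idea; it is not something Gym does for you, td_target is a hypothetical helper, and next_value stands in for whatever value estimate your agent uses.

```python
def td_target(reward: float, next_value: float, terminated: bool,
              gamma: float = 0.99) -> float:
    """One-step TD target r + gamma * V(s'), cut off only at true terminal states.

    `truncated` is deliberately not a parameter: a time-limit cutoff happens
    outside the MDP, so the next state still has a meaningful value.
    """
    bootstrap = 0.0 if terminated else 1.0
    return reward + gamma * next_value * bootstrap


# The same transition, ended for two different reasons.
print(td_target(1.0, 10.0, terminated=True, gamma=0.5))   # -> 1.0 (goal reached)
print(td_target(1.0, 10.0, terminated=False, gamma=0.5))  # -> 6.0 (time limit hit)
```

Using done = terminated or truncated in place of terminated here would wrongly treat every time-limit cutoff as a terminal state.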


2 votes

From the code's docstring:

       Returns:
           observation (object): this will be an element of the environment's :attr:`observation_space`.
               This may, for instance, be a numpy array containing the positions and velocities of certain objects.
           reward (float): The amount of reward returned as a result of taking the action.
           terminated (bool): whether a `terminal state` (as defined under the MDP of the task) is reached.
               In this case further step() calls could return undefined results.
           truncated (bool): whether a truncation condition outside the scope of the MDP is satisfied.
               Typically a timelimit, but could also be used to indicate agent physically going out of bounds.
               Can be used to end the episode prematurely before a `terminal state` is reached.
           info (dictionary): `info` contains auxiliary diagnostic information (helpful for debugging, learning, and logging).
               This might, for instance, contain: metrics that describe the agent's performance state, variables that are
               hidden from observations, or individual reward terms that are combined to produce the total reward.
               It also can contain information that distinguishes truncation and termination, however this is deprecated in favour
               of returning two booleans, and will be removed in a future version.
           (deprecated)
           done (bool): A boolean value for if the episode has ended, in which case further :meth:`step` calls will return undefined results.
               A done signal may be emitted for different reasons: Maybe the task underlying the environment was solved successfully,
               a certain timelimit was exceeded, or the physics simulation has entered an invalid state.

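The first boolean appears to be the terminated value, i.e. "whether a terminal state (as defined under the MDP of the task) is reached. In this case further step() calls could return undefined results."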

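The second appears to be the truncated value, i.e. whether the episode was cut short, for example because your agent went out of bounds. From the docstring:

"Whether a truncation condition outside the scope of the MDP is satisfied. Typically a timelimit, but could also be used to indicate agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached."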
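Applied to the output printed in the question, the docstring's field order gives the reading below. StepResult is a hypothetical NamedTuple introduced only to label the positions; it is not a Gym type.

```python
from typing import Any, Dict, NamedTuple


class StepResult(NamedTuple):
    """Labels for step()'s five values, in the order given by the docstring."""
    observation: Any
    reward: float
    terminated: bool
    truncated: bool
    info: Dict[str, Any]


# The five values printed in the question.
step_output = (5, 0, True, True, {'prob': 1.0})
result = StepResult(*step_output)
print(result)
# -> StepResult(observation=5, reward=0, terminated=True, truncated=True, info={'prob': 1.0})
```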
