Stable Baselines3 throws ValueError when an episode is truncated


So I'm trying to train an agent in my custom gymnasium environment with stable_baselines3, and it seems to always crash at random, throwing the following ValueError:

Traceback (most recent call last):
  File "C:\Users\bo112\PycharmProjects\ecocharge\code\Simulation Env\prototype_visu.py", line 684, in <module>
    model.learn(total_timesteps=time_steps, tb_log_name=log_name)
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\ppo\ppo.py", line 315, in learn
    return super().learn(
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 277, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 218, in collect_rollouts
    terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\policies.py", line 256, in obs_to_tensor
    vectorized_env = vectorized_env or is_vectorized_observation(obs_, obs_space)
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\utils.py", line 399, in is_vectorized_observation
    return is_vec_obs_func(observation, observation_space)  # type: ignore[operator]
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\utils.py", line 266, in is_vectorized_box_observation
    raise ValueError(
ValueError: Error: Unexpected observation shape () for Box environment, please use (1,) or (n_env, 1) for the observation shape.

I don't know why the observation shape/content would change, since nothing about how the state gets its values changes at all.

I figured out that it crashes whenever the agent "survives" a whole episode for the first time, i.e. when truncated is returned instead of terminated. Is there some weird quirk about returning truncated vs. terminated that I'm not aware of? Because I can't find the error in my step function:

    def step(self, action):

        ...  # handling the action etc.

        reward = 0
        truncated = False
        terminated = False
        # Check if time is over/score too low - else reward function
        if self.n_step >= self.max_steps:
            truncated = True
            print('truncated')
        elif self.score < -1000:
            terminated = True
            # print('terminated')
        else:
            reward = self.reward_fnc_distance()

        self.score += reward
        self.d_score.append(self.score)
        self.n_step += 1

        # state: [current power, peak power, fridge 1 temp, fridge 2 temp, [...] , fridge n temp]
        self.state['current_power'] = self.d_power_sum[-1]
        self.state['peak_power'] = self.peak_power
        for i in range(self.n_fridges):
            self.state[f'fridge{i}_temp'] = self.d_fridges_temp[i][-1]
            self.state[f'fridge{i}_on'] = self.fridges[i].on

        if self.logging:
            print(f'score: {self.score}')

        if (truncated or terminated) and self.logging:
            self.save_run()

        return self.state, reward, terminated, truncated, {}

This is the general setup for training the model:

hidden_layer = [64, 64, 32]
time_steps = 1000_000
learning_rate = 0.003
log_name = f'PPO_{int(time_steps/1000)}k_lr{str(learning_rate).replace(".", "_")}'
vec_env = make_vec_env(env_id=ChargeEnv, n_envs=4)
model = PPO('MultiInputPolicy', vec_env, verbose=1, tensorboard_log='tensorboard_logs/',
            policy_kwargs={'net_arch': hidden_layer, 'activation_fn': th.nn.ReLU}, learning_rate=learning_rate,
            device=th.device("cuda" if th.cuda.is_available() else "cpu"), batch_size=128)
model.learn(total_timesteps=time_steps, tb_log_name=log_name)
model.save(f'models/{log_name}')
vec_env.close()

As mentioned above, the ValueError is only thrown once an episode gets truncated, and vice versa, so I'm fairly sure that has to be the cause.
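
As far as I can tell from the traceback, SB3 only passes info["terminal_observation"] through obs_to_tensor when an episode is truncated (to bootstrap the value of the final state), which would explain why termination never triggered it. Below is a minimal sketch of the shape check that fails, using the same is_vectorized_observation helper that appears in the traceback and a (1,)-shaped Box entry like the ones in my observation space:

import numpy as np
from gymnasium import spaces
from stable_baselines3.common.utils import is_vectorized_observation

box = spaces.Box(low=0.0, high=100.0, shape=(1,), dtype=np.float32)

# A 1-element array matches the declared (1,) shape -> accepted as a single observation
print(is_vectorized_observation(np.array([42.0], dtype=np.float32), box))  # False

# A bare float becomes a 0-d array with shape (), which matches neither (1,) nor (n_env, 1)
try:
    is_vectorized_observation(np.array(42.0), box)
except ValueError as err:
    print(err)  # "Unexpected observation shape () for Box environment, ..."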


EDIT:

Following the answer below, I found that the fix is simply to wrap all float/Box values of self.state in numpy arrays before returning them, like this:

self.state['current_power'] = np.array([self.d_power_sum[-1]], dtype='float32')
self.state['peak_power'] = np.array([self.peak_power], dtype='float32')
for i in range(self.n_fridges):
    self.state[f'fridge{i}_temp'] = np.array([self.d_fridges_temp[i][-1]], dtype='float32')
    self.state[f'fridge{i}_on'] = self.fridges[i].on

(Note: the dtype specification is not strictly required for the fix itself, but it matters when using stable_baselines3 with SubprocVecEnv.)
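
For illustration, a minimal sketch (hypothetical, adapted from the training setup above) of switching to SubprocVecEnv, which is where the consistent float32 dtype becomes important; the subprocess-based vec env also needs the __main__ guard on Windows:

import torch as th
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
# import ChargeEnv from wherever it is defined

if __name__ == '__main__':  # required for process spawning on Windows
    # Same idea as the setup above, but each ChargeEnv instance runs in its own process
    vec_env = make_vec_env(env_id=ChargeEnv, n_envs=4, vec_env_cls=SubprocVecEnv)
    model = PPO('MultiInputPolicy', vec_env, verbose=1,
                device=th.device("cuda" if th.cuda.is_available() else "cpu"))
    model.learn(total_timesteps=100_000)
    vec_env.close()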

python reinforcement-learning openai-gym stable-baselines
1 Answer (2 votes)

The problem most likely lies in your custom environment definition (ChargeEnv). The error says that its observation has the wrong shape (it is empty). You should check your ChargeEnv.observation_space.

If you want to create a custom environment, make sure to read the documentation on how to set it up properly (https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/#declaration-and-initialization and https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).
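
One way to catch this kind of mismatch early is stable_baselines3's built-in environment checker, which verifies that the observations returned by reset() and step() actually fit the declared spaces. A minimal sketch, assuming your full ChargeEnv (with both observation_space and action_space defined) is importable:

from stable_baselines3.common.env_checker import check_env

env = ChargeEnv()          # your actual environment
check_env(env, warn=True)  # complains if reset()/step() observations don't fit the declared spaces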

Here is an example implementation of ChargeEnv where the observation space is defined correctly:

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class ChargeEnv(gym.Env):
    def __init__(self, n_fridges=2):
        super().__init__()

        # Define observation space
        observation_space_dict = {
            'current_power': spaces.Box(low=0, high=100, shape=(1,), dtype=np.float32),
            'peak_power': spaces.Box(low=0, high=100, shape=(1,), dtype=np.float32)
        }

        for i in range(n_fridges):
            observation_space_dict[f'fridge{i}_temp'] = spaces.Box(low=-10, high=50, shape=(1,), dtype=np.float32)
            observation_space_dict[f'fridge{i}_on'] = spaces.Discrete(2)  # 0 or 1 (off or on)

        self.observation_space = spaces.Dict(observation_space_dict)

        # An action_space must also be defined for stable_baselines3,
        # e.g. spaces.Discrete(...) or spaces.MultiDiscrete(...), depending on your controls

        # Other environment-specific variables
        self.n_fridges = n_fridges
        # Initialize other variables as needed

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Reset environment to initial state
        # Initialize state variables, e.g., current_power, peak_power, fridge temperatures, etc.
        # Return the initial observation and an info dict (gymnasium API)
        initial_observation = {
            'current_power': np.array([50.0], dtype=np.float32),
            'peak_power': np.array([100.0], dtype=np.float32)
        }
        for i in range(self.n_fridges):
            initial_observation[f'fridge{i}_temp'] = np.array([25.0], dtype=np.float32)  # Example initial temperature
            initial_observation[f'fridge{i}_on'] = 0  # Example: Fridge initially off

        return initial_observation, {}

    def step(self, action):
        # Implement step logic (similar to your existing step function)
        # Update state variables, compute rewards, check termination conditions, etc.
        # Return observation, reward, terminated, truncated, and an info dict
        ...
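
And a quick sanity check (a sketch based on the example above) that such an environment returns observations which actually lie inside the declared space, which is exactly the property that bare Python floats violate:

env = ChargeEnv(n_fridges=2)
obs, info = env.reset()

# Every entry must have the shape and dtype declared in observation_space;
# with bare floats instead of 1-element arrays this assertion fails
assert env.observation_space.contains(obs)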