强化学习 - VPG:标量变量索引错误的无效索引

问题描述 投票:0回答:1

我正在尝试运行vanilla策略梯度算法并渲染Open AI环境"CartPole-v1"

下面给出了算法的代码,运行良好,没有任何错误。这个代码的Jupyer笔记本可以找到here

en%pylab inline
import tensorflow as tf
import tensorflow.keras.backend as K
import numpy as np
import gym
from tqdm import trange
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.layers import *

env = gym.make("CartPole-v1")
env.observation_space, env.action_space

x = in1 = Input(env.observation_space.shape)
x = Dense(32)(x)
x = Activation('tanh')(x)
x = Dense(env.action_space.n)(x)
x = Lambda(lambda x: tf.nn.log_softmax(x, axis=-1))(x)
m = Model(in1, x)
def loss(y_true, y_pred):
  # y_pred is the log probs of the actions
  # y_true is the action mask weighted by sum of rewards
  return -tf.reduce_sum(y_true*y_pred, axis=-1)
m.compile(Adam(1e-2), loss)
m.summary()
lll = []

# this is like 5x faster than calling m.predict and picking in numpy
pf = K.function(m.layers[0].input, tf.random.categorical(m.layers[-1].output, 1)[0])

tt = trange(40)
for epoch in tt:
  X,Y = [], []
  ll = []
  while len(X) < 8192:
    obs = env.reset()
    acts, rews = [], []
    while True:
      # pick action
      #act_dist = np.exp(m.predict_on_batch(obs[None])[0])
      #act = np.random.choice(range(env.action_space.n), p=act_dist)

      # pick action (fast!)
      act = pf(obs[None])[0]

      # save this state action pair
      X.append(np.copy(obs))
      acts.append(act)

      # take the action
      obs, rew, done, _ = env.step(act)
      rews.append(rew)

      if done:
        for i, act in enumerate(acts):
          act_mask = np.zeros((env.action_space.n))
          act_mask[act] = np.sum(rews[i:])
          Y.append(act_mask)
        ll.append(np.sum(rews))
        break

  loss = m.train_on_batch(np.array(X), np.array(Y))
  lll.append((np.mean(ll), loss))
  tt.set_description("ep_rew:%7.2f    loss:%7.2f" % lll[-1])
  tt.refresh()

plot([x[0] for x in lll], label="Mean Episode Reward")
plot([x[1] for x in lll], label="Epoch Loss")
plt.legend()

enter image description here

当我尝试渲染环境时,我得到一个IndexError:

obs = env.reset()
rews = []
while True:
  env.render()
  pred, act = [x[0] for x in pf(obs[None])]
  obs, rew, done, _ = env.step(np.argmax(pred))
  rews.append(rew)
  time.sleep(0.05)
  if done:
    break
print("ran %d steps, got %f reward" % (len(rews), np.sum(rews)))

in(.0)3 while True:4 env.render()----> 5 pred,act = [x [0] for x in pf(obs [None])] 6 obs,rew,done,_ = env.step(np.argmax(pred))7 rews.append(rew)

IndexError:标量变量的无效索引。

我读到当你试图索引numpy标量如numpy.int64numpy.float64时会发生这种情况,但是我不确定错误源于何处以及我应该如何解决这个问题。任何帮助或建议将不胜感激。

python tensorflow openai-gym
1个回答
2
投票

看起来你可能已经改变了pf的工作方式,但忘了更新渲染代码。

试试这个(我还没有测试过):

  act, = pf(obs[None])  # same as pf(obs[None])[0] but asserts shape
  obs, rew, done, _ = env.step(act)

这将在训练时随机选择动作 - 如果你想要贪婪动作,你需要更改一些东西。

© www.soinside.com 2019 - 2024. All rights reserved.