Are the reward values in the MuZero pseudocode misaligned?

Question

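MuZero is a deep reinforcement learning technique that was just released, and I have been trying to implement it by looking at its pseudocode and this helpful tutorial on Medium.

However, I am confused about how rewards are handled during training in the pseudocode. It would be great if someone could verify that I am reading the code correctly and, if so, explain why this training algorithm works.

Here is the training function (from the pseudocode):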

def update_weights(optimizer: tf.train.Optimizer, network: Network, batch,
                   weight_decay: float):
  loss = 0
  for image, actions, targets in batch:
    # Initial step, from the real observation.
    value, reward, policy_logits, hidden_state = network.initial_inference(
        image)
    predictions = [(1.0, value, reward, policy_logits)]

    # Recurrent steps, from action and previous hidden state.
    for action in actions:
      value, reward, policy_logits, hidden_state = network.recurrent_inference(
          hidden_state, action)
      predictions.append((1.0 / len(actions), value, reward, policy_logits))

      hidden_state = tf.scale_gradient(hidden_state, 0.5)

    for prediction, target in zip(predictions, targets):
      gradient_scale, value, reward, policy_logits = prediction
      target_value, target_reward, target_policy = target

      l = (
          scalar_loss(value, target_value) +
          scalar_loss(reward, target_reward) +
          tf.nn.softmax_cross_entropy_with_logits(
              logits=policy_logits, labels=target_policy))

      loss += tf.scale_gradient(l, gradient_scale)

  for weights in network.get_weights():
    loss += weight_decay * tf.nn.l2_loss(weights)

  optimizer.minimize(loss)
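
I am interested in the reward part of the loss. Note that the loss gets all of its values from predictions. The first reward added to predictions comes from the network.initial_inference function. After that, len(actions) more rewards are added to predictions, all of which come from the network.recurrent_inference function.

According to the tutorial, the initial_inference and recurrent_inference functions are built from three different functions:

Prediction. Input: the internal game state. Output: the policy and the value (the predicted sum of the best possible future rewards).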
Dynamics
Representation

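The initial_inference function takes the external game state, converts it to an internal state using the representation function, and then applies the prediction function to that internal game state. It outputs the internal state, the policy, and the value.

The recurrent_inference function takes an internal game state and an action. It uses the dynamics function to obtain the new internal game state and the reward for taking that action from the old game state. It then applies the prediction function to the new internal game state to get the policy and value for that new internal state. So the final output is a new internal state, a reward, a policy, and a value.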
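
As a rough sketch of the composition just described (not the reference implementation), the two inference functions can be written in terms of representation, dynamics, and prediction. The three toy functions below are hypothetical stand-ins for the learned networks, and the 0.0 reward returned by initial_inference is an assumption made only for illustration, following the tutorial's reading.

```python
from typing import NamedTuple, Tuple


class NetworkOutput(NamedTuple):
    value: float
    reward: float
    policy_logits: Tuple[float, ...]
    hidden_state: Tuple[float, ...]


# Hypothetical toy stand-ins for the three learned functions.
def representation(observation):
    # External game state -> internal state.
    return tuple(float(x) for x in observation)


def dynamics(hidden_state, action):
    # (Internal state, action) -> (reward, next internal state).
    next_state = tuple(x + action for x in hidden_state)
    reward = float(action)
    return reward, next_state


def prediction(hidden_state):
    # Internal state -> (policy logits, value).
    policy_logits = (0.0, 0.0)
    value = sum(hidden_state)
    return policy_logits, value


def initial_inference(observation):
    hidden_state = representation(observation)
    policy_logits, value = prediction(hidden_state)
    # Assumption: no dynamics step has run yet, so the reward slot is a
    # placeholder 0.0.
    return NetworkOutput(value, 0.0, policy_logits, hidden_state)


def recurrent_inference(hidden_state, action):
    reward, next_state = dynamics(hidden_state, action)
    policy_logits, value = prediction(next_state)
    return NetworkOutput(value, reward, policy_logits, next_state)


root = initial_inference([1, 2])
step = recurrent_inference(root.hidden_state, action=1)
```

Here root carries only a placeholder reward, while step carries the reward that the dynamics function attributes to taking action 1 from the root state.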

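However, in the pseudocode, the initial_inference function also returns a reward.

My main question: what does that reward represent?

In the tutorial, they implicitly assume that the reward from the initial_inference function is 0. (See this image from the tutorial.) So is that what is going on? Is there actually no reward, so that initial_inference always returns 0 as the reward?

Let's assume that is the case.

Under this assumption, the first reward in the predictions list will be the 0 that initial_inference returns as its reward. Then, in the loss, that 0 will be compared against the reward in the first element of the targets list.

Here is how targets is created: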

  def make_target(self, state_index: int, num_unroll_steps: int, td_steps: int,
                  to_play: Player):
    # The value target is the discounted root value of the search tree N steps
    # into the future, plus the discounted sum of all rewards until then.
    targets = []
    for current_index in range(state_index, state_index + num_unroll_steps + 1):
      bootstrap_index = current_index + td_steps
      if bootstrap_index < len(self.root_values):
        value = self.root_values[bootstrap_index] * self.discount**td_steps
      else:
        value = 0

      for i, reward in enumerate(self.rewards[current_index:bootstrap_index]):
        value += reward * self.discount**i  # pytype: disable=unsupported-operands

      if current_index < len(self.root_values):
        targets.append((value, self.rewards[current_index],
                        self.child_visits[current_index]))
      else:
        # States past the end of games are treated as absorbing states.
        targets.append((0, 0, []))
    return targets
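
The targets returned by this function become the targets list in the update_weights function. So the reward in the first element of targets is self.rewards[current_index], with current_index equal to state_index. self.rewards is the list of all rewards received while playing the game. The only place it is modified is in this function, apply: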

  def apply(self, action: Action):
    reward = self.environment.step(action)
    self.rewards.append(reward)
    self.history.append(action)

The apply function is only called here:

# Each game is produced by starting at the initial board position, then
# repeatedly executing a Monte Carlo Tree Search to generate moves until the end
# of the game is reached.
def play_game(config: MuZeroConfig, network: Network) -> Game:
  game = config.new_game()

  while not game.terminal() and len(game.history) < config.max_moves:
    # At the root of the search tree we use the representation function to
    # obtain a hidden state given the current observation.
    root = Node(0)
    current_observation = game.make_image(-1)
    expand_node(root, game.to_play(), game.legal_actions(),
                network.initial_inference(current_observation))
    add_exploration_noise(config, root)

    # We then run a Monte Carlo Tree Search using only action sequences and the
    # model learned by the network.
    run_mcts(config, root, game.action_history(), network)
    action = select_action(config, len(game.history), root, network)
    game.apply(action)
    game.store_search_statistics(root)
  return game
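
To me, it looks like a reward is produced every time an action is taken. So the first reward in the self.rewards list should be the reward received after taking the first action in the game. The problem becomes apparent when current_index = 0 in self.rewards[current_index]. In that case, the first reward in the predictions list will be 0, because it always is. However, the targets list will contain the reward for completing the first action.

So, to me, it seems like the rewards are misaligned.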
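
If we continue, the second reward in the predictions list will be the reward from recurrent_inference for completing the first action. However, the second reward in the targets list will be the reward stored in the game for completing the second action.

So, overall, I have three questions that build on each other:

1. What does the reward from initial_inference represent? (What is it?)
2. If it is 0 and is supposed to represent a reward, are the rewards misaligned between predictions and targets? (That is, should the second reward in predictions actually be matched with the first reward in targets?)
3. If they are misaligned, will the network still train and work correctly?

(Another thing to note is that, despite this misalignment (assuming there is one), the predictions and targets lists do have the same length. The length of targets is defined by the line for current_index in range(state_index, state_index + num_unroll_steps + 1) in the make_target function above. We can also work out that the length of predictions is len(actions) + 1, and len(actions) is defined by g.history[i:i + num_unroll_steps] in the sample_batch function (see the pseudocode). So the two lists have the same length.)

What is going on?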
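
To make the pairing and the length argument concrete, here is a small sketch using a made-up five-move episode. The reward values, the variable names, and the assumption that root_values has the same length as rewards are all illustrative. It reproduces only the indexing described in this question and says nothing about what the reference pseudocode intends.

```python
# Toy episode: the reward stored after the k-th action is 10 * (k + 1),
# so each entry of rewards is easy to recognise.
rewards = [10, 20, 30, 40, 50]  # stands in for game.rewards
history = [0, 1, 0, 1, 0]       # stands in for game.history (the actions)

state_index = 0
num_unroll_steps = 3

# Same slicing as in sample_batch: the actions fed to recurrent_inference.
actions = history[state_index:state_index + num_unroll_steps]

# Reward targets, indexed as in make_target (value and policy parts omitted).
# Assumption: root_values has the same length as rewards.
target_rewards = [
    rewards[i] if i < len(rewards) else 0
    for i in range(state_index, state_index + num_unroll_steps + 1)
]

# Which stored reward each predicted reward corresponds to under the reading
# in this question: slot 0 comes from initial_inference (no action taken),
# and slot k >= 1 follows actions[k - 1].
predicted_reward_source = [None] + [
    rewards[state_index + k] for k in range(len(actions))
]

pairs = list(zip(predicted_reward_source, target_rewards))
for predicted, target in pairs:
    print(f"prediction corresponds to {predicted}, target is {target}")
```

Under these assumptions both lists have four entries, and each predicted reward is paired with the stored reward one step later than the action it follows, which is the offset this question asks about.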

Answer 1

Author here.

What does the reward from initial_inference represent?
