My agent keeps taking random actions, so the algorithm isn't training properly. How can I make sure it takes the best action, which is stored in the line "next_action, ArgMax = custom_argmax(Q_value)"? The function custom_argmax finds the maximum Q-value over the actions available in a given state.
import random
import numpy as np

max_episodes = 10
max_steps_per_episode = 1000
discount_rate = 0.99
exploration_rate = 0.5
max_exploration_rate = 1
min_exploration_rate = 0.1
learning_rate = 0.01
explore_decay_rate = 0.2

errors = []

def play_single_game(max_steps_per_episode, render):
    global errors
    state = env.reset()
    action = env.action_space.sample()
    for step in range(max_steps_per_episode - 1):
        if render:
            env.render()
        # take the current action and observe the outcome
        new_state, reward, done, info = env.step(action)
        next_state = new_state
        old_weights = weights.theta.copy()
        if done:
            # terminal update: no bootstrapped next-state value
            weights.theta += learning_rate * (reward - weights_multiplied_by_features(state, action)) * feature_space(state, action)
            break
        else:
            Q_value = associated_Q_value(next_state)
            exploration_rate_threshold = random.uniform(0, 1)
            next_action, ArgMax = custom_argmax(Q_value)  # greedy (best) action and its Q-value
            if exploration_rate_threshold < exploration_rate:  # explore: take a random action
                next_action = random.randint(0, len(LEGAL_MOVES) - 1)
            # update Q(s, a) as we experience the episode
            weights.theta += learning_rate * (reward + discount_rate * ArgMax - weights_multiplied_by_features(state, action)) * feature_space(state, action)
            # the next state/action become the current state/action
            state = next_state
            action = next_action
        change_in_weights = np.abs(weights.theta - old_weights).sum()
        errors.append(change_in_weights)
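For reference, here is a minimal sketch of what a custom_argmax along these lines could look like, assuming Q_value is a 1-D array of Q-values indexed by action (the question does not show its actual implementation, so this is an assumption):

    import numpy as np

    def custom_argmax(q_values):
        # Return (best_action, max_q): the index of the largest Q-value
        # and the Q-value itself, for a 1-D array indexed by action.
        best_action = int(np.argmax(q_values))
        return best_action, q_values[best_action]

    # Example: action 1 has the highest Q-value
    best_action, max_q = custom_argmax(np.array([0.1, 0.7, 0.3]))
    # best_action == 1, max_q == 0.7

Note that np.argmax breaks ties by returning the first maximal index; if that biases early training, a random tie-break among maximal actions is a common variant.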
You are doing ε-greedy exploration. You have set exploration_rate = 0.5, so your agent will take a random action 50% of the time. That is probably too high, but it does not mean your agent isn't learning.

If you want to evaluate the agent properly, you must run episodes with exploration disabled. You can't simply disable random actions during training, because then the agent may never try different actions; this is the exploration/exploitation trade-off. However, you can slowly dial down exploration as the agent learns, e.g. with exploration_rate *= 0.999 or similar in your loop.
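The decay suggested above can be sketched like this, with the multiplicative decay clamped at the min_exploration_rate floor from the question's hyperparameters (the call to play_single_game is assumed, not shown):

    exploration_rate = 0.5
    min_exploration_rate = 0.1

    for episode in range(10):
        # ... run play_single_game(max_steps_per_episode, render=False) here ...
        exploration_rate = max(min_exploration_rate, exploration_rate * 0.999)

For evaluation runs, set exploration_rate to 0 so the agent always takes the greedy action from custom_argmax.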