I have been training a reinforcement learning agent to play ultimate tic-tac-toe (an extended version of tic-tac-toe with a 9x9 board and additional rules). I created an OpenAI Gym environment and have been trying to train the agent with stable_baselines3 PPO and DQN networks. However, the agent keeps choosing the same action for every state, even though that action is invalid in most states.

I believe the problem comes from my environment, because I have already tried tuning the training hyperparameters and switching the type of network being trained. I have also tried changing the reward values in the environment, but saw no improvement.
Here is my environment's constructor:
import numpy as np
from gym.spaces import Discrete, Box

def __init__(self):
    super(UltimateTicTacToeEnv, self).__init__()
    self.action_space = Discrete(81)  # 9 mini-boards * 9 squares = 81 actions
    # 81 board squares + pointer to the forced mini-board + current player
    self.observation_space = Box(low=0, high=2, shape=(83,), dtype=np.int32)
    self.reset()
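Roughly, reset and get_state look like this (a simplified sketch: the pointer attribute and the assumption that Board.values is a (9, 9) numpy array are placeholders for the real implementation):

def reset(self):
    self.board = Board()      # fresh game state (hypothetical constructor)
    self.current_player = 1   # player 1 moves first
    self.pointer = 0          # mini-board the next move must be played in
    return self.get_state()

def get_state(self):
    # 81 squares, then the pointer, then the current player
    return np.concatenate([
        self.board.values.flatten(),
        [self.pointer, self.current_player],
    ]).astype(np.int32)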
Here is the step method. Board is a separate class that handles move validation, the current board, and all modifications to it:
def step(self, action):
    reward = 0
    # The action is in [0, 80]; decode it into a mini-board and a square
    board = action // 9
    square = action % 9
    self.board.update()
    if self.board.isValid(board, square):  # checks whether the move is valid
        reward += 1  # small reward for a valid move
        self.board.addValue(self.current_player, board, square)  # adds the move to the board
        self.board.update()  # updates the board with the action
        # checks whether the player has won the mini 3x3 board the move was played in
        if Board.hasWon(self.board.values[board]) == self.current_player:
            reward += 1
        done, winner = self.check_game_over(board, square)  # checks if the game is over and who won
        if done and winner == self.current_player:
            reward += 5  # reward for winning the game
        self.current_player = 3 - self.current_player  # switch between players 1 and 2
    else:
        reward -= 1  # penalty for an invalid action
        done = False
    # get_state() returns a numpy array of length 83: the 81 board squares, the pointer
    # to the mini-board the next move must be played in, and the current player
    return self.get_state(), reward, done, {}
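One quick way to sanity-check the step logic is stable_baselines3's built-in environment checker plus a short random rollout (a sketch; nothing here depends on the Board internals):

from stable_baselines3.common.env_checker import check_env

env = UltimateTicTacToeEnv()
check_env(env, warn=True)  # flags observation/action space and API mismatches

obs = env.reset()
for _ in range(20):
    action = env.action_space.sample()  # random action, valid or not
    obs, reward, done, info = env.step(action)
    print(action, reward, done)         # eyeball the reward shaping
    if done:
        obs = env.reset()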
Here is the code I use to train PPO:
from stable_baselines3 import PPO

policy_kwargs = dict(
    net_arch=dict(pi=[83, 256, 256, 256, 81], vf=[83, 256, 256, 256, 81]),
)
model = PPO("MlpPolicy", env, verbose=1, learning_rate=2.5e-3, n_steps=2048,
            batch_size=64, n_epochs=10, gamma=0.99, gae_lambda=0.95,
            clip_range=0.2, ent_coef=0.005, policy_kwargs=policy_kwargs,
            device="cuda")
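For completeness, a sketch of how the model is then trained and queried (the total_timesteps value here is arbitrary; note that predict with deterministic=True always returns the argmax action, while deterministic=False samples from the policy distribution):

model.learn(total_timesteps=500_000)  # arbitrary training budget

obs = env.reset()
# deterministic=True -> argmax action; deterministic=False -> sampled action
action, _states = model.predict(obs, deterministic=True)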
Here is the code I use to train DQN:
from stable_baselines3 import DQN

policy_kwargs = dict(
    net_arch=[83, 256, 256, 256, 81],
)
model = DQN("MlpPolicy", env, verbose=1, learning_rate=2.5e-3,
            policy_kwargs=policy_kwargs, device='cuda')
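Among the hyperparameters that can be tuned here is DQN's epsilon-greedy exploration schedule, via keyword arguments on the constructor (the values below are illustrative, not the exact ones I tried):

model = DQN("MlpPolicy", env, verbose=1, learning_rate=2.5e-3,
            exploration_fraction=0.3,     # decay epsilon over 30% of training
            exploration_initial_eps=1.0,  # start fully random
            exploration_final_eps=0.1,    # keep some exploration at the end
            policy_kwargs=policy_kwargs, device='cuda')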
Any suggestions about what could be causing the agent to choose the same action for every state, and how I might fix it?