I am trying to solve a GridWorld problem using numpy. It consists of a 3x3 maze whose terminal states are (3,2) and (3,3), with rewards -1 and +1 respectively. Using dynamic programming, I want to determine the value of each state. The robot can move in every direction; it takes the intended action with probability 0.8, and with probability 0.1 each it moves to its right or its left instead.
I have some doubts, because three trials are considered:
Trajectory 1: <(1,1), 0> → E → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>
Trajectory 2: <(1,1), 0> → E → <(1,2), 0> → E → <(2,2), 0> → N → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>
Trajectory 3: <(1,1), 0> → E → <(2,1), 0> → N → <(1,1), 0> → E → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>
I am not sure whether these trials are relevant to the code implementation. However, they are useful to me for approximating the state values by hand with the Bellman equation.
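Those hand approximations can be cross-checked mechanically. The sketch below estimates state values from the three trajectories by first-visit Monte Carlo: for each state, average the discounted return (sum of gamma^k times the rewards that follow its first visit) across the trajectories that contain it. This is an assumption on my part, not part of the original question; gamma = 0.9 is taken from the answer's code.

```python
from collections import defaultdict

gamma = 0.9

# The three trajectories, as (state, reward-on-entering-state) pairs
trajectories = [
    [((1, 1), 0), ((1, 2), 0), ((1, 3), 0), ((2, 3), 0), ((3, 3), 1)],
    [((1, 1), 0), ((1, 2), 0), ((2, 2), 0), ((1, 2), 0),
     ((1, 3), 0), ((2, 3), 0), ((3, 3), 1)],
    [((1, 1), 0), ((2, 1), 0), ((1, 1), 0), ((1, 2), 0),
     ((1, 3), 0), ((2, 3), 0), ((3, 3), 1)],
]

def first_visit_returns(trajectory, gamma):
    """Discounted return following the first visit to each state."""
    returns = {}
    for t, (state, _) in enumerate(trajectory):
        if state not in returns:
            returns[state] = sum(gamma ** (k - t) * r
                                 for k, (_, r) in enumerate(trajectory)
                                 if k >= t)
    return returns

# Average the per-trajectory returns to estimate V(s)
totals, counts = defaultdict(float), defaultdict(int)
for traj in trajectories:
    for s, G in first_visit_returns(traj, gamma).items():
        totals[s] += G
        counts[s] += 1
estimates = {s: totals[s] / counts[s] for s in totals}
print(estimates)
```

For instance, (2,3) is one step from the +1 terminal in every trajectory, so its estimate is 0.9; (1,1) averages 0.9^4 from trajectory 1 with 0.9^6 from the two longer ones.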
I created the function calculateStateReward(initialState, action), in which, by drawing probability = random.uniform(0,1), I decide whether the robot executes the intended action and moves to the adjacent state within the maze, or instead moves to its right or left. In this function the rewards are also defined, both for the terminal states and for the board boundaries. The function returns the final state and the reward. I need to know how to implement the dynamic-programming iteration algorithm using the provided trials.
Here is a basic example of value iteration. Two points matter: the Bellman optimality update takes the maximum over actions (not the sum), and the expectation over the 0.8/0.1/0.1 slip outcomes is computed explicitly, so calculateStateReward is assumed to execute the given action deterministically. You may need to adjust it to the specifics of your implementation and the requirements of your GridWorld problem.
import numpy as np

# Slip model: the intended action is executed with probability 0.8; with
# probability 0.1 each, the robot moves perpendicular to it (its right/left)
perpendicular = {"N": ("E", "W"), "S": ("W", "E"),
                 "E": ("S", "N"), "W": ("N", "S")}

# Initialize state values; array index [i, j] is grid cell (i+1, j+1).
# Terminal values stay 0: their rewards (-1 at (3,2), +1 at (3,3)) are
# collected on the transition into them, via calculateStateReward
V = np.zeros((3, 3))
terminals = ((2, 1), (2, 2))  # (3,2) and (3,3), 0-indexed

# Transition probabilities
prob_desired = 0.8
prob_slip = 0.1

gamma = 0.9   # Discount factor
theta = 0.01  # Convergence threshold

# Value iteration. calculateStateReward(state, action) must be deterministic
# here: the 0.8/0.1/0.1 randomness is handled by the expectation below,
# not by sampling inside the function
num_iterations = 1000  # Upper bound on the number of sweeps
for _ in range(num_iterations):
    new_V = np.copy(V)
    for i in range(3):
        for j in range(3):
            if (i, j) in terminals:  # Skip terminal states
                continue
            action_values = []
            for action in ["N", "E", "S", "W"]:
                right, left = perpendicular[action]
                q = 0.0
                # Expected value over the three possible outcomes
                for a, p in ((action, prob_desired),
                             (right, prob_slip), (left, prob_slip)):
                    new_i, new_j, reward = calculateStateReward((i, j), a)
                    q += p * (reward + gamma * V[new_i, new_j])
                action_values.append(q)
            # Bellman optimality update: best action, not the sum
            new_V[i, j] = max(action_values)
    # Check for convergence after a full sweep
    done = np.max(np.abs(new_V - V)) < theta
    V = new_V
    if done:
        break

# Print the final state values
print("Final State Values:")
print(V)
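For a dynamic-programming sweep, calculateStateReward should be deterministic given the action actually executed; sampling inside it (as in the simulation version described in the question) would make the Bellman backup noisy. A hypothetical deterministic sketch, using the same 0-indexed coordinates as the array V, might look like this:

```python
# Hypothetical deterministic calculateStateReward: executes the given action
# literally and returns (row, col, reward). 0-indexed, so the terminals
# (3,2) and (3,3) of the problem are (2,1) and (2,2) here.
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
TERMINAL_REWARDS = {(2, 1): -1, (2, 2): 1}

def calculateStateReward(state, action):
    if state in TERMINAL_REWARDS:  # Terminal states are absorbing
        return state[0], state[1], 0
    di, dj = MOVES[action]
    ni, nj = state[0] + di, state[1] + dj
    if not (0 <= ni <= 2 and 0 <= nj <= 2):  # Moves off the board stay put
        ni, nj = state
    return ni, nj, TERMINAL_REWARDS.get((ni, nj), 0)
```

Here moves off the 3x3 board leave the robot in place with reward 0; if your formulation penalizes hitting the boundary, as the question hints, adjust the out-of-bounds branch accordingly.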