I am trying to implement the Q-Learning algorithm in R:
# Define the map
map <- matrix(c(0, 1, 1, 0, 0, 0, 0, 1), nrow = 2, ncol = 4, byrow = TRUE)
# State labels
rownames(map) <- c("Start", "End")
# Action labels
colnames(map) <- c("Up", "Down", "Left", "Right")
# Rewards for each state-action pair
rewards <- matrix(c(-1, -1, -1, -1, -1, -1, -1, 10), nrow = 2, ncol = 4, byrow = TRUE)
# Q-Learning Algorithm
q_learning <- function(P, R, gamma = 0.9, alpha = 0.1, epsilon = 0.1, max_iter = 1000) {
  # Initialize the Q-value function
  Q <- matrix(rep(0, nrow(P) * ncol(P)), nrow = nrow(P), ncol = ncol(P))
  # Initialize the state
  state <- sample(1:nrow(P), 1)
  # Iterate until convergence or maximum iterations reached
  for (i in 1:max_iter) {
    # Choose an action using epsilon-greedy policy
    if (runif(1) < epsilon) {
      action <- sample(1:ncol(P), 1)
    } else {
      action <- which.max(Q[state, ])
    }
    # Observe the next state and reward
    prob <- P[state, action]
    next_state <- sample(1:nrow(P), 1, prob = prob)
    reward <- R[state, action]
    # Update the Q-value function
    Q[state, action] <- Q[state, action] + alpha * (reward + gamma * max(Q[next_state, ]) - Q[state, action])
    # Update the state
    state <- next_state
  }
  # Derive the optimal policy (argmax in R using the which.max)
  policy <- apply(Q, 1, which.max)
  # Return the Q-value function and policy
  return(list(Q = Q, policy = policy))
}
# Run the Q-Learning Algorithm on the map
q_learning(P = map, R = rewards, gamma = 0.9, alpha = 0.1, epsilon = 0.1, max_iter = 1000)
I get an error from the sample function saying that the number of probabilities is incorrect.
Error in sample.int(length(x), size, replace, prob) :
incorrect number of probabilities
How can I fix this?
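The error can be reproduced outside the function. The snippet below is a minimal sketch that reuses the map matrix from above (labels left out) and picks state 1 and action 2 arbitrarily; n_prob, n_states and res are names introduced only for this illustration. It shows that P[state, action] is a single number, while sample() is drawing from nrow(P) = 2 candidates and so expects two probabilities.

```r
# Same map as in the question (row/column labels omitted for brevity)
map <- matrix(c(0, 1, 1, 0, 0, 0, 0, 1), nrow = 2, ncol = 4, byrow = TRUE)

# Arbitrary state/action pair, chosen only for illustration
prob <- map[1, 2]        # indexing one cell returns a single number
n_prob <- length(prob)   # number of probabilities supplied
n_states <- nrow(map)    # number of candidates sample() draws from

# Catch the error instead of stopping, so the message can be inspected
res <- tryCatch(
  sample(1:n_states, 1, prob = prob),
  error = function(e) conditionMessage(e)
)
print(c(n_prob = n_prob, n_states = n_states))
print(res)
```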
I'm not familiar with this algorithm, but judging from the code, you could try
prob <- P[ , action]
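This creates a vector of length nrow(P), which is the number of probabilities that sample() expects here. You will need to work out the rest of the logic yourself!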
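One possible way to finish the logic, offered as a sketch and not as the definitive fix: with the map as posted, P[ , action] can still fail, because the "Up" column is c(0, 0) and leaves sample() with no positive probability to draw from. A common layout for tabular Q-Learning is a three-dimensional array P[state, action, next_state], so that P[state, action, ] always has one probability per next state. The code below assumes, purely for illustration, that a 1 in map means "this action moves to the other state" and a 0 means "stay put"; that reading is my guess, not something stated in the question.

```r
# Matrices from the question (labels omitted for brevity)
map <- matrix(c(0, 1, 1, 0, 0, 0, 0, 1), nrow = 2, ncol = 4, byrow = TRUE)
rewards <- matrix(c(-1, -1, -1, -1, -1, -1, -1, 10), nrow = 2, ncol = 4, byrow = TRUE)

n_states <- nrow(map)
n_actions <- ncol(map)

# ASSUMPTION (hypothetical dynamics): map[s, a] == 1 moves the agent to the
# other state, map[s, a] == 0 keeps it where it is.
# P[s, a, s2] = probability of landing in s2 after taking action a in state s
P <- array(0, dim = c(n_states, n_actions, n_states))
for (s in 1:n_states) {
  for (a in 1:n_actions) {
    target <- if (map[s, a] == 1) 3 - s else s  # 3 - s flips 1 <-> 2
    P[s, a, target] <- 1
  }
}

q_learning <- function(P, R, gamma = 0.9, alpha = 0.1, epsilon = 0.1, max_iter = 1000) {
  n_states <- dim(P)[1]
  n_actions <- dim(P)[2]
  Q <- matrix(0, nrow = n_states, ncol = n_actions)
  state <- sample.int(n_states, 1)
  for (i in seq_len(max_iter)) {
    # Epsilon-greedy action selection
    if (runif(1) < epsilon) {
      action <- sample.int(n_actions, 1)
    } else {
      action <- which.max(Q[state, ])
    }
    # One probability per candidate next state, so the lengths match
    prob <- P[state, action, ]
    next_state <- sample.int(n_states, 1, prob = prob)
    reward <- R[state, action]
    # Standard Q-Learning update
    Q[state, action] <- Q[state, action] +
      alpha * (reward + gamma * max(Q[next_state, ]) - Q[state, action])
    state <- next_state
  }
  list(Q = Q, policy = apply(Q, 1, which.max))
}

set.seed(1)  # only to make the run repeatable
result <- q_learning(P, rewards)
print(result$Q)
print(result$policy)
```

If the real environment has different dynamics, only the loop that fills P needs to change; each P[s, a, ] slice just has to sum to 1.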