Using a neural network for reinforcement learning on the multi-armed bandit problem


For educational purposes I am trying to implement a very simple neural network with reinforcement learning (Q-learning) for the multi-armed bandit problem. So far I have implemented it successfully without a neural network (source: Reinforcement Learning: Life is a Maze):
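For reference, the update() function below is the incremental sample-average estimate of each bandit's mean reward (k[i] is the number of plays of bandit i, r the observed reward):

\[ Q(i) \leftarrow Q(i) + \tfrac{1}{k_i + 1}\,\bigl(r - Q(i)\bigr) \]

and which.bandit() is epsilon-greedy: with probability 1 − epsilon it plays the bandit with the highest current Q estimate, otherwise a uniformly random one.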

set.seed(3141) # for reproducibility
 
# Q-learning update function
update <- function(i, r) {
  Q[i] <<- Q[i] + 1/(k[i]+1) * (r-Q[i]) # Q-learning function
  k[i] <<- k[i] + 1 # one more game played on i'th bandit
}
 
# simulate game on one-armed bandit i
ret <- function(i) {
  round(rnorm(1, mean = rets[i]))
}
 
# choose which bandit to play
which.bandit <- function() {
  p <- runif(1)
  ifelse(p >= epsilon, which.max(Q), sample(1:n, 1))
}
 
epsilon <- 0.1 # explore (play a random bandit) with probability epsilon
rets <- c(4, 5, 4, 4, 4) # average returns of bandits
n <- length(rets)
Q <- rep(0, n) # initialize return vector
k <- rep(0, n) # initialize vector for games played on each bandit
N <- 1000 # no. of runs
R <- 0 # sum of returns
 
for (j in 1:N) {
  i <- which.bandit() # choose a bandit
  r <- ret(i) # simulate bandit
  R <- R + r # add return of bandit to overall sum of returns
  update(i, r) # calling Q-learning update function
}
 
which.max(Q) # which bandit has the highest return?
## [1] 2
 
Q
## [1] 4.000000 5.040481 4.090909 4.214286 3.611111
 
k
## [1]  32 914  22  14  18
 
N * max(rets) # theoretical max. return
## [1] 5000
 
R
## [1] 4949
 
R / (N * max(rets)) # percent reached of theoretical max
## [1] 0.9898

Now, if I understand correctly that the neural network acts as the "memory" of the reinforcement learning algorithm, the implementation should not be too complicated, and I want to use the very simple neuralnet package.
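The only neuralnet pattern this relies on is: fit a regression network with the formula interface, predict on a new data frame, and re-fit warm-started from the previous weights via startweights. A minimal sketch on made-up toy data (the values are placeholders, not results):

library(neuralnet)

toy <- data.frame(i = 1:5, Q = c(4, 5, 4, 4, 4)) # one row per bandit
nn <- neuralnet(Q ~ i, data = toy, hidden = 5, linear.output = TRUE) # small regression net
predict(nn, data.frame(i = 1:5)) # Q estimates for all bandits

toy$Q[2] <- 5.5 # pretend one target value changed
nn <- neuralnet(Q ~ i, data = toy, hidden = 5, linear.output = TRUE,
                startweights = nn$weights) # warm start from the old weights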

So far my results have been underwhelming. Here is what I have tried:

library(neuralnet)
set.seed(271) # for reproducibility

epsilon <- 0.1 # explore (play a random bandit) with probability epsilon
rets <- c(4, 5, 4, 4, 4) # average returns of bandits
n <- length(rets)
k <- rep(0, n) # initialize vector for games played on each bandit
N <- 1000 # no. of runs
R <- 0 # sum of returns

# Neural network model
nn <- neuralnet(Q ~ i, data = data.frame(i = 1:n, Q = rep(0, n)),
                hidden = c(25), linear.output = TRUE,
                lifesign = "none", algorithm = "rprop+")

# Predict Q-values using the neural network
predict.Q <- function(i) {
  predict(nn, data.frame(i = i))
}
Q <- predict.Q(1:n) # initialize return vector

# Update the neural network using backpropagation
update <- function(i, r) {
  y <- Q[i] + 1/(k[i]+1) * (r-Q[i]) # Q-learning function
  k[i] <<- k[i] + 1 # one more game played on the i'th bandit
  
  # Update the neural network weights
  nn <<- neuralnet(Q ~ i, data = data.frame(i = 1:n, Q = c(Q[-i], y)),
                   hidden = c(25), linear.output = TRUE,
                   startweights = nn$weights,
                   lifesign = "none", algorithm = "rprop+")
}

# simulate game on one-armed bandit i
ret <- function(i) {
  round(rnorm(1, mean = rets[i]))
}

# choose which bandit to play
which.bandit <- function() {
  p <- runif(1)
  Q <- predict.Q(1:n)
  ifelse(p >= epsilon, which.max(Q), sample(1:n, 1))
}

Q <- predict.Q(1:n)

for (j in 1:N) {
  i <- which.bandit() # choose a bandit
  r <- ret(i) # simulate bandit
  R <- R + r # add return of bandit to overall sum of returns
  update(i, r) # calling Q-learning update function
}

which.max(Q) # which bandit has the highest return?
## [1] 5
Q
##             [,1]
## [1,] -0.00620397
## [2,]  0.04211132
## [3,] -0.04186197
## [4,] -0.03724244
## [5,]  0.04322625
k
## [1]  51 103  61 112 673
N * max(rets) # theoretical max. return
## [1] 5000
R
## [1] 4072
R / (N * max(rets)) # percent reached of theoretical max
## [1] 0.8144
N * mean(rets) # mean return of naive strategy
## [1] 4200
(R - N * mean(rets)) / (N * mean(rets)) # percentage gain versus expected return with naive strategy
## [1] -0.03047619

My questions

  1. Is this implementation itself correct?
  2. What can be done to improve it?
  3. Are the results this poor because a neural network is simply the wrong approach here? (Apart from educational purposes it is of course overkill... but even "overkill" should still work.)

Thanks

r neural-network reinforcement-learning q-learning