Here is the task, in the form I understand it:
I have a 5x5 grid filled with different numbers; in this example they range from 0 to 9. I have 100 units of currency to spend, which I call "NPV", meaning the amount of currency we have on hand.
Each drill placed in a cell costs 10 currency.
Using reinforcement learning, I need to create a process in which the agent iterates by itself: each time the grid is updated, the agent picks the cell with the highest value to place a drill in, and returns the placement order as a list.
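To make the goal concrete, this is the kind of result I expect at the end (a made-up example; I index cells 0-24 in row-major order, so index = row * 5 + col):

# Hypothetical example of the output I want: the order in which drills
# were placed, as flat cell indices. With NPV = 100 and a cost of 10
# per drill, at most 100 // 10 == 10 drills fit in one episode.
placement_order = [12, 7, 18, 3, 21, 9, 14, 0, 6, 23]
print(placement_order)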
I have gone through endless runs and I keep hitting the same problem: an "endless" loop, where nothing the agent learns ever changes the outcome. I cannot figure out what to do in this situation. Can you help me understand what I am missing?
Attempt 1 - the total reward never changes from -100 (Episode 999: Total Reward = -100). The lines I suspect are the problem:
# Choose actions until all drills are placed
while not done:
    # Choose an action based on the current state
    action = agent.choose_action(current_state)
    # Get the reward for the chosen action
    reward = get_reward(current_state, action)
    total_reward += reward
    # Update the grid
    row = action // 5
    col = action % 5
    grid[row][col] = 1
    # Update the state based on the chosen action
    next_state = action
    # Check if all drills are placed
    if total_reward <= -100:
        done = True
    # Update the Q-table with the chosen action and reward
    agent.learn(current_state, action, reward, next_state, done)
    # Set the current state to the next state
    current_state = next_state
# Print the total reward for the episode
print("Episode {}: Total Reward = {}".format(episode, total_reward))
Full code:
import random
import numpy as np

# define the grid
grid = []
for i in range(5):
    row = []
    for j in range(5):
        row.append(random.randint(0, 9))
    grid.append(row)

# Print the grid
for row in grid:
    print(row)

# Define the Q-learning agent
class QLearningAgent:
    def __init__(self, state_size, action_size, learning_rate=0.1, discount_factor=0.95, exploration_rate=1.0,
                 exploration_decay_rate=0.99):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_rate = exploration_rate
        self.exploration_decay_rate = exploration_decay_rate
        self.q_table = np.zeros((state_size, action_size))

    def choose_action(self, state):
        if np.random.uniform(0, 1) < self.exploration_rate:
            # Explore action space
            return np.random.choice(self.action_size)
        else:
            # Exploit learned values
            return np.argmax(self.q_table[state, :])

    def learn(self, state, action, reward, next_state, done):
        current_q_value = self.q_table[state, action]
        next_max_q_value = np.max(self.q_table[next_state, :])
        td_target = reward + self.discount_factor * next_max_q_value * (1 - int(done))
        td_error = td_target - current_q_value
        new_q_value = current_q_value + self.learning_rate * td_error
        self.q_table[state, action] = new_q_value
        if done:
            self.exploration_rate *= self.exploration_decay_rate

# Define the action space as all the cells in the grid
action_space = list(range(25))

def get_reward(state, action):
    npv = 0
    # Check if action is valid (NPV is sufficient and cell is empty)
    row = action // 5
    col = action % 5
    if grid[row][col] == 0 and npv >= 10:
        npv -= 10
        return 10
    else:
        return -10

# Set up
num_episodes = 1000
npv = 100

# Initialize the Q-learning agent
agent = QLearningAgent(state_size=len(action_space), action_size=len(action_space))

# Run the training loop
for episode in range(num_episodes):
    # Reset the environment
    current_state = 0
    total_reward = 0
    done = False
    # Choose actions until all drills are placed
    while not done:
        # Choose an action based on the current state
        action = agent.choose_action(current_state)
        # Get the reward for the chosen action
        reward = get_reward(current_state, action)
        total_reward += reward
        # Update the grid
        row = action // 5
        col = action % 5
        grid[row][col] = 1
        # Update the state based on the chosen action
        next_state = action
        # Check if all drills are placed
        if total_reward <= -100:
            done = True
        # Update the Q-table with the chosen action and reward
        agent.learn(current_state, action, reward, next_state, done)
        # Set the current state to the next state
        current_state = next_state
    # Print the total reward for the episode
    print("Episode {}: Total Reward = {}".format(episode, total_reward))
    # Reset the grid for the next episode
    grid = []
    for i in range(5):
        row = []
        for j in range(5):
            row.append(random.randint(0, 9))
        grid.append(row)
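One thing I noticed while debugging, though I am not sure it is the whole story: get_reward sets a local npv = 0 and never sees the global budget, so the condition npv >= 10 is always false and every action is rewarded with -10. That would explain why every episode prints exactly -100 after ten steps. A quick check I ran after the code above:

# Run after the code above: with the local `npv = 0`, the budget check
# in get_reward can never pass, so every action counts as invalid.
print(get_reward(0, 0))   # always prints -10, whatever the grid holds
print(get_reward(0, 24))  # same here

Even so, I still do not understand how the budget and the grid are supposed to be threaded through the state so that Q-learning can actually learn a placement order.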