Here is the task, in the form I understand it:
I have a 5x5 grid filled with different numbers; in this example they range from 0 to 9. I have 100 units of currency to spend, which I call "NPV", meaning the amount of currency we have on hand.
Each drill placed in a cell costs 10 currency.
Using reinforcement learning, I need to create a process in which the agent iterates by itself: each time the grid is updated, the agent picks the cell with the highest value to place a drill in, and returns the placement order as a list.
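To make the goal concrete, this is the kind of result I expect at the end (a made-up example; I index cells 0-24 in row-major order, so index = row * 5 + col):

# Hypothetical example of the output I want: the order in which drills
# were placed, as flat cell indices. With NPV = 100 and a cost of 10
# per drill, at most 100 // 10 == 10 drills fit in one episode.
placement_order = [12, 7, 18, 3, 21, 9, 14, 0, 6, 23]
print(placement_order)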
I have gone through endless runs and I keep hitting the same problem: an "endless" loop, where nothing the agent learns ever changes the outcome. I cannot figure out what to do in this situation. Can you help me understand what I am missing?
Attempt 1 - the total reward never changes from -100 (Episode 999: Total Reward = -100). The lines I suspect are the problem:
# Choose actions until all drills are placed
while not done:
    # Choose an action based on the current state
    action = agent.choose_action(current_state)
    # Get the reward for the chosen action
    reward = get_reward(current_state, action)
    total_reward += reward
    # Update the grid
    row = action // 5
    col = action % 5
    grid[row][col] = 1
    # Update the state based on the chosen action
    next_state = action
    # Check if all drills are placed
    if total_reward <= -100:
        done = True
    # Update the Q-table with the chosen action and reward
    agent.learn(current_state, action, reward, next_state, done)
    # Set the current state to the next state
    current_state = next_state
# Print the total reward for the episode
print("Episode {}: Total Reward = {}".format(episode, total_reward))
Full code:
import random
import numpy as np

# define the grid
grid = []
for i in range(5):
    row = []
    for j in range(5):
        row.append(random.randint(0, 9))
    grid.append(row)

# Print the grid
for row in grid:
    print(row)

# Define the Q-learning agent
class QLearningAgent:
    def __init__(self, state_size, action_size, learning_rate=0.1, discount_factor=0.95, exploration_rate=1.0,
                 exploration_decay_rate=0.99):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_rate = exploration_rate
        self.exploration_decay_rate = exploration_decay_rate
        self.q_table = np.zeros((state_size, action_size))

    def choose_action(self, state):
        if np.random.uniform(0, 1) < self.exploration_rate:
            # Explore action space
            return np.random.choice(self.action_size)
        else:
            # Exploit learned values
            return np.argmax(self.q_table[state, :])

    def learn(self, state, action, reward, next_state, done):
        current_q_value = self.q_table[state, action]
        next_max_q_value = np.max(self.q_table[next_state, :])
        td_target = reward + self.discount_factor * next_max_q_value * (1 - int(done))
        td_error = td_target - current_q_value
        new_q_value = current_q_value + self.learning_rate * td_error
        self.q_table[state, action] = new_q_value
        if done:
            self.exploration_rate *= self.exploration_decay_rate

# Define the action space as all the cells in the grid
action_space = list(range(25))

def get_reward(state, action):
    npv = 0
    # Check if action is valid (NPV is sufficient and cell is empty)
    row = action // 5
    col = action % 5
    if grid[row][col] == 0 and npv >= 10:
        npv -= 10
        return 10
    else:
        return -10

# Set up
num_episodes = 1000
npv = 100

# Initialize the Q-learning agent
agent = QLearningAgent(state_size=len(action_space), action_size=len(action_space))

# Run the training loop
for episode in range(num_episodes):
    # Reset the environment
    current_state = 0
    total_reward = 0
    done = False
    # Choose actions until all drills are placed
    while not done:
        # Choose an action based on the current state
        action = agent.choose_action(current_state)
        # Get the reward for the chosen action
        reward = get_reward(current_state, action)
        total_reward += reward
        # Update the grid
        row = action // 5
        col = action % 5
        grid[row][col] = 1
        # Update the state based on the chosen action
        next_state = action
        # Check if all drills are placed
        if total_reward <= -100:
            done = True
        # Update the Q-table with the chosen action and reward
        agent.learn(current_state, action, reward, next_state, done)
        # Set the current state to the next state
        current_state = next_state
    # Print the total reward for the episode
    print("Episode {}: Total Reward = {}".format(episode, total_reward))
    # Reset the grid for the next episode
    grid = []
    for i in range(5):
        row = []
        for j in range(5):
            row.append(random.randint(0, 9))
        grid.append(row)
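One thing I noticed while debugging, though I am not sure it is the whole story: get_reward sets a local npv = 0 and never sees the global budget, so the condition npv >= 10 is always false and every action is rewarded with -10. That would explain why every episode prints exactly -100 after ten steps. A quick check I ran after the code above:

# Run after the code above: with the local `npv = 0`, the budget check
# in get_reward can never pass, so every action counts as invalid.
print(get_reward(0, 0))   # always prints -10, whatever the grid holds
print(get_reward(0, 24))  # same here

Even so, I still do not understand how the budget and the grid are supposed to be threaded through the state so that Q-learning can actually learn a placement order.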