Question about the basic policy gradient implementation in PyTorch

Question

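In 1_simple_pg.py, the entire batch of states and actions is passed to the function compute_loss. As I understand it, the loss we need to compute should have the following form: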

    u = 0
    for each trajectory {
        u += P(trajectory | Theta) * R(trajectory)
    }
    loss = u / (number of trajectories)

However, what compute_loss actually computes appears to be something different:


    x = 0
    for each item in the batch {
        x += P(a | s) * R(trajectory)
    }
    loss = x / (number of items in the batch)
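
To make the comparison concrete, here is a minimal sketch of the two forms side by side. It rests on several assumptions that are not taken from 1_simple_pg.py: the probabilities are used as log-probabilities (the usual choice for a policy-gradient surrogate loss), only the policy terms of the trajectory probability are kept (the environment dynamics do not depend on Theta), and the toy numbers and function names are hypothetical. It uses plain Python rather than PyTorch tensors so the arithmetic stays visible.

```python
import math

# Hypothetical toy data (not from 1_simple_pg.py): two trajectories, each a
# list of per-step action probabilities pi(a_t | s_t), and one return per
# trajectory.
step_probs = [[0.5, 0.25], [0.8, 0.5, 0.1]]
returns = [2.0, 3.0]


def per_trajectory_objective(step_probs, returns):
    """Mean over trajectories of log P(trajectory) * R(trajectory).

    Only the policy terms of log P(trajectory) are summed; the dynamics
    terms are dropped because they do not depend on the policy parameters.
    """
    total = 0.0
    for probs, ret in zip(step_probs, returns):
        log_p_traj = sum(math.log(p) for p in probs)
        total += log_p_traj * ret
    return total / len(step_probs)


def per_step_objective(step_probs, returns):
    """Mean over all (state, action) items of log pi(a | s) * R(trajectory)."""
    terms = [
        math.log(p) * ret
        for probs, ret in zip(step_probs, returns)
        for p in probs
    ]
    return sum(terms) / len(terms)


traj_value = per_trajectory_objective(step_probs, returns)
step_value = per_step_objective(step_probs, returns)

num_traj = len(step_probs)
num_steps = sum(len(probs) for probs in step_probs)

# Both forms sum the same terms; they differ only in the divisor.
ratio_ok = math.isclose(traj_value, step_value * num_steps / num_traj)
print(ratio_ok)
```

Under these assumptions the two quantities sum exactly the same terms and differ only by the factor (number of items) / (number of trajectories), which is a constant for a fixed batch and so rescales the gradient without changing its direction.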

I am new to both this topic and PyTorch, so my understanding above may be incorrect.

Could someone clarify the above?

Thanks