Difficulty speeding up PyTorch code: training an MLP with a complicated many-to-one nonlinear function


In short:

My goal is to figure out whether a particular, complicated nonlinear function can replace the individual neurons in a neural network. Ideally, I would like to show that I can train on MNIST digit images. I have tried this in PyTorch, but it is far too slow, mainly because I cannot work out how to parallelize the computation across batches and neurons, and I am looking for ideas or approaches that would speed things up significantly.

A typical neuron in a neural network computes a dot product and then applies a nonlinear function to the result of that dot product, f(x · w).

Instead of f(x · w), I am considering a many-to-one nonlinear function that is a more general nonlinear function of x and w, i.e. f(x, w). The nonlinear function f(x, w) takes a 1-D array x and a 1-D array w and returns a single output. I have NumPy code that performs this calculation. It models a real physical system and requires a series of recursive integrals to compute. In a previous question I learned that I can convert the NumPy code into a PyTorch function, and PyTorch should then be able to backpropagate gradients through it automatically.
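As a toy illustration of the difference (the placeholder f below is not my actual function, which requires the recursive integrals mentioned above):

import torch

x = torch.randn(100)   # one input vector
w = torch.randn(100)   # one neuron's weight vector

# standard neuron: nonlinearity applied to a single scalar dot product
standard_out = torch.tanh(x @ w)

# generalized "many-to-one" neuron: a nonlinear function of the full vectors
# (toy placeholder only; the real f(x, w) involves recursive integrals)
def f(x_vec, w_vec):
    return torch.trapz(torch.sin(x_vec * w_vec), dx=1.0)

general_out = f(x, w)  # still a single scalar output per neuron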

I now have PyTorch code describing the nonlinear function f(x, w). I want to show that I can use it to learn digit images, so I downscaled the MNIST digits to 10x10 pixel images and built an MLP-inspired network with 100 inputs, a hidden size of 100, and 10 outputs.

To explain this MLP-inspired network in more detail:

The first layer consists of 100 "neurons", where the typical neuron is replaced by my nonlinear function f(x, w). Each of the 100 "neurons" takes the input x and has its own set of weights w. The outputs of these 100 neurons are then passed to the next layer, which is just 10 neurons whose outputs are used to identify each of the 10 digits.

Here is a code snippet of the network's forward pass:

class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.weights1 = nn.Parameter(torch.randn(input_size, hidden_size))  # weights of size (input_size, hidden_size)
        self.weights2 = nn.Parameter(torch.randn(hidden_size, num_classes)) # weights of size (hidden_size, num_classes)

    def forward(self, x):
        hidden_outputs = []
        for neuron_weights in self.weights1.T:  # loop over each neuron's weights in the first layer
            output_value = input_output_nonlinearity_torch(x.squeeze(), neuron_weights, C_total = C_total, readoutStrength = readoutStrength)
            output_value = F.relu(output_value)  # apply a relu activation function
            hidden_outputs.append(output_value)

        final_outputs = []
        for neuron_weights in self.weights2.T:  # loop over each neuron's weights in the second layer
            output_value = input_output_nonlinearity_torch(hidden_outputs, neuron_weights, C_total = C_total, readoutStrength = readoutStrength)
            final_outputs.append(output_value)

        final_outputs = [output.unsqueeze(0) for output in final_outputs]
        final_output = torch.stack(final_outputs, dim=0)
        final_output = final_output.t()
        return final_output

The problem is that each training iteration of this network (a single forward pass on a single digit) takes about 20 minutes. So I really need to figure out how to make it faster.
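(For reference, a single forward pass can be timed with a minimal sketch like the following, assuming model, images, and device are set up as in the training code below:)

import time
import torch

start = time.perf_counter()
outputs = model(images)         # one forward pass on one batch
if device.type == 'cuda':
    torch.cuda.synchronize()    # make sure GPU work has finished before reading the clock
print(f"forward pass took {time.perf_counter() - start:.1f} s")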

input_output_nonlinearity() is my nonlinear function f(x, w). As you can see from the code, I compute the output of each "neuron" in the network separately by looping over each set of weights with a for loop. In principle, though, each neuron is completely independent and could be run in parallel.

One approach would therefore be to vectorize my code further. However, I have not found a simple way to vectorize it so that I can pass a matrix X and a matrix W to f(x, w) and get back a set of outputs for a set of neurons and a set of input data (I give the full code at the end). To me this seems quite challenging to implement (though I am sure it is possible).

Another thought is that perhaps there is some other way to tell PyTorch that these computations are completely independent, so that it can do some parallelization behind the scenes? Any ideas on whether that is possible? Or do I have to brute-force this with a fully vectorized solution?
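One possible direction (a sketch only, not verified against this exact function) is torch.func.vmap from PyTorch 2.x, which maps a per-neuron function over stacked weight vectors without an explicit Python loop. Note that vmap typically rejects in-place writes such as the BoutMatrix[i, :] assignment when the written value depends on the mapped dimension, so the inner function would likely need to be rewritten to build its results functionally (e.g. with torch.stack):

import torch
from torch.func import vmap

# hypothetical sketch: vectorize the per-neuron call over the rows of weights1.T
def one_neuron(neuron_weights, x_vec):
    return input_output_nonlinearity_torch(x_vec, neuron_weights,
                                           C_total=C_total,
                                           readoutStrength=readoutStrength)

# in_dims=(0, None): map over neurons (dim 0 of the stacked weights), share x_vec
hidden_outputs = vmap(one_neuron, in_dims=(0, None))(model.weights1.T, x.squeeze())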

Here is the code for the whole thing. I apologize for the length, but I want to provide the complete code so that any speed inefficiencies can be properly identified.

Here is the code that trains the network:

from neural_network_pytorch_dot_product import *
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import torch.utils.data as data
import numpy
import torch.nn.functional as F

numpoints_for_greens_integrals = 100
C_total = .1
readoutStrength = 1/C_total

def reduce_dataset(dataloader, fraction):
    num_samples = int(len(dataloader.dataset) * fraction)
    indices = torch.randperm(len(dataloader.dataset))[:num_samples]
    new_dataset = data.Subset(dataloader.dataset, indices)
    new_dataloader = data.DataLoader(new_dataset, batch_size=dataloader.batch_size, shuffle=True)
    return new_dataloader


class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.weights1 = nn.Parameter(torch.randn(input_size, hidden_size))  # weights of size (input_size, hidden_size)
        self.weights2 = nn.Parameter(torch.randn(hidden_size, num_classes)) # weights of size (hidden_size, num_classes)

    def forward(self, x):
        hidden_outputs = []
        for neuron_weights in self.weights1.T:  # loop over each neuron's weights in the first layer
            output_value = input_output_nonlinearity_torch(x.squeeze(), neuron_weights, C_total = C_total, readoutStrength = readoutStrength)
            output_value = F.relu(output_value)  # apply a relu activation function
            hidden_outputs.append(output_value)

        final_outputs = []
        for neuron_weights in self.weights2.T:  # loop over each neuron's weights in the second layer
            output_value = input_output_nonlinearity_torch(hidden_outputs, neuron_weights, C_total = C_total, readoutStrength = readoutStrength)
            final_outputs.append(output_value)

        final_outputs = [output.unsqueeze(0) for output in final_outputs]
        final_output = torch.stack(final_outputs, dim=0)
        final_output = final_output.t()
        return final_output

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# hyperparameters

input_size = 100
hidden_size = 100
num_classes = 10
num_epochs = 10
batch_size = 1
learning_rate = 0.001

pixelX = 10

# MNIST dataset (28x28 images!)
# train_dataset = datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
# test_dataset = datasets.MNIST(root='./data', train=False, transform=transforms.ToTensor())

# Compressed images to (10x10 images!)
# Define a new transform to resize the images
resize_transform = transforms.Resize((10, 10))

# MNIST dataset with resize transform
train_dataset = datasets.MNIST(root='./data', train=True, transform=transforms.Compose([
    transforms.ToTensor(),
    resize_transform
]), download=True)

test_dataset = datasets.MNIST(root='./data', train=False, transform=transforms.Compose([
    transforms.ToTensor(),
    resize_transform
]))


train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

# Update train loader and test loader
train_loader = reduce_dataset(train_loader, 0.01) # reduce to 1% of original size
test_loader = reduce_dataset(test_loader, 0.01) # reduce to 1% of original size


# instantiate the MLP
model = MLP(input_size, hidden_size, num_classes).to(device)

# loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)


# train the model
for epoch in range(num_epochs):
    running_loss = 0.0
    for i, (images, labels) in enumerate(train_loader):
        images = images.reshape(-1, pixelX*pixelX).to(device)
        labels = labels.to(device)
        
        # forward pass
        print('beginning forward pass')
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader)}")

# test the model
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, pixelX*pixelX).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Accuracy of the network on test images: {} %'.format(100 * correct / total))

Here is the code describing the nonlinear function f(x, w):

import torch
from torch import nn, optim
import numpy as np
from scipy import special

def readinKernel_torch(wdummy, z, Ec, Ep, kval=1, ic = 10**-9*torch.sqrt(3.14/(8*torch.log(torch.tensor([2.]))))*2*3.14*3*10**6, od = 10000, gamma = 2*3.14*18*10**9/(2*3.14*3*10**6), extra = 1, Np = (8*torch.log(torch.tensor([2.]))/(torch.pow(torch.tensor([10.])**-9*(2*3.14*3*10**6),2)*torch.tensor(torch.pi))).pow(0.25)):
    return Ec * kval * torch.special.bessel_j0(2* Ec * kval * torch.sqrt(torch.ger(z, (1 - wdummy))))*torch.exp(-1*1j*Ec**2*extra*repmat_torch(wdummy, len(z))*kval**2*gamma/od)* torch.sqrt(ic)*Ep*Np

def readoutKernel_torch(zdummy, z, B_in, Ec, kval=1):
    return  (Ec * kval *
            steep_sigmoid_torch(torch.sub(z.repeat(zdummy.size(0), 1).T, zdummy), 50) *
            1 / torch.sqrt(torch.clamp(torch.sub(z.repeat(zdummy.size(0), 1).T, zdummy), min=1e-10)) *
            torch.special.bessel_j1(2 * Ec * kval * torch.sqrt(torch.clamp(torch.sub(z.repeat(zdummy.size(0), 1).T, zdummy), min=1e-10))) *
            repmat_torch(B_in, len(z)))

def final_readoutKernel_torch(zdummy, w, Ec, B_in, kval=1):
    #  This is the same Kernel as the readin kernel, but with K(z, w) switched to K(w, z).    
    out = Ec * kval * torch.special.bessel_j0(2* Ec * kval * torch.sqrt(torch.ger(w, (1 - zdummy))))*repmat_torch(B_in, len(w))
    return out

def repmat_torch(arr, num_reps):
    return arr.view(1, -1).repeat(num_reps, 1)

def steep_sigmoid_torch(x, k=50):
    return 1.0 / (1.0 + torch.exp(-k*x))

def complex_trap_torch(z, xvals, axisdim):
    real_values = torch.real(z)
    imaginary_values = torch.imag(z)
    real_integral = torch.trapz(real_values, x=xvals, dim=axisdim)
    imaginary_integral = torch.trapz(imaginary_values, x=xvals, dim=axisdim)
    complexout = real_integral + 1j * imaginary_integral
    return complexout

def spinwave_recursive_calculation_torch(B_in, z_values, w_values, Ec, Ep, c_per_mode = 1):

    readin_values = readinKernel_torch(w_values, z_values, Ec, Ep, kval = c_per_mode)
    readout_values = readoutKernel_torch(z_values, z_values, B_in, Ec, kval = c_per_mode)

    readin_integrals = complex_trap_torch(readin_values, xvals=w_values, axisdim=1)
    readout_integrals = complex_trap_torch(readout_values, xvals=z_values, axisdim=1) 

    spinwave = readin_integrals - readout_integrals + B_in
    return spinwave



def input_output_nonlinearity_torch(x, w, numpoints = 100, C_total = 1, readoutStrength = 1):
    z_values =  torch.linspace(1e-10, 1-1e-10, numpoints)
    w_values =  torch.linspace(1e-10, 1-1e-10, numpoints)

    Bin = torch.zeros(len(z_values), dtype=torch.complex128)
    BoutMatrix = repmat_torch(Bin, len(w))

    c_per_mode = C_total/len(w)

    for i in range(len(w)):
        E_c_val = w[i]
        E_p_val = x[i]
        # print('E_p_val', E_p_val)
        # print('x', x)
        BoutMatrix[i, :] = spinwave_recursive_calculation_torch(Bin, z_values, w_values, E_c_val, E_p_val, c_per_mode)
        Bin = BoutMatrix[i, :]
    Bout = BoutMatrix[-1, :]

    output_Efield_w_z = final_readoutKernel_torch(z_values, w_values, readoutStrength, Bout, kval=1)
    output_Efield_w = torch.trapz(torch.real(output_Efield_w_z), x = z_values, dim =1)
    output_Efield = torch.trapz(torch.real(output_Efield_w), x = w_values, dim = 0)

    return output_Efield

(Again, apologies for the long code, but the key difficulty of my problem is precisely that something this complicated is hard to vectorize. If I wrote a simpler example, answers showing how to vectorize that simplified problem would not help me.)

python pytorch neural-network vectorization perceptron
1 Answer

I don't think the problem is that the function is slow, but that the for loops are too large:

My calculation:

First for loop (neuron_weights = 10,000)

for neuron_weights in self.weights1.T:  # loop over each neuron's weights in the first layer
    output_value = input_output_nonlinearity_torch(
        x.squeeze(),
        neuron_weights,
        C_total=C_total,
        readoutStrength=readoutStrength,
    )

Second for loop (w = 10,000)

def input_output_nonlinearity_torch(x, w, numpoints = 100, C_total = 1, readoutStrength = 1):
    ...

    for i in range(len(w)):
        ...

So this inner function runs 100,000,000 times per pass... am I reading that right?
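One way to verify the count is to instrument the inner function with a call counter for a single forward pass; a minimal sketch (assuming it runs inside the training script above, where forward() resolves input_output_nonlinearity_torch through the module globals):

import torch

call_count = 0
_original_f = input_output_nonlinearity_torch

def counting_f(*args, **kwargs):
    # wrapper that counts every call before delegating to the original function
    global call_count
    call_count += 1
    return _original_f(*args, **kwargs)

input_output_nonlinearity_torch = counting_f  # rebind the name forward() looks up

images, labels = next(iter(train_loader))
images = images.reshape(-1, pixelX * pixelX).to(device)
with torch.no_grad():
    model(images)
print('inner-function calls in one forward pass:', call_count)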
