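I have a video in .mp4 format: eval.mp4. I also have a fine-tuned PyTorch ResNet, which I want to use to run inference on either individual frames read from the video or individual .png files saved to disk.
My pretrained network works correctly on .png files that I load from disk and pass through the training/validation transforms. During inference, however, I do not want to write every frame of eval.mp4 to disk as a .png file just to run inference on it. Instead, I would like to convert each captured frame directly into the format the network expects.
My dataset class and dataloaders look like this: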
# create total dataset, no transforms
class MouseDataset(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.mouse_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.mouse_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        # img_name is root_dir+file_name
        img_name = os.path.join(self.root_dir,
                                self.mouse_frame.iloc[idx, 0])
        image = Image.open(img_name)
        coordinates = self.mouse_frame.iloc[idx, 1:]
        coordinates = np.array([coordinates])
        if self.transform:
            image = self.transform(image)
        return (image, coordinates)

# break total dataset into subsets for different transforms
class DatasetSubset(Dataset):
    def __init__(self, dataset, transform=None):
        self.dataset = dataset
        self.transform = transform

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # get image
        image = self.dataset[index][0]
        # transform for input into nn
        if self.transform:
            image = image.convert('RGB')
            image = self.transform(image)
            image = image.to(torch.float)
            #image = torch.unsqueeze(image, 0)
        # get coordinates
        coordinates = self.dataset[index][1]
        # transform for input into nn
        coordinates = coordinates.astype('float').reshape(-1, 2)
        coordinates = torch.from_numpy(coordinates)
        coordinates = coordinates.to(torch.float)
        return (image, coordinates)

# create training / val split
train_split = 0.8
train_count = int(train_split * len(total_dataset))
val_count = int(len(total_dataset) - train_count)
train_subset, val_subset = torch.utils.data.random_split(total_dataset, [train_count, val_count])
# create training / val datasets
train_dataset = DatasetSubset(train_subset, transform = data_transforms['train'])
val_dataset = DatasetSubset(val_subset, transform = data_transforms['val'])
# create train / val dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, num_workers=num_workers)
dataloaders_dict = {}
dataloaders_dict['train'] = train_dataloader
dataloaders_dict['val'] = val_dataloader
My training and validation transforms (identical, for testing purposes):
# Data augmentation and normalization for training
# Just normalization for validation
# required dimensions of input image
input_image_width = 224
input_image_height = 224
# mean and std of RGB pixel intensities
# ImageNet mean [0.485, 0.456, 0.406]
# ImageNet standard deviation [0.229, 0.224, 0.225]
model_mean = [0.485, 0.456, 0.406]
model_std = [0.229, 0.224, 0.225]
data_transforms = {
    'train': transforms.Compose([
        transforms.Resize((input_image_height, input_image_width)),
        transforms.ToTensor(),
        transforms.Normalize(model_mean, model_std)
    ]),
    'val': transforms.Compose([
        transforms.Resize((input_image_height, input_image_width)),
        transforms.ToTensor(),
        transforms.Normalize(model_mean, model_std)
    ]),
}
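For readers unfamiliar with these transforms, the arithmetic that ToTensor followed by Normalize performs can be sketched in plain NumPy. This is an illustrative approximation, not torchvision's implementation, and the helper name to_normalized_chw is made up for this example. Because the mean and standard deviation are applied per channel, the result depends on the channels arriving in RGB order.

```python
import numpy as np

# ImageNet statistics from the transforms above (RGB order)
model_mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
model_std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_normalized_chw(image_hwc_uint8):
    """Hypothetical helper: mimic ToTensor + Normalize on an HxWx3 uint8 RGB array."""
    # ToTensor: scale [0, 255] to [0.0, 1.0] and move channels first (HWC -> CHW)
    chw = image_hwc_uint8.astype(np.float32).transpose(2, 0, 1) / 255.0
    # Normalize: subtract the per-channel mean and divide by the per-channel std
    return (chw - model_mean[:, None, None]) / model_std[:, None, None]

# A 2x2 pure-red image: R=255, G=0, B=0
red = np.zeros((2, 2, 3), dtype=np.uint8)
red[..., 0] = 255
normalized = to_normalized_chw(red)
print(normalized.shape)     # (3, 2, 2)
print(normalized[:, 0, 0])  # per-channel values for the top-left pixel
```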
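What I tried was to read each frame from an OpenCV VideoCapture object, convert it to a PIL image using this answer, and then run inference. However, the results differ considerably from simply reading the frame, saving it as a .png, and running inference on the .png.
The code I am testing: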
# Standard imports
import cv2
import numpy as np
import torch
import torchvision
from torchvision import models, transforms
from PIL import Image
# load best model for evaluation
BEST_PATH = 'resnet152_best.pt'
model_ft = torch.load(BEST_PATH)
#print(model_ft)
model_ft.eval()
# Data augmentation and normalization for training
# Just normalization for validation
# required dimensions of input image
input_image_width = 224
input_image_height = 224
# mean and std of RGB pixel intensities
# ImageNet mean [0.485, 0.456, 0.406]
# ImageNet standard deviation [0.229, 0.224, 0.225]
model_mean = [0.485, 0.456, 0.406]
model_std = [0.229, 0.224, 0.225]
data_transforms = {
    'train': transforms.Compose([
        transforms.Resize((input_image_height, input_image_width)),
        transforms.ToTensor(),
        transforms.Normalize(model_mean, model_std)
    ]),
    'val': transforms.Compose([
        transforms.Resize((input_image_height, input_image_width)),
        transforms.ToTensor(),
        transforms.Normalize(model_mean, model_std)
    ]),
}
# Read image
cap = cv2.VideoCapture('eval.mp4')
total_frames = cap.get(7)
cap.set(1, 6840)
ret, frame = cap.read()
cv2.imwrite('eval_6840.png', frame)
png_file = 'eval_6840.png'
# eval png
png_image = Image.open(png_file)
png_image = png_image.convert('RGB')
png_image = data_transforms['val'](png_image)
png_image = png_image.to(torch.float)
png_image = torch.unsqueeze(png_image, 0)
print(png_image.shape)
output = model_ft(png_image)
print(output)
# eval frame
vid_image = Image.fromarray(frame)
vid_image = vid_image.convert('RGB')
vid_image = data_transforms['val'](vid_image)
vid_image = vid_image.to(torch.float)
vid_image = torch.unsqueeze(vid_image, 0)
print(vid_image.shape)
output = model_ft(vid_image)
print(output)
Returns:
torch.Size([1, 3, 224, 224])
tensor([[ 0.0229, -0.0990]], grad_fn=<AddmmBackward0>)
torch.Size([1, 3, 224, 224])
tensor([[ 0.0797, -0.2219]], grad_fn=<AddmmBackward0>)
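My questions are:
(1) Why does evaluating the OpenCV frame give a different result from evaluating the .png file? All the transforms appear to be identical (including the RGB conversion suggested in the comments).
(2) Given that both images are captured from the same section of the video, how can I make the frame evaluation match the .png evaluation?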
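Posting this answer here in case it helps anyone.
The issue is that:
png_image = Image.open(png_file)
creates an object of type PIL.PngImagePlugin.PngImageFile.
The video-capture frame, however, is an object of type numpy.ndarray, and the conversion step vid_image = Image.fromarray(frame) creates an object of type PIL.Image.Image.
I tried converting the PIL.Image.Image object to PIL.PngImagePlugin.PngImageFile, and vice versa, to make them comparable, but that does not seem to be possible with PIL's convert method. Others seem to have run into this problem as well.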
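One way to check whether the class difference matters on its own is to round-trip an image through PNG in memory. The following is a minimal sketch, assuming Pillow and NumPy are installed; the 4x4 test image is made up for illustration. In Pillow, PngImageFile is a subclass of Image.Image, and the decoded pixels come out the same either way.

```python
import io

import numpy as np
from PIL import Image

# Hypothetical 4x4 pure-red RGB image used only for this check
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[..., 0] = 255

# A plain PIL.Image.Image built from an array
plain = Image.fromarray(rgb)

# The same image encoded as PNG in memory and opened again
buffer = io.BytesIO()
plain.save(buffer, format="PNG")
buffer.seek(0)
from_png = Image.open(buffer)

print(type(from_png).__name__)            # PngImageFile
print(isinstance(from_png, Image.Image))  # True
same_pixels = np.array_equal(np.array(from_png), np.array(plain))
print(same_pixels)                        # True
```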
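The solution, therefore, is to convert back and forth between numpy.ndarray and PIL image types so as to use the torchvision transforms, which rely on PIL. It may not be the most efficient approach, but the end result is identical input objects and identical model predictions. Note that the code below also converts the frame from BGR to RGB with cv2.cvtColor: OpenCV returns frames in BGR channel order, whereas Image.fromarray assumes RGB, so this channel swap is probably what actually brings the two inputs into agreement.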
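The effect of channel order when moving from an OpenCV-style array to PIL can be seen without a video file. This is a small sketch under stated assumptions: the 2x2 frame is made up, and reversing the last axis with NumPy stands in for cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) so that the example does not require OpenCV.

```python
import numpy as np
from PIL import Image

# Hypothetical 2x2 frame holding pure red, stored the way OpenCV stores it (B, G, R)
frame_bgr = np.zeros((2, 2, 3), dtype=np.uint8)
frame_bgr[..., 2] = 255

# Passing the BGR array straight to PIL mislabels the channels
wrong = Image.fromarray(frame_bgr)

# Reversing the last axis gives RGB; same effect as cv2.COLOR_BGR2RGB
frame_rgb = np.ascontiguousarray(frame_bgr[..., ::-1])
right = Image.fromarray(frame_rgb)

print(wrong.getpixel((0, 0)))  # (0, 0, 255): PIL sees blue
print(right.getpixel((0, 0)))  # (255, 0, 0): red, as intended
```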
For reference:
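# Read image
cap = cv2.VideoCapture('eval.mp4')
total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)  # property id 7
cap.set(cv2.CAP_PROP_POS_FRAMES, 6840)  # property id 1: seek to frame 6840
ret, frame = cap.read()  # frame is a numpy.ndarray in BGR channel order
cv2.imwrite('eval_6840.png', frame)
png_file = 'eval_6840.png'
# device was not defined in the original snippet; assume the model's own device
device = next(model_ft.parameters()).device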
# eval png
png_image = Image.open(png_file)
png_array = np.array(png_image)
png_image = Image.fromarray(png_array)
png_image = data_transforms['val'](png_image)
png_image = png_image.to(torch.float)
png_image = torch.unsqueeze(png_image, 0)
png_image = png_image.to(device)
output = model_ft(png_image)
print(output)
# eval frame
vid_array = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
vid_image = Image.fromarray(vid_array)
vid_image = data_transforms['val'](vid_image)
vid_image = vid_image.to(torch.float)
vid_image = torch.unsqueeze(vid_image, 0)
vid_image = vid_image.to(device)
output = model_ft(vid_image)
print(output)
Yields:
tensor([[ 0.0229, -0.0990]], grad_fn=<AddmmBackward0>)
tensor([[ 0.0229, -0.0990]], grad_fn=<AddmmBackward0>)
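To apply the same conversion to every frame without writing .png files, the read loop can be wrapped in a generator. This is only a sketch: iter_rgb_frames and FakeCapture are hypothetical names introduced here, and the fake capture object exists solely so the example runs without OpenCV or a video file. The generator relies only on a read() method that returns (ok, frame), as cv2.VideoCapture does.

```python
import numpy as np

def iter_rgb_frames(capture):
    """Yield (index, RGB frame) pairs from a cv2.VideoCapture-style object.

    Only capture.read() is used; it is expected to return (ok, frame)
    with the frame in BGR channel order.
    """
    index = 0
    while True:
        ok, frame_bgr = capture.read()
        if not ok:  # end of stream or read failure
            break
        # Reverse the channel axis: BGR -> RGB
        yield index, np.ascontiguousarray(frame_bgr[..., ::-1])
        index += 1

class FakeCapture:
    """Hypothetical stand-in for cv2.VideoCapture so the sketch runs without a video."""

    def __init__(self, frames):
        self._frames = list(frames)

    def read(self):
        if self._frames:
            return True, self._frames.pop(0)
        return False, None

# Three tiny frames whose blue channel (index 0 in BGR) encodes the frame number
frames = [np.full((2, 2, 3), (i, 0, 0), dtype=np.uint8) for i in range(3)]
collected = list(iter_rgb_frames(FakeCapture(frames)))
print(len(collected))  # 3
```

With a real cv2.VideoCapture('eval.mp4'), each yielded array could be passed to Image.fromarray and data_transforms['val'] as in the code above. Wrapping the forward pass in torch.no_grad() would also avoid tracking gradients during inference; the grad_fn in the printed outputs indicates they are currently being tracked.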
May I ask which dataset you used to train the neural network? Thanks!