Any help will be greatly appreciated. I am confused about what patch_size should be when I use image_size=(128, 256). With patch_size=16, the code only runs up to the model training phase. Here is how I set the dimensions:
--------------------------
from torchvision import transforms
#import torch.nn.functional as nnf
# Create image size
IMG_SIZE = 128, 256
# Create transform pipeline manually
manual_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE)),
    transforms.ToTensor(),
])
print(f"Manually created transforms: {manual_transforms}")
Output: Manually created transforms: Compose(
    Resize(size=(128, 256), interpolation=bilinear, max_size=None, antialias=warn)
    ToTensor()
)
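I am trying to use a Vision Transformer for image classification on a custom dataset, but all the images in my dataset are rectangular. When I use image size = 224, the accuracy is not very good. My guess is that the rectangular images are being reshaped into 224 x 224 squares, so the image features are not fully extracted during training. I want to feed the images into the Transformer encoder as 128 x 256, but when I set patch_size=16 I get an error on entering the model training phase. **The runtime error is: RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1024 but got size 64 for tensor number 1 in the list.**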
I get error-free results when I use an image size of 224 x 224; the issue is with rectangular dimensions. I am using batch_size=16, image_size=(128, 256), and patch_size=16. Is patch_size=16 OK, or should it be patch_size=(16, 32)? When I try to set patch_size=(16, 32), as in the code below, I get an error:
# Set the batch size
BATCH_SIZE = 16
# Create data loaders
train_dataloader, test_dataloader, class_names = create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=manual_transforms,
    batch_size=BATCH_SIZE
)
train_dataloader, test_dataloader, class_names
----------------------------------
OUTPUT:(<torch.utils.data.dataloader.DataLoader at 0x2148d4b3700>,
 <torch.utils.data.dataloader.DataLoader at 0x2148d4b3a00>,
 ['tempered', 'genuine'])
______________________________________________
# 1. Create a class which subclasses nn.Module
class PatchEmbedding(nn.Module):
    """Turns a 2D input image into a 1D sequence learnable embedding vector.

    Args:
        in_channels (int): Number of color channels for the input images. Defaults to 3.
        patch_size (int): Size of patches to convert input image into. Defaults to 16.
        embedding_dim (int): Size of embedding to turn image into. Defaults to 768.
    """
    # 2. Initialize the class with appropriate variables
    def __init__(self,
                 in_channels:int=3,
                 patch_size:int=(16, 32),
                 embedding_dim:int=768):
        super().__init__()
        # 3. Create a layer to turn an image into patches
        self.patcher = nn.Conv2d(in_channels=in_channels,
                                 out_channels=embedding_dim,
                                 kernel_size=patch_size,
                                 stride=patch_size,
                                 padding=0)
        # 4. Create a layer to flatten the patch feature maps into a single dimension
        self.flatten = nn.Flatten(start_dim=2, # only flatten the feature map dimensions into a single vector
                                  end_dim=3)
    # 5. Define the forward method
    def forward(self, x):
        # Create assertion to check that inputs are the correct shape
        image_resolution = x.shape[-1]
        assert image_resolution % patch_size == 0, f"Input image size must be divisble by patch size, image shape: {image_resolution}, patch size: {patch_size}"
        # Perform the forward pass
        x_patched = self.patcher(x)
        x_flattened = self.flatten(x_patched)
        # 6. Make sure the output shape has the right order
        return x_flattened.permute(0, 2, 1)
-------------------------------------------------
# Let's test it on a single image
patch_size = (16, 32)
# Set seeds
def set_seeds(seed: int=42):
    """Sets random seeds for torch operations.

    Args:
        seed (int, optional): Random seed to set. Defaults to 42.
    """
    # Set the seed for general torch operations
    torch.manual_seed(seed)
    # Set the seed for CUDA torch operations (ones that happen on the GPU)
    torch.cuda.manual_seed(seed)
set_seeds()
# Create an instance of patch embedding layer
patchify = PatchEmbedding(in_channels=3,
                          patch_size=(16, 32),
                          embedding_dim=768)
# Pass a single image through
print(f"Input image shape: {image.unsqueeze(0).shape}")
patch_embedded_image = patchify(image.unsqueeze(0)) # add an extra batch dimension on the 0th index, otherwise will error
print(f"Output patch embedding shape: {patch_embedded_image.shape}")
**Here is the full error traceback for the code block above:**
________________________________________________________________
TypeError                                 Traceback (most recent call last)
Cell In[111], line 27
     25 # Pass a single image through
     26 print(f"Input image shape: {image.unsqueeze(0).shape}")
---> 27 patch_embedded_image = patchify(image.unsqueeze(0)) # add an extra batch dimension on the 0th index, otherwise will error
     28 print(f"Output patch embedding shape: {patch_embedded_image.shape}")
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None
Cell In[107], line 32, in PatchEmbedding.forward(self, x)
     29 def forward(self, x):
     30     # Create assertion to check that inputs are the correct shape
     31     image_resolution = x.shape[-1]
---> 32     assert image_resolution % patch_size == 0, f"Input image size must be divisble by patch size, image shape: {image_resolution}, patch size: {patch_size}"
     34     # Perform the forward pass
     35     x_patched = self.patcher(x)
TypeError: unsupported operand type(s) for %: 'int' and 'tuple'
----------------------------------------------------------
But if I set patch_size=16 throughout, the code after this point fails with:
TypeError: cannot unpack non-iterable int object
_________________________________________________________
If I use patch_size=(16, 32), then:
---------------------
import torch
from vit_pytorch import ViT
class Vit(nn.Module):
    """creates a vision transformer architecture with vit-base hyperparameters by default."""
    def __init__(self, img_size=(128, 256), in_channels=3,
                 patch_size=(16, 32), num_transformer_layers=12,
                 embedding_dim=768,
                 mlp_size=3072, num_classes=1000, dim=1024, depth=6, num_heads=8,
                 mlp_dim=2048, mlp_dropout=0.1, embedding_dropout=0.1):
        super().__init__()
        self.img_size = img_size
        self.in_channels = in_channels
        self.patch_size = patch_size
        self.num_transformer_layers = num_transformer_layers
        self.embedding_dim = embedding_dim
        self.mlp_size = mlp_size
        self.num_classes = num_classes
        self.dim = dim
        self.depth = depth
        self.num_heads = num_heads
        self.mlp_dim = mlp_dim
        self.mlp_dropout = mlp_dropout
        self.embedding_dropout = embedding_dropout
        # calculate number of patches
        height, width = img_size
        patch_height, patch_width = patch_size
        self.num_patches = (height // patch_height) * (width // patch_width)
        # calculate patch embedding
        self.patch_embedding = nn.Conv2d(in_channels=in_channels,
                                         embedding_dim=enbedding_dim, patch_size=patch_size,
                                         stride=patch_size, bias=False)
        self.patch_embedding = nn.Conv2d(in_channels=in_channels,
                                         kernel_size=patch_size,
                                         embedding_dim=enbedding_dim,
                                         bias=False)
        # calculate class token
        self.class_token = nn.Parameter(torch.randn(1, 1, dim))
        # calculate positional embeddings
        self.row_embeddings = nn.Parameter(torch.randn(height // patch_height, 1, dim))
        self.col_embeddings = nn.Parameter(torch.randn(width // patch_width, 1, dim))
        # calculate transformer blocks
        self.transformer_encoder = nn.ModuleList([TransformerEncoderBlock(
            embedding_dim=embedding_dim,
            num_heads=num_heads,
            mlp_size=mlp_size,
            mlp_dropout=mlp_dropout)
            for _ in range(num_transformer_layers)
        ])
        # calculate layer normalization
        self.layer_norm = nn.LayerNorm(dim)
        # calculate classification head
        self.classification_head = nn.Linear(dim, num_classes)
    def forward(self, x):
        # calculate patch embeddings
        x = self.patch_embedding(x)
        x = x.flatten(2).transpose(1, 2)
        # calculate class token
        class_token = self.class_token.expand(x.shape[0], -1, -1)
        x = torch.cat((class_token, x), dim=1)
        # calculate positional embeddings
        row_embeddings = self.row_embeddings.repeat(1, x.shape[0], 1)
        col_embeddings = self.col_embeddings.repeat(1, x.shape[0], 1)
        x = x + row_embeddings + col_embeddings
        # calculate transformer blocks
        for transformer_block in self.transformer_encoder:
            x = transformer_block(x)
        # calculate layer normalization
        x = self.layer_norm(x)
        # calculate classification head
        class_logits = self.classification_head(x[:, 0])
        return class_logits
_______________________________________
# Train our model
# Create an instance of ViT with the number of classes we're working with
vit = Vit(num_classes=len(class_names))
____________________________________________________________
from going_modular.going_modular import engine
# Setup the optimizer to optimize our ViT model parameters using hyperparameters from the ViT paper
optimizer = torch.optim.Adam(params=vit.parameters(),
                             lr=3e-3, # Base LR from Table 3 for ViT-* ImageNet-1k
                             betas=(0.9, 0.999),
                             weight_decay=0.3) # from the ViT paper section 4.1 (Training & Fine-tuning) and Table 3 for ViT-* ImageNet-1k
# Setup the loss function for multi-class classification
loss_fn = torch.nn.CrossEntropyLoss()
# Set the seeds
set_seeds()
# Train the model and save the training results to a dictionary
results = engine.train(model=vit,
                       train_dataloader=train_dataloader,
                       test_dataloader=test_dataloader,
                       optimizer=optimizer,
                       loss_fn=loss_fn,
                       epochs=10,
                       device=device)
---------------------------------
OUT-
0%|
-----------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[132], line 16
     13 set_seeds()
     15 # Train the model and save the training results to a dictionary
---> 16 results = engine.train(model=vit,
     17                        train_dataloader=train_dataloader,
     18                        test_dataloader=test_dataloader,
     19                        optimizer=optimizer,
     20                        loss_fn=loss_fn,
     21                        epochs=10,
     22                        device=device)
File ~\AppData\Local\Programs\Python\Python39\Scripts\Image-Classification-Using-Vision-transformer-main\going_modular\going_modular\engine.py:169, in train(model, train_dataloader, test_dataloader, optimizer, loss_fn, epochs, device)
    167 # Loop through training and testing steps for a number of epochs
    168 for epoch in tqdm(range(epochs)):
--> 169     train_loss, train_acc = train_step(model=model,
    170                                        dataloader=train_dataloader,
    171                                        loss_fn=loss_fn,
    172                                        optimizer=optimizer,
    173                                        device=device)
    174     test_loss, test_acc = test_step(model=model,
    175                                     dataloader=test_dataloader,
    176                                     loss_fn=loss_fn,
    177                                     device=device)
    179     # Print out what's happening
File ~\AppData\Local\Programs\Python\Python39\Scripts\Image-Classification-Using-Vision-transformer-main\going_modular\going_modular\engine.py:45, in train_step(model, dataloader, loss_fn, optimizer, device)
     42 X, y = X.to(device), y.to(device)
     44 # 1. Forward pass
---> 45 y_pred = model(X)
     47 # 2. Calculate and accumulate loss
     48 loss = loss_fn(y_pred, y)
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)
File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None
Cell In[94], line 63, in Vit.forward(self, x)
     61 # calculate class token
     62 class_token = self.class_token.expand(x.shape[0], -1, -1)
---> 63 x = torch.cat((class_token, x), dim=1)
     65 # calculate positional embeddings
     66 row_embeddings = self.row_embeddings.repeat(1, x.shape[0], 1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1024 but got size 64 for tensor number 1 in the list.
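I don't know the architecture you are using, but it is common to use square input images. Although it looks odd to a human, it makes no difference to a model that is trained on distorted images and then fed equally distorted images at inference time.
I assume your framework does not support rectangular inputs.
You can make the input square by padding the image with black bars. This can easily be done in your data loader.
# just for demonstration
import numpy as np
rect = np.ones((128, 256, 3)) # rectangular image
square = np.zeros((256, 256, 3)) # square image
square[0:128,:,:] = rect # fill the rect into the square
Of course, you need to preprocess the images in the same way at inference time.
Note, however, that padding makes the model unnecessarily large. So don't worry, and just use your distorted images.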
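If you do want to try the black-bar padding described above, the border sizes can be computed with a small helper. This is only a sketch: the name `square_padding` is hypothetical, and the (left, top, right, bottom) order is chosen on the assumption that you would pass the result to something like torchvision's `transforms.Pad`, which accepts a 4-tuple in that order. Unlike the NumPy demonstration, it centres the image instead of placing it at the top.

```python
def square_padding(height: int, width: int) -> tuple:
    """Return (left, top, right, bottom) padding that makes an image square.

    Hypothetical helper: the image is centred, and any odd leftover pixel
    goes to the right or bottom border.
    """
    side = max(height, width)
    pad_w = side - width    # total padding needed along the width
    pad_h = side - height   # total padding needed along the height
    left = pad_w // 2
    top = pad_h // 2
    return (left, top, pad_w - left, pad_h - top)


# A 128 x 256 (height x width) image needs 64 rows above and 64 rows below.
print(square_padding(128, 256))  # (0, 64, 0, 64)
```

The same helper has to be used for training and inference so that both see identically padded images.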
You are using the ViT class from the vit_pytorch module. Here is the ViT class definition from GitHub:
class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)
        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
        num_patches = (image_height // patch_height) * (image_width // patch_width)
        # more code follows that we dont care about ...
Here is the definition of pair():
def pair(t):
    return t if isinstance(t, tuple) else (t, t)
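To see what that helper does with the values from the question, here is a short sketch that repeats the `pair()` definition quoted above and calls it with an int and with a tuple. The last call is an illustrative edge case rather than something taken from the library's documentation: because the check is `isinstance(t, tuple)`, a list is treated like a scalar and duplicated.

```python
def pair(t):
    # Same helper as quoted above: keep tuples, duplicate anything else.
    return t if isinstance(t, tuple) else (t, t)


# A single int becomes a square (height, width) pair.
patch_height, patch_width = pair(16)
print(patch_height, patch_width)  # 16 16

# A tuple is passed through unchanged, height first.
image_height, image_width = pair((128, 256))
print(image_height, image_width)  # 128 256

# A list is not a tuple, so it is duplicated rather than unpacked.
print(pair([16, 32]))  # ([16, 32], [16, 32])
```

This is also why an int patch size unpacks cleanly after `pair()`, whereas unpacking a bare int raises "cannot unpack non-iterable int object".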
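This is taken from the documentation:
image_size: If you have rectangular images, make sure your image size is the maximum of the width and height.
patch_size: Size of patches. image_size must be divisible by patch_size.
The number of patches is n = (image_size // patch_size) ** 2, and n must be greater than 16. Depending on your situation, choose 16 (larger model) or 32 (smaller model).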
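As you can see, ViT can handle image_size either as a tuple (height first, then width) or as a single number. The pair function handles the latter case by duplicating the value into a tuple.
Apart from the divisibility assertion, only one line of code ever uses image_size, namely the one that computes the number of tokens. That calculation is quite simple: it divides image_height by patch_height and image_width by patch_width, then multiplies the two results.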
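The token-count arithmetic can be checked by hand for the sizes in the question. The sketch below mirrors the assertion and the `num_patches` line from the quoted class; `count_patches` is a hypothetical name used only for illustration and is not part of vit_pytorch.

```python
def pair(t):
    # Same helper as in the quoted vit_pytorch code.
    return t if isinstance(t, tuple) else (t, t)


def count_patches(image_size, patch_size):
    """Number of patches (tokens, excluding the class token) for a given setup."""
    image_height, image_width = pair(image_size)
    patch_height, patch_width = pair(patch_size)
    if image_height % patch_height != 0 or image_width % patch_width != 0:
        raise ValueError("Image dimensions must be divisible by the patch size.")
    return (image_height // patch_height) * (image_width // patch_width)


print(count_patches(224, 16))               # 14 * 14 = 196
print(count_patches((128, 256), 16))        # 8 * 16 = 128
print(count_patches((128, 256), (16, 32)))  # 8 * 8 = 64
```

Both patch_size=16 and patch_size=(16, 32) divide a 128 x 256 image evenly, so either is valid; they differ only in how many tokens the encoder sees.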
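So, back to your question: the documentation tells a different story than the code. Reading the code, the situation is clear, and I suggest passing a tuple (image_height, image_width) as the image_size argument of ViT().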
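If you keep a hand-written patch embedding layer instead of the vit_pytorch class, the same `pair()` idea can be applied there. The following is a minimal sketch, not the library's implementation: `RectPatchEmbedding` is a hypothetical name, and it assumes that the layer should store its own patch size and check height and width separately, which avoids both the `int % tuple` error and the reliance on a global `patch_size`.

```python
import torch
from torch import nn


def pair(t):
    # Keep tuples, duplicate anything else (same helper as in vit_pytorch).
    return t if isinstance(t, tuple) else (t, t)


class RectPatchEmbedding(nn.Module):
    """Hypothetical patch embedding that accepts int or (height, width) patch sizes."""

    def __init__(self, in_channels=3, patch_size=16, embedding_dim=768):
        super().__init__()
        self.patch_size = pair(patch_size)  # always stored as (height, width)
        self.patcher = nn.Conv2d(in_channels=in_channels,
                                 out_channels=embedding_dim,
                                 kernel_size=self.patch_size,
                                 stride=self.patch_size,
                                 padding=0)
        self.flatten = nn.Flatten(start_dim=2, end_dim=3)

    def forward(self, x):
        height, width = x.shape[-2:]
        patch_height, patch_width = self.patch_size
        # Check each dimension against its own patch dimension.
        assert height % patch_height == 0 and width % patch_width == 0, (
            f"Image size {(height, width)} must be divisible by patch size {self.patch_size}"
        )
        x = self.flatten(self.patcher(x))  # (batch, embedding_dim, num_patches)
        return x.permute(0, 2, 1)          # (batch, num_patches, embedding_dim)


images = torch.randn(2, 3, 128, 256)  # random stand-in for a batch of 128 x 256 images
print(RectPatchEmbedding(patch_size=16)(images).shape)        # torch.Size([2, 128, 768])
print(RectPatchEmbedding(patch_size=(16, 32))(images).shape)  # torch.Size([2, 64, 768])
```

Whatever consumes these embeddings (class token, positional embeddings, encoder blocks) has to use the same embedding dimension as the last axis of this output.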