I have a dataset of rectangular images for a Vision Transformer. I set image_size=(128, 256), but what should the patch size be?

Problem description (votes: 0, answers: 2)

Any help will be greatly appreciated. I am very confused: when I use image_size=(128, 256), what should the patch_size be? Only with patch_size=16 can I run the code up to the model-training phase. Here is how I set the dimensions:
--------------------------

# import torch.nn.functional as nnf
from torchvision import transforms

# Create image size (height, width)
IMG_SIZE = (128, 256)

# Create transform pipeline manually
manual_transforms = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.ToTensor(),
])
print(f"Manually created transforms: {manual_transforms}")

Output: Manually created transforms: Compose(
    Resize(size=(128, 256), interpolation=bilinear, max_size=None, antialias=warn)
    ToTensor()
)

I am trying to use a Vision Transformer for image classification on a custom dataset, but all of my images are rectangular. When I use image_size=224 the accuracy is not good; I suspect that reshaping the rectangular images into 224x224 squares means the image features are not fully captured during training. I want to feed 128x256 images into the Transformer encoder, but when I set patch_size=16 I get an error once I reach the model-training phase. The runtime error is: **RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1024 but got size 64 for tensor number 1 in the list.**

I get error-free results when I use an image size of 224x224, so the issue is with the rectangular dimensions. I am using batch_size=16, image_size=(128, 256), patch_size=16. But is patch_size=16 OK, or should it be patch_size=(16, 32)? Here is what happens when I try to set patch_size=(16, 32):
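For reference, here is the patch-count arithmetic for both options (just the math behind standard non-overlapping ViT patching):

# Number of patches for a 128x256 (height x width) image
H, W = 128, 256

# Option 1: square 16x16 patches
n_square = (H // 16) * (W // 16)   # 8 * 16 = 128 patches

# Option 2: rectangular 16x32 (height x width) patches
n_rect = (H // 16) * (W // 32)     # 8 * 8 = 64 patches

print(n_square, n_rect)  # 128 64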


import torch.nn.functional as nnf
from torchvision import transforms

# Create image size (height, width)
IMG_SIZE = (128, 256)

# Create transform pipeline manually
manual_transforms = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.ToTensor(),
])
print(f"Manually created transforms: {manual_transforms}")
--------
OUTPUT: Manually created transforms: Compose(
    Resize(size=(128, 256), interpolation=bilinear, max_size=None, antialias=warn)
    ToTensor()
)
--------
# Set the batch size
BATCH_SIZE = 16

# Create data loaders
train_dataloader, test_dataloader, class_names = create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=manual_transforms,
    batch_size=BATCH_SIZE
)

train_dataloader, test_dataloader, class_names

----------------------------------
OUTPUT: (<torch.utils.data.dataloader.DataLoader at 0x2148d4b3700>,
 <torch.utils.data.dataloader.DataLoader at 0x2148d4b3a00>,
 ['tempered', 'genuine'])
______________________________________________

# 1. Create a class which subclasses nn.Module
class PatchEmbedding(nn.Module):
    """Turns a 2D input image into a 1D sequence learnable embedding vector.

    Args:
        in_channels (int): Number of color channels for the input images. Defaults to 3.
        patch_size (int): Size of patches to convert input image into. Defaults to 16.
        embedding_dim (int): Size of embedding to turn image into. Defaults to 768.
    """
    # 2. Initialize the class with appropriate variables
    def __init__(self,
                 in_channels: int = 3,
                 patch_size: int = (16, 32),
                 embedding_dim: int = 768):
        super().__init__()

        # 3. Create a layer to turn an image into patches
        self.patcher = nn.Conv2d(in_channels=in_channels,
                                 out_channels=embedding_dim,
                                 kernel_size=patch_size,
                                 stride=patch_size,
                                 padding=0)

        # 4. Create a layer to flatten the patch feature maps into a single dimension
        self.flatten = nn.Flatten(start_dim=2,  # only flatten the feature map dimensions into a single vector
                                  end_dim=3)

    # 5. Define the forward method
    def forward(self, x):
        # Create assertion to check that inputs are the correct shape
        image_resolution = x.shape[-1]
        assert image_resolution % patch_size == 0, f"Input image size must be divisble by patch size, image shape: {image_resolution}, patch size: {patch_size}"

        # Perform the forward pass
        x_patched = self.patcher(x)
        x_flattened = self.flatten(x_patched)

        # 6. Make sure the output shape has the right order
        return x_flattened.permute(0, 2, 1)
-------------------------------------------------

# Let's test it on a single image
patch_size = (16, 32)

# Set seeds
def set_seeds(seed: int = 42):
    """Sets random seeds for torch operations.

    Args:
        seed (int, optional): Random seed to set. Defaults to 42.
    """
    # Set the seed for general torch operations
    torch.manual_seed(seed)
    # Set the seed for CUDA torch operations (ones that happen on the GPU)
    torch.cuda.manual_seed(seed)

set_seeds()

# Create an instance of patch embedding layer
patchify = PatchEmbedding(in_channels=3,
                          patch_size=(16, 32),
                          embedding_dim=768)

# Pass a single image through
print(f"Input image shape: {image.unsqueeze(0).shape}")
patch_embedded_image = patchify(image.unsqueeze(0))  # add an extra batch dimension on the 0th index, otherwise will error
print(f"Output patch embedding shape: {patch_embedded_image.shape}")
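For reference, if the assertion did not fail, the arithmetic for the expected output shape would be (a sketch of the shape math, not actual program output):

# Input (1, 3, 128, 256) with 16x32 patches:
# Conv2d -> (1, 768, 128 // 16, 256 // 32) = (1, 768, 8, 8)
# flatten + permute -> (1, 8 * 8, 768) = (1, 64, 768)
expected_shape = (1, (128 // 16) * (256 // 32), 768)
print(expected_shape)  # (1, 64, 768)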

**Here is the full code block where the error traceback appears:**

________________________________________________________________
TypeError                                 Traceback (most recent call last)
Cell In[111], line 27
     25 # Pass a single image through
     26 print(f"Input image shape: {image.unsqueeze(0).shape}")
---> 27 patch_embedded_image = patchify(image.unsqueeze(0)) # add an extra batch dimension on the 0th index, otherwise will error
     28 print(f"Output patch embedding shape: {patch_embedded_image.shape}")

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

Cell In[107], line 32, in PatchEmbedding.forward(self, x)
     29 def forward(self, x):
     30     # Create assertion to check that inputs are the correct shape
     31     image_resolution = x.shape[-1]
---> 32     assert image_resolution % patch_size == 0, f"Input image size must be divisble by patch size, image shape: {image_resolution}, patch size: {patch_size}"
     34     # Perform the forward pass
     35     x_patched = self.patcher(x)

TypeError: unsupported operand type(s) for %: 'int' and 'tuple'

----------------------------------------------------------

But if I set patch_size=16 throughout, then after this code I get the error:

TypeError: cannot unpack non-iterable int object
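Both errors come from patch_size being treated as an int in some places and as a (height, width) tuple in others. A minimal sketch of a check that accepts either form (not the original code; to_2tuple is a hypothetical helper) looks like this:

# Sketch only: normalize patch_size so an int or a (height, width) tuple both work,
# then check divisibility per axis instead of `image_resolution % patch_size`.
def to_2tuple(t):
    return t if isinstance(t, tuple) else (t, t)

patch_h, patch_w = to_2tuple(patch_size)   # works for 16 and for (16, 32)
img_h, img_w = x.shape[-2], x.shape[-1]    # x is the (B, C, H, W) input tensor
assert img_h % patch_h == 0 and img_w % patch_w == 0, (
    f"Input image size must be divisible by patch size, "
    f"image shape: {(img_h, img_w)}, patch size: {(patch_h, patch_w)}"
)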
_________________________________________________________
If I use patch_size=(16, 32), then:

---------------------
import torch
from vit_pytorch import ViT

class Vit(nn.Module):
    """creates a vision transformer architecture with vit-base hyperparameters by default."""

    def __init__(self, img_size=(128, 256), in_channels=3, patch_size=(16, 32),
                 num_transformer_layers=12, embedding_dim=768, mlp_size=3072,
                 num_classes=1000, dim=1024, depth=6, num_heads=8,
                 mlp_dim=2048, mlp_dropout=0.1, embedding_dropout=0.1):
        super().__init__()
        self.img_size = img_size
        self.in_channels = in_channels
        self.patch_size = patch_size
        self.num_transformer_layers = num_transformer_layers
        self.embedding_dim = embedding_dim
        self.mlp_size = mlp_size
        self.num_classes = num_classes
        self.dim = dim
        self.depth = depth
        self.num_heads = num_heads
        self.mlp_dim = mlp_dim
        self.mlp_dropout = mlp_dropout
        self.embedding_dropout = embedding_dropout

        # calculate number of patches
        height, width = img_size
        patch_height, patch_width = patch_size
        self.num_patches = (height // patch_height) * (width // patch_width)

        # calculate patch embedding
        self.patch_embedding = nn.Conv2d(in_channels=in_channels,
                                         embedding_dim=enbedding_dim,
                                         patch_size=patch_size,
                                         stride=patch_size,
                                         bias=False)
        self.patch_embedding = nn.Conv2d(in_channels=in_channels,
                                         kernel_size=patch_size,
                                         embedding_dim=enbedding_dim,
                                         bias=False)

        # calculate class token
        self.class_token = nn.Parameter(torch.randn(1, 1, dim))

        # calculate positional embeddings
        self.row_embeddings = nn.Parameter(torch.randn(height // patch_height, 1, dim))
        self.col_embeddings = nn.Parameter(torch.randn(width // patch_width, 1, dim))

        # calculate transformer blocks
        self.transformer_encoder = nn.ModuleList([
            TransformerEncoderBlock(embedding_dim=embedding_dim,
                                    num_heads=num_heads,
                                    mlp_size=mlp_size,
                                    mlp_dropout=mlp_dropout)
            for _ in range(num_transformer_layers)
        ])

        # calculate layer normalization
        self.layer_norm = nn.LayerNorm(dim)

        # calculate classification head
        self.classification_head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # calculate patch embeddings
        x = self.patch_embedding(x)
        x = x.flatten(2).transpose(1, 2)

        # calculate class token
        class_token = self.class_token.expand(x.shape[0], -1, -1)
        x = torch.cat((class_token, x), dim=1)

        # calculate positional embeddings
        row_embeddings = self.row_embeddings.repeat(1, x.shape[0], 1)
        col_embeddings = self.col_embeddings.repeat(1, x.shape[0], 1)
        x = x + row_embeddings + col_embeddings

        # calculate transformer blocks
        for transformer_block in self.transformer_encoder:
            x = transformer_block(x)

        # calculate layer normalization
        x = self.layer_norm(x)

        # calculate classification head
        class_logits = self.classification_head(x[:, 0])

        return class_logits
_______________________________________

# Train our model

# Create an instance of ViT with the number of classes we're working with
vit = Vit(num_classes=len(class_names))

____________________________________________________________

from going_modular.going_modular import engine

# Setup the optimizer to optimize our ViT model parameters using hyperparameters from the ViT paper
optimizer = torch.optim.Adam(params=vit.parameters(),
                             lr=3e-3,  # Base LR from Table 3 for ViT-* ImageNet-1k
                             betas=(0.9, 0.999),
                             weight_decay=0.3)  # from the ViT paper section 4.1 (Training & Fine-tuning) and Table 3 for ViT-* ImageNet-1k

# Setup the loss function for multi-class classification
loss_fn = torch.nn.CrossEntropyLoss()

# Set the seeds
set_seeds()

# Train the model and save the training results to a dictionary
results = engine.train(model=vit,
                       train_dataloader=train_dataloader,
                       test_dataloader=test_dataloader,
                       optimizer=optimizer,
                       loss_fn=loss_fn,
                       epochs=10,
                       device=device)

---------------------------------
OUT:
  0%|                                                                               |

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[132], line 16
     13 set_seeds()
     15 # Train the model and save the training results to a dictionary
---> 16 results = engine.train(model=vit,
     17                        train_dataloader=train_dataloader,
     18                        test_dataloader=test_dataloader,
     19                        optimizer=optimizer,
     20                        loss_fn=loss_fn,
     21                        epochs=10,
     22                        device=device)

File ~\AppData\Local\Programs\Python\Python39\Scripts\Image-Classification-Using-Vision-transformer-main\going_modular\going_modular\engine.py:169, in train(model, train_dataloader, test_dataloader, optimizer, loss_fn, epochs, device)
    167 # Loop through training and testing steps for a number of epochs
    168 for epoch in tqdm(range(epochs)):
--> 169     train_loss, train_acc = train_step(model=model,
    170                                        dataloader=train_dataloader,
    171                                        loss_fn=loss_fn,
    172                                        optimizer=optimizer,
    173                                        device=device)
    174     test_loss, test_acc = test_step(model=model,
    175                                     dataloader=test_dataloader,
    176                                     loss_fn=loss_fn,
    177                                     device=device)
    179     # Print out what's happening

File ~\AppData\Local\Programs\Python\Python39\Scripts\Image-Classification-Using-Vision-transformer-main\going_modular\going_modular\engine.py:45, in train_step(model, dataloader, loss_fn, optimizer, device)
     42 X, y = X.to(device), y.to(device)
     44 # 1. Forward pass
---> 45 y_pred = model(X)
     47 # 2. Calculate and accumulate loss
     48 loss = loss_fn(y_pred, y)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

Cell In[94], line 63, in Vit.forward(self, x)
     61 # calculate class token
     62 class_token = self.class_token.expand(x.shape[0], -1, -1)
---> 63 x = torch.cat((class_token, x), dim=1)
     65 # calculate positional embeddings
     66 row_embeddings = self.row_embeddings.repeat(1, x.shape[0], 1)

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1024 but got size 64 for tensor number 1 in the list.
python machine-learning pytorch typeerror huggingface-transformers
2 Answers

0 votes

I don't know the architecture you are using, but it is very common to work with square input images. Even though it looks strange to a human, it makes no difference to a machine that is trained on distorted images and is fed images with the same distortion at inference time.

I assume your framework does not support rectangular inputs.

You can make the input square by padding the image with black bars. This can easily be done in your data loader:

import numpy as np

# just for demonstration
rect = np.ones((128, 256, 3))     # rectangular image (height x width x channels)
square = np.zeros((256, 256, 3))  # square canvas
square[0:128, :, :] = rect        # place the rect into the top of the square

Of course, you need to preprocess your images the same way at inference time.
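If the padding should live in the transform pipeline instead, a torchvision version of the same idea might look like this (a sketch, assuming the 128x256 height-by-width inputs from the question):

from torchvision import transforms

pad_to_square = transforms.Compose([
    transforms.Resize((128, 256)),
    # pad 64 px of black above and below: (left, top, right, bottom) -> 256x256
    transforms.Pad((0, 64, 0, 64), fill=0),
    transforms.ToTensor(),
])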

That said, this also makes the model unnecessarily large. So don't worry, just use your distorted images.


0 votes

You are using ViT from the vit_pytorch module. This is the ViT class definition from the GitHub source:
class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)

        # more code follows that we dont care about ...

And this is the definition of pair():
def pair(t):
    return t if isinstance(t, tuple) else (t, t)
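So both calling conventions normalize to a (height, width) tuple, for example:

pair(16)        # -> (16, 16)
pair((16, 32))  # -> (16, 32)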

This is taken from the documentation:

image_size: if you have rectangular images, make sure your image size is the maximum of the width and height.

patch_size: size of the patches. image_size must be divisible by patch_size.

The number of patches is n = (image_size // patch_size) ** 2 and n must be greater than 16. In your case, choose 16 (larger model) or 32 (smaller model).

As you can see, ViT can handle image_size either as a tuple (height first, then width) or as a single number; the pair function handles the latter case by duplicating the value into a tuple.

Only one line of code ever uses image_size, namely the one that computes the number of tokens. That calculation is straightforward: it divides image_height by patch_height, divides image_width by patch_width, and multiplies the two results.
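For the sizes in the question, that works out to:

image_height, image_width = 128, 256   # image_size=(128, 256)
patch_height, patch_width = 16, 16     # patch_size=16 -> pair(16) == (16, 16)

num_patches = (image_height // patch_height) * (image_width // patch_width)
print(num_patches)  # 8 * 16 = 128 tokens (plus the class token)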

So, back to your question: the documentation tells a different story than the code. Reading the code, the situation is clear, and I would recommend passing a tuple of (image_height, image_width) to the image_size argument of ViT().
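Put together, a minimal sketch of that call (the hyperparameter values below are placeholders, not recommendations):

import torch
from vit_pytorch import ViT

model = ViT(
    image_size=(128, 256),   # (height, width) tuple instead of a single int
    patch_size=(16, 32),     # may also be a tuple; each image dim must be divisible by the patch dim
    num_classes=2,
    dim=768,
    depth=12,
    heads=12,
    mlp_dim=3072,
)

x = torch.randn(1, 3, 128, 256)  # (batch, channels, height, width)
print(model(x).shape)            # torch.Size([1, 2])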
