How to print a model summary of Grounding DINO

Problem description · votes: 0 · answers: 1

I'm trying to get a model summary of Grounding DINO. I tried to do this with the torch-summary library, but I'm having trouble specifying the correct input size, which is required to call the summary function.

Since Grounding DINO is a multimodal model (it takes an (image, text) pair as input), I'm struggling to figure out what input size and format I should pass to the summary function.
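For context, this is roughly how the repository's own inference helper feeds an (image, text) pair to the model (adapted from the GroundingDINO README; the image path, prompt, and thresholds below are placeholders):

from groundingdino.util.inference import load_model, load_image, predict

model = load_model(CONFIG_PATH, WEIGHTS_PATH)
image_source, image = load_image("path/to/image.jpg")  # image: transformed (3, H, W) tensor

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="chair . person . dog .",
    box_threshold=0.35,
    text_threshold=0.25,
)

My own attempt with torch-summary looked like this: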

import torch  # needed for torch.device below
from groundingdino.util.inference import load_model
from torchsummary import summary

model = load_model(CONFIG_PATH, WEIGHTS_PATH)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
summary(model, input_size)  # what should input_size be?

I tried passing the following as the input_size argument:

  • the image size alone (e.g. (3, 224, 224))
  • a list containing the image size and the text prompt (e.g. [(3, 224, 224), 'some text'])
  • the image size extended with an extra integer that might represent the length of the text input (e.g. (3, 224, 224, 10))
  • a list containing the image size and an integer that might represent the length of the text input (e.g. [(3, 224, 224), 10] and [(3, 224, 224), (10,)])

But all of these attempts resulted in errors. For example, here is what happens with the first one:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in summary(model, input_data, batch_dim, branching, col_names, col_width, depth, device, dtypes, verbose, *args, **kwargs)
    139             with torch.no_grad():
--> 140                 _ = model.to(device)(*x, *args, **kwargs)  # type: ignore[misc]
    141         except Exception as e:

3 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used

/content/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py in forward(self, samples, targets, **kw)
    242         if targets is None:
--> 243             captions = kw["captions"]
    244         else:

KeyError: 'captions'

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-9-f3881fbb51d4> in <cell line: 9>()
      7 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
      8 model = model.to(device)
----> 9 summary(model, (3, 224, 224))

/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in summary(model, input_data, batch_dim, branching, col_names, col_width, depth, device, dtypes, verbose, *args, **kwargs)
    141         except Exception as e:
    142             executed_layers = [layer for layer in summary_list if layer.executed]
--> 143             raise RuntimeError(
    144                 "Failed to run torchsummary. See above stack traces for more details. "
    145                 "Executed layers up to: {}".format(executed_layers)

RuntimeError: Failed to run torchsummary. See above stack traces for more details. Executed layers up to: []

It seems the model requires some keyword input (captions). I tried looking at the prediction-related code in the GitHub repository, and at the model's forward method, but I still couldn't solve the problem.
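Since the call site in the traceback (line 140, model.to(device)(*x, *args, **kwargs)) forwards any extra keyword arguments straight into the model's forward(), one could presumably supply the missing captions keyword alongside a dummy image tensor. A minimal sketch (the batch shape and caption format are assumptions, and the forward pass may still fail further in for other reasons):

import torch
from torchsummary import summary

dummy_image = torch.randn(1, 3, 224, 224)  # hypothetical batched RGB input
summary(model, input_data=dummy_image, captions=["some text ."])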

The torch-summary documentation also states that the input data itself can be passed in place of the input size, letting the function infer what it needs in order to print the summary, so I tried the following:

from groundingdino.util.inference import load_image
image_source, image = load_image(IMG_PATH)  # image is the transformed (3, H, W) tensor
caption = 'some text'
summary(model, image, caption)

But it produced this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in summary(model, input_data, batch_dim, branching, col_names, col_width, depth, device, dtypes, verbose, *args, **kwargs)
    139             with torch.no_grad():
--> 140                 _ = model.to(device)(*x, *args, **kwargs)  # type: ignore[misc]
    141         except Exception as e:

4 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used

/content/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py in forward(self, samples, targets, **kw)
    244         else:
--> 245             captions = [t["caption"] for t in targets]
    246 

/content/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py in <listcomp>(.0)
    244         else:
--> 245             captions = [t["caption"] for t in targets]
    246 

TypeError: string indices must be integers

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-7-a5f3a38c6e5a> in <cell line: 9>()
      7 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
      8 model = model.to(device)
----> 9 summary(model, image, TEXT_PROMPT)

/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in summary(model, input_data, batch_dim, branching, col_names, col_width, depth, device, dtypes, verbose, *args, **kwargs)
    141         except Exception as e:
    142             executed_layers = [layer for layer in summary_list if layer.executed]
--> 143             raise RuntimeError(
    144                 "Failed to run torchsummary. See above stack traces for more details. "
    145                 "Executed layers up to: {}".format(executed_layers)

RuntimeError: Failed to run torchsummary. See above stack traces for more details. Executed layers up to: []

Apparently the positional caption string got bound to the forward method's targets parameter, whose elements are then indexed with t["caption"], hence the TypeError. As a last attempt,

summary(model, {'image':image, 'captions':[caption]})

generated the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-5139855142c9> in <cell line: 9>()
      7 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
      8 model = model.to(device)
----> 9 summary(model, {'image':image, 'captions':[caption]})

1 frames
/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in summary(model, input_data, batch_dim, branching, col_names, col_width, depth, device, dtypes, verbose, *args, **kwargs)
    134             device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    135 
--> 136         x, input_size = process_input_data(input_data, batch_dim, device, dtypes)
    137         args, kwargs = set_device(args, device), set_device(kwargs, device)
    138         try:

/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in process_input_data(input_data, batch_dim, device, dtypes)
    217 
    218     else:
--> 219         raise TypeError(
    220             "Input type is not recognized. Please ensure input_data is valid.\n"
    221             "For multiple inputs to the network, ensure input_data passed in is "

TypeError: Input type is not recognized. Please ensure input_data is valid.
For multiple inputs to the network, ensure input_data passed in is a sequence of tensors or a list of tuple sizes. If you are having trouble here, please submit a GitHub issue.

The message spells out what torchsummary accepts as input_data: tensors (or sequences of tensors) and lists of size tuples, so neither a dict nor a bare string can pass its validation. So, my question is: how do I find the correct input size and format to pass to the summary function? Or, more generally, how can I obtain a summary of this kind of model? (Not necessarily with torch-summary, but I need the same information that this library provides.)

Thanks in advance to anyone who can help me with this.

P.S. I'm not sure whether it helps, but this is the output of print(model):

GroundingDINO(
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-5): 6 x DeformableTransformerEncoderLayer(
          (self_attn): MultiScaleDeformableAttention(
            (sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
            (attention_weights): Linear(in_features=256, out_features=128, bias=True)
            (value_proj): Linear(in_features=256, out_features=256, bias=True)
            (output_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (dropout1): Dropout(p=0.0, inplace=False)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout2): Dropout(p=0.0, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (dropout3): Dropout(p=0.0, inplace=False)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        )
      )
      (text_layers): ModuleList(
        (0-5): 6 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=1024, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
          (linear2): Linear(in_features=1024, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.0, inplace=False)
          (dropout2): Dropout(p=0.0, inplace=False)
        )
      )
      (fusion_layers): ModuleList(
        (0-5): 6 x BiAttentionBlock(
          (layer_norm_v): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (layer_norm_l): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (attn): BiMultiHeadAttention(
            (v_proj): Linear(in_features=256, out_features=1024, bias=True)
            (l_proj): Linear(in_features=256, out_features=1024, bias=True)
            (values_v_proj): Linear(in_features=256, out_features=1024, bias=True)
            (values_l_proj): Linear(in_features=256, out_features=1024, bias=True)
            (out_v_proj): Linear(in_features=1024, out_features=256, bias=True)
            (out_l_proj): Linear(in_features=1024, out_features=256, bias=True)
          )
          (drop_path): DropPath(drop_prob=0.100)
        )
      )
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0-5): 6 x DeformableTransformerDecoderLayer(
          (cross_attn): MultiScaleDeformableAttention(
            (sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
            (attention_weights): Linear(in_features=256, out_features=128, bias=True)
            (value_proj): Linear(in_features=256, out_features=256, bias=True)
            (output_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (dropout1): Identity()
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (ca_text): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (catext_dropout): Identity()
          (catext_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (dropout2): Identity()
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout3): Identity()
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (dropout4): Identity()
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        )
      )
      (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (ref_point_head): MLP(
        (layers): ModuleList(
          (0): Linear(in_features=512, out_features=256, bias=True)
          (1): Linear(in_features=256, out_features=256, bias=True)
        )
      )
      (bbox_embed): ModuleList(
        (0-5): 6 x MLP(
          (layers): ModuleList(
            (0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
            (2): Linear(in_features=256, out_features=4, bias=True)
          )
        )
      )
      (class_embed): ModuleList(
        (0-5): 6 x ContrastiveEmbed()
      )
    )
    (tgt_embed): Embedding(900, 256)
    (enc_output): Linear(in_features=256, out_features=256, bias=True)
    (enc_output_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (enc_out_bbox_embed): MLP(
      (layers): ModuleList(
        (0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
        (2): Linear(in_features=256, out_features=4, bias=True)
      )
    )
    (enc_out_class_embed): ContrastiveEmbed()
  )
  (bert): BertModelWarper(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (feat_map): Linear(in_features=768, out_features=256, bias=True)
  (input_proj): ModuleList(
    (0): Sequential(
      (0): Conv2d(192, 256, kernel_size=(1, 1), stride=(1, 1))
      (1): GroupNorm(32, 256, eps=1e-05, affine=True)
    )
    (1): Sequential(
      (0): Conv2d(384, 256, kernel_size=(1, 1), stride=(1, 1))
      (1): GroupNorm(32, 256, eps=1e-05, affine=True)
    )
    (2): Sequential(
      (0): Conv2d(768, 256, kernel_size=(1, 1), stride=(1, 1))
      (1): GroupNorm(32, 256, eps=1e-05, affine=True)
    )
    (3): Sequential(
      (0): Conv2d(768, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (1): GroupNorm(32, 256, eps=1e-05, affine=True)
    )
  )
  (backbone): Joiner(
    (0): SwinTransformer(
      (patch_embed): PatchEmbed(
        (proj): Conv2d(3, 96, kernel_size=(4, 4), stride=(4, 4))
        (norm): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
      )
      (pos_drop): Dropout(p=0.0, inplace=False)
      (layers): ModuleList(
        (0): BasicLayer(
          (blocks): ModuleList(
            (0): SwinTransformerBlock(
              (norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=96, out_features=288, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=96, out_features=96, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): Identity()
              (norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=96, out_features=384, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=384, out_features=96, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
            (1): SwinTransformerBlock(
              (norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=96, out_features=288, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=96, out_features=96, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.018)
              (norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=96, out_features=384, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=384, out_features=96, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
          )
          (downsample): PatchMerging(
            (reduction): Linear(in_features=384, out_features=192, bias=False)
            (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          )
        )
        (1): BasicLayer(
          (blocks): ModuleList(
            (0): SwinTransformerBlock(
              (norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=192, out_features=576, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=192, out_features=192, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.036)
              (norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=192, out_features=768, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=768, out_features=192, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
            (1): SwinTransformerBlock(
              (norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=192, out_features=576, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=192, out_features=192, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.055)
              (norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=192, out_features=768, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=768, out_features=192, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
          )
          (downsample): PatchMerging(
            (reduction): Linear(in_features=768, out_features=384, bias=False)
            (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
        )
        (2): BasicLayer(
          (blocks): ModuleList(
            (0): SwinTransformerBlock(
              (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=384, out_features=1152, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=384, out_features=384, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.073)
              (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=384, out_features=1536, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=1536, out_features=384, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
            (1): SwinTransformerBlock(
              (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=384, out_features=1152, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=384, out_features=384, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.091)
              (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=384, out_features=1536, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=1536, out_features=384, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
            (2): SwinTransformerBlock(
              (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=384, out_features=1152, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=384, out_features=384, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.109)
              (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=384, out_features=1536, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=1536, out_features=384, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
            (3): SwinTransformerBlock(
              (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=384, out_features=1152, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=384, out_features=384, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.127)
              (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=384, out_features=1536, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=1536, out_features=384, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
            (4): SwinTransformerBlock(
              (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=384, out_features=1152, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=384, out_features=384, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.145)
              (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=384, out_features=1536, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=1536, out_features=384, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
            (5): SwinTransformerBlock(
              (norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=384, out_features=1152, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=384, out_features=384, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.164)
              (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=384, out_features=1536, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=1536, out_features=384, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
          )
          (downsample): PatchMerging(
            (reduction): Linear(in_features=1536, out_features=768, bias=False)
            (norm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
          )
        )
        (3): BasicLayer(
          (blocks): ModuleList(
            (0): SwinTransformerBlock(
              (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=768, out_features=2304, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=768, out_features=768, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.182)
              (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=768, out_features=3072, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=3072, out_features=768, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
            (1): SwinTransformerBlock(
              (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (attn): WindowAttention(
                (qkv): Linear(in_features=768, out_features=2304, bias=True)
                (attn_drop): Dropout(p=0.0, inplace=False)
                (proj): Linear(in_features=768, out_features=768, bias=True)
                (proj_drop): Dropout(p=0.0, inplace=False)
                (softmax): Softmax(dim=-1)
              )
              (drop_path): DropPath(drop_prob=0.200)
              (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (mlp): Mlp(
                (fc1): Linear(in_features=768, out_features=3072, bias=True)
                (act): GELU(approximate='none')
                (fc2): Linear(in_features=3072, out_features=768, bias=True)
                (drop): Dropout(p=0.0, inplace=False)
              )
            )
          )
        )
      )
      (norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      (norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
    (1): PositionEmbeddingSineHW()
  )
  (bbox_embed): ModuleList(
    (0-5): 6 x MLP(
      (layers): ModuleList(
        (0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
        (2): Linear(in_features=256, out_features=4, bias=True)
      )
    )
  )
  (class_embed): ModuleList(
    (0-5): 6 x ContrastiveEmbed()
  )
)
deep-learning pytorch computer-vision transformer-model zeroshot-classification
1 Answer

0 votes

You can print the model summary directly, without providing an input size:

from torchsummary import summary
summary(model)  # with no input data, torch-summary reports parameter counts only (no output shapes)

Output:
=====================================================================================
Layer (type:depth-idx)                                       Param #
=====================================================================================
├─Transformer: 1-1                                           --
|    └─TransformerEncoder: 2-1                               --
|    |    └─ModuleList: 3-1                                  7,693,056
|    |    └─ModuleList: 3-2                                  4,738,560
|    |    └─ModuleList: 3-3                                  9,474,048
|    └─TransformerDecoder: 2-2                               --
|    |    └─ModuleList: 3-4                                  10,857,216
|    |    └─LayerNorm: 3-5                                   512
|    |    └─MLP: 3-6                                         197,120
|    |    └─ModuleList: 3-7                                  132,612
|    |    └─ModuleList: 3-8                                  --
|    └─Embedding: 2-3                                        230,400
|    └─Linear: 2-4                                           65,792
|    └─LayerNorm: 2-5                                        512
|    └─MLP: 2-6                                              --
|    |    └─ModuleList: 3-9                                  132,612
|    └─ContrastiveEmbed: 2-7                                 --
├─BertModelWarper: 1-2                                       --
|    └─BertEmbeddings: 2-8                                   --
|    |    └─Embedding: 3-10                                  23,440,896
|    |    └─Embedding: 3-11                                  393,216
|    |    └─Embedding: 3-12                                  1,536
|    |    └─LayerNorm: 3-13                                  1,536
|    |    └─Dropout: 3-14                                    --
|    └─BertEncoder: 2-9                                      --
|    |    └─ModuleList: 3-15                                 85,054,464
|    └─BertPooler: 2-10                                      --
|    |    └─Linear: 3-16                                     (590,592)
|    |    └─Tanh: 3-17                                       --
├─Linear: 1-3                                                196,864
├─ModuleList: 1-4                                            --
|    └─Sequential: 2-11                                      --
|    |    └─Conv2d: 3-18                                     49,408
|    |    └─GroupNorm: 3-19                                  512
|    └─Sequential: 2-12                                      --
|    |    └─Conv2d: 3-20                                     98,560
|    |    └─GroupNorm: 3-21                                  512
|    └─Sequential: 2-13                                      --
|    |    └─Conv2d: 3-22                                     196,864
|    |    └─GroupNorm: 3-23                                  512
|    └─Sequential: 2-14                                      --
|    |    └─Conv2d: 3-24                                     1,769,728
|    |    └─GroupNorm: 3-25                                  512
├─Joiner: 1-5                                                --
|    └─SwinTransformer: 2-15                                 --
|    |    └─PatchEmbed: 3-26                                 4,896
|    |    └─Dropout: 3-27                                    --
|    |    └─ModuleList: 3-28                                 27,512,922
|    |    └─LayerNorm: 3-29                                  384
|    |    └─LayerNorm: 3-30                                  768
|    |    └─LayerNorm: 3-31                                  1,536
|    └─PositionEmbeddingSineHW: 2-16                         --
├─ModuleList: 1-6                                            (recursive)
|    └─MLP: 2-17                                             (recursive)
|    |    └─ModuleList: 3-32                                 132,612
├─ModuleList: 1-7                                            --
|    └─ContrastiveEmbed: 2-18                                --
=====================================================================================
Total params: 172,971,270
Trainable params: 172,380,678
Non-trainable params: 590,592
=====================================================================================
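If you need the headline numbers without torch-summary, they can be reproduced directly from the model's parameters (a minimal sketch; the totals can differ slightly from torch-summary's when modules share parameters, which its table marks as "(recursive)"):

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total params: {total:,}")
print(f"Trainable params: {trainable:,}")
print(f"Non-trainable params: {total - trainable:,}")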