I am trying to get a model summary of Grounding DINO. I tried to do this with the torch-summary library, but I am having trouble specifying the correct input size, which is required when calling the summary function.
Since Grounding DINO is a multimodal model (it takes an (image, text) pair as input), I cannot figure out what input size and format I should pass to the summary function.
import torch
from groundingdino.util.inference import load_model
from torchsummary import summary

# CONFIG_PATH and WEIGHTS_PATH point to the repo's model config and checkpoint.
model = load_model(CONFIG_PATH, WEIGHTS_PATH)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
summary(model, input_size)
I tried passing various shapes as the input_size argument (the first being (3, 224, 224)), but every attempt resulted in an error. For example, here is what happened on the first attempt:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in summary(model, input_data, batch_dim, branching, col_names, col_width, depth, device, dtypes, verbose, *args, **kwargs)
139 with torch.no_grad():
--> 140 _ = model.to(device)(*x, *args, **kwargs) # type: ignore[misc]
141 except Exception as e:
3 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
/content/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py in forward(self, samples, targets, **kw)
242 if targets is None:
--> 243 captions = kw["captions"]
244 else:
KeyError: 'captions'
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-9-f3881fbb51d4> in <cell line: 9>()
7 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
8 model = model.to(device)
----> 9 summary(model, (3, 224, 224))
/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in summary(model, input_data, batch_dim, branching, col_names, col_width, depth, device, dtypes, verbose, *args, **kwargs)
141 except Exception as e:
142 executed_layers = [layer for layer in summary_list if layer.executed]
--> 143 raise RuntimeError(
144 "Failed to run torchsummary. See above stack traces for more details. "
145 "Executed layers up to: {}".format(executed_layers)
RuntimeError: Failed to run torchsummary. See above stack traces for more details. Executed layers up to: []
It seems the model requires some keyword input (captions). I tried looking at the prediction-related code in the GitHub repository, as well as at the model's forward method, but I still could not solve the problem.
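From what I can tell, the predict() helper in groundingdino/util/inference.py satisfies this requirement by batching the image and supplying the text through the captions keyword. A minimal sketch of that call (untested on my side, and assuming image is the preprocessed tensor returned by load_image) would be:

import torch

# Batch the single image and pass the caption as a keyword argument,
# which GroundingDINO.forward reads as kw["captions"].
# (The repo's helpers lower-case the caption and ensure it ends with ".")
with torch.no_grad():
    outputs = model(image[None].to(device), captions=["some text ."])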
The torch-summary documentation also states that actual input data can be passed instead of an input size, letting the function infer what it needs to print the summary, so I tried the following:
from groundingdino.util.inference import load_image

image_source, image = load_image(IMG_PATH)
caption = 'some text'
summary(model, image, caption)
But it produced this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in summary(model, input_data, batch_dim, branching, col_names, col_width, depth, device, dtypes, verbose, *args, **kwargs)
139 with torch.no_grad():
--> 140 _ = model.to(device)(*x, *args, **kwargs) # type: ignore[misc]
141 except Exception as e:
4 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
/content/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py in forward(self, samples, targets, **kw)
244 else:
--> 245 captions = [t["caption"] for t in targets]
246
/content/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py in <listcomp>(.0)
244 else:
--> 245 captions = [t["caption"] for t in targets]
246
TypeError: string indices must be integers
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-7-a5f3a38c6e5a> in <cell line: 9>()
7 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
8 model = model.to(device)
----> 9 summary(model, image, TEXT_PROMPT)
/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in summary(model, input_data, batch_dim, branching, col_names, col_width, depth, device, dtypes, verbose, *args, **kwargs)
141 except Exception as e:
142 executed_layers = [layer for layer in summary_list if layer.executed]
--> 143 raise RuntimeError(
144 "Failed to run torchsummary. See above stack traces for more details. "
145 "Executed layers up to: {}".format(executed_layers)
RuntimeError: Failed to run torchsummary. See above stack traces for more details. Executed layers up to: []
I also tried passing the inputs as a dictionary:

summary(model, {'image': image, 'captions': [caption]})

which generated the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-5139855142c9> in <cell line: 9>()
7 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
8 model = model.to(device)
----> 9 summary(model, {'image':image, 'captions':[caption]})
1 frames
/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in summary(model, input_data, batch_dim, branching, col_names, col_width, depth, device, dtypes, verbose, *args, **kwargs)
134 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
135
--> 136 x, input_size = process_input_data(input_data, batch_dim, device, dtypes)
137 args, kwargs = set_device(args, device), set_device(kwargs, device)
138 try:
/usr/local/lib/python3.10/dist-packages/torchsummary/torchsummary.py in process_input_data(input_data, batch_dim, device, dtypes)
217
218 else:
--> 219 raise TypeError(
220 "Input type is not recognized. Please ensure input_data is valid.\n"
221 "For multiple inputs to the network, ensure input_data passed in is "
TypeError: Input type is not recognized. Please ensure input_data is valid.
For multiple inputs to the network, ensure input_data passed in is a sequence of tensors or a list of tuple sizes. If you are having trouble here, please submit a GitHub issue.
So, my question is: how do I find the correct input size and format to pass to the summary function? Or, more generally, how can I get a summary of a model like this? (Not necessarily with torch-summary, but I need the same information this library provides.)
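(For what it is worth, I know the raw totals can be computed without any summary library; what I am missing is the per-layer breakdown. A plain-PyTorch count, for reference:)

# Count parameters directly from the model, no summary library required.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total={total:,}  trainable={trainable:,}")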
Thanks in advance to anyone who can help me with this.
P.S. I am not sure whether it helps, but this is the output of print(model):
GroundingDINO(
(transformer): Transformer(
(encoder): TransformerEncoder(
(layers): ModuleList(
(0-5): 6 x DeformableTransformerEncoderLayer(
(self_attn): MultiScaleDeformableAttention(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
(text_layers): ModuleList(
(0-5): 6 x TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.0, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
)
)
(fusion_layers): ModuleList(
(0-5): 6 x BiAttentionBlock(
(layer_norm_v): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(layer_norm_l): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn): BiMultiHeadAttention(
(v_proj): Linear(in_features=256, out_features=1024, bias=True)
(l_proj): Linear(in_features=256, out_features=1024, bias=True)
(values_v_proj): Linear(in_features=256, out_features=1024, bias=True)
(values_l_proj): Linear(in_features=256, out_features=1024, bias=True)
(out_v_proj): Linear(in_features=1024, out_features=256, bias=True)
(out_l_proj): Linear(in_features=1024, out_features=256, bias=True)
)
(drop_path): DropPath(drop_prob=0.100)
)
)
)
(decoder): TransformerDecoder(
(layers): ModuleList(
(0-5): 6 x DeformableTransformerDecoderLayer(
(cross_attn): MultiScaleDeformableAttention(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Identity()
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ca_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(catext_dropout): Identity()
(catext_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout2): Identity()
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout3): Identity()
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(dropout4): Identity()
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ref_point_head): MLP(
(layers): ModuleList(
(0): Linear(in_features=512, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
)
)
(bbox_embed): ModuleList(
(0-5): 6 x MLP(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
)
(class_embed): ModuleList(
(0-5): 6 x ContrastiveEmbed()
)
)
(tgt_embed): Embedding(900, 256)
(enc_output): Linear(in_features=256, out_features=256, bias=True)
(enc_output_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(enc_out_bbox_embed): MLP(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
(enc_out_class_embed): ContrastiveEmbed()
)
(bert): BertModelWarper(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0-11): 12 x BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(feat_map): Linear(in_features=768, out_features=256, bias=True)
(input_proj): ModuleList(
(0): Sequential(
(0): Conv2d(192, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(1): Sequential(
(0): Conv2d(384, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(2): Sequential(
(0): Conv2d(768, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(3): Sequential(
(0): Conv2d(768, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
)
(backbone): Joiner(
(0): SwinTransformer(
(patch_embed): PatchEmbed(
(proj): Conv2d(3, 96, kernel_size=(4, 4), stride=(4, 4))
(norm): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
)
(pos_drop): Dropout(p=0.0, inplace=False)
(layers): ModuleList(
(0): BasicLayer(
(blocks): ModuleList(
(0): SwinTransformerBlock(
(norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=96, out_features=288, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=96, out_features=96, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): Identity()
(norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=96, out_features=384, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=384, out_features=96, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): SwinTransformerBlock(
(norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=96, out_features=288, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=96, out_features=96, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.018)
(norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=96, out_features=384, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=384, out_features=96, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
(reduction): Linear(in_features=384, out_features=192, bias=False)
(norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
)
)
(1): BasicLayer(
(blocks): ModuleList(
(0): SwinTransformerBlock(
(norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=192, out_features=576, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=192, out_features=192, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.036)
(norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=192, out_features=768, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=768, out_features=192, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): SwinTransformerBlock(
(norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=192, out_features=576, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=192, out_features=192, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.055)
(norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=192, out_features=768, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=768, out_features=192, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
(reduction): Linear(in_features=768, out_features=384, bias=False)
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(2): BasicLayer(
(blocks): ModuleList(
(0): SwinTransformerBlock(
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.073)
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): SwinTransformerBlock(
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.091)
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(2): SwinTransformerBlock(
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.109)
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(3): SwinTransformerBlock(
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.127)
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(4): SwinTransformerBlock(
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.145)
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(5): SwinTransformerBlock(
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=384, out_features=1152, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.164)
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
(reduction): Linear(in_features=1536, out_features=768, bias=False)
(norm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
)
)
(3): BasicLayer(
(blocks): ModuleList(
(0): SwinTransformerBlock(
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.182)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): SwinTransformerBlock(
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath(drop_prob=0.200)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
)
)
(norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(1): PositionEmbeddingSineHW()
)
(bbox_embed): ModuleList(
(0-5): 6 x MLP(
(layers): ModuleList(
(0-1): 2 x Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
)
(class_embed): ModuleList(
(0-5): 6 x ContrastiveEmbed()
)
)
The model summary can be printed directly, without specifying an input size:
from torchsummary import summary
summary(model)
Output:
=====================================================================================
Layer (type:depth-idx) Param #
=====================================================================================
├─Transformer: 1-1 --
| └─TransformerEncoder: 2-1 --
| | └─ModuleList: 3-1 7,693,056
| | └─ModuleList: 3-2 4,738,560
| | └─ModuleList: 3-3 9,474,048
| └─TransformerDecoder: 2-2 --
| | └─ModuleList: 3-4 10,857,216
| | └─LayerNorm: 3-5 512
| | └─MLP: 3-6 197,120
| | └─ModuleList: 3-7 132,612
| | └─ModuleList: 3-8 --
| └─Embedding: 2-3 230,400
| └─Linear: 2-4 65,792
| └─LayerNorm: 2-5 512
| └─MLP: 2-6 --
| | └─ModuleList: 3-9 132,612
| └─ContrastiveEmbed: 2-7 --
├─BertModelWarper: 1-2 --
| └─BertEmbeddings: 2-8 --
| | └─Embedding: 3-10 23,440,896
| | └─Embedding: 3-11 393,216
| | └─Embedding: 3-12 1,536
| | └─LayerNorm: 3-13 1,536
| | └─Dropout: 3-14 --
| └─BertEncoder: 2-9 --
| | └─ModuleList: 3-15 85,054,464
| └─BertPooler: 2-10 --
| | └─Linear: 3-16 (590,592)
| | └─Tanh: 3-17 --
├─Linear: 1-3 196,864
├─ModuleList: 1-4 --
| └─Sequential: 2-11 --
| | └─Conv2d: 3-18 49,408
| | └─GroupNorm: 3-19 512
| └─Sequential: 2-12 --
| | └─Conv2d: 3-20 98,560
| | └─GroupNorm: 3-21 512
| └─Sequential: 2-13 --
| | └─Conv2d: 3-22 196,864
| | └─GroupNorm: 3-23 512
| └─Sequential: 2-14 --
| | └─Conv2d: 3-24 1,769,728
| | └─GroupNorm: 3-25 512
├─Joiner: 1-5 --
| └─SwinTransformer: 2-15 --
| | └─PatchEmbed: 3-26 4,896
| | └─Dropout: 3-27 --
| | └─ModuleList: 3-28 27,512,922
| | └─LayerNorm: 3-29 384
| | └─LayerNorm: 3-30 768
| | └─LayerNorm: 3-31 1,536
| └─PositionEmbeddingSineHW: 2-16 --
├─ModuleList: 1-6 (recursive)
| └─MLP: 2-17 (recursive)
| | └─ModuleList: 3-32 132,612
├─ModuleList: 1-7 --
| └─ContrastiveEmbed: 2-18 --
=====================================================================================
Total params: 172,971,270
Trainable params: 172,380,678
Non-trainable params: 590,592
=====================================================================================
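If you also need per-layer output shapes (which require an actual forward pass), one untested option follows from the tracebacks above: line 140 of torchsummary forwards extra keyword arguments straight to the model, so the captions requirement can be met as a keyword instead of positionally. A hedged sketch, assuming image is the tensor returned by load_image:

# Pass the batched image as input_data and the caption via **kwargs,
# which torchsummary hands to GroundingDINO.forward as kw["captions"].
summary(model, input_data=image[None], captions=[caption])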