I'm trying to understand the inner workings of models trained with spaCy, which are Thinc models. Working from this tutorial, I'm modifying the model to see what breaks and what works. Instead of tagging, I've adapted it to an NER dataset with 16 classes. I want to add a few layers after the TransformersTokenizer + Transformer layers outlined in the tutorial, but I'm getting a slew of dimension ValueErrors. It also matters to me that the TransformersTagger layer outputs the last hidden layer of the given Transformer model, and I'm not sure what this code is doing. Here's the error I get:
ValueError: Attempt to change dimension 'nI' for model 'linear' from 512 to 16
Here's my full code adaptation so far. To be fair, I don't love having a Softmax(num_ner_classes) before the Linear() layer, but I couldn't get with_array() to work with anything else after the Transformer layer:
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

import numpy
import thinc
import torch
from thinc.api import (
    ArgsKwargs, Linear, Model, PyTorchWrapper, Softmax, chain,
    torch2xp, with_array, xp2torch,
)
from thinc.types import Floats2d, Ints1d
from transformers import AutoModel, AutoTokenizer
from transformers.tokenization_utils_base import BatchEncoding, TokenSpan


@dataclass
class TokensPlus:
    batch_size: int
    tok2wp: List[Ints1d]
    input_ids: torch.Tensor
    token_type_ids: torch.Tensor
    attention_mask: torch.Tensor

    def __init__(self, inputs: List[List[str]], wordpieces: BatchEncoding):
        self.input_ids = wordpieces["input_ids"]
        self.attention_mask = wordpieces["attention_mask"]
        self.token_type_ids = wordpieces["token_type_ids"]
        self.batch_size = self.input_ids.shape[0]
        self.tok2wp = []
        for i in range(self.batch_size):
            spans = [wordpieces.word_to_tokens(i, j) for j in range(len(inputs[i]))]
            self.tok2wp.append(self.get_wp_starts(spans))

    def get_wp_starts(self, spans: List[Optional[TokenSpan]]) -> Ints1d:
        """Calculate an alignment mapping each token index to its first wordpiece."""
        alignment = numpy.zeros((len(spans)), dtype="i")
        for i, span in enumerate(spans):
            if span is None:
                raise ValueError(
                    "Token did not align to any wordpieces. Was the tokenizer "
                    "run with is_split_into_words=True?"
                )
            alignment[i] = span.start
        return alignment
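To convince myself what get_wp_starts computes, here's a tiny self-contained sketch with mock spans (the TokenSpan namedtuple stands in for the transformers one, and the indices are made up):

```python
import numpy
from collections import namedtuple

# Stand-in for transformers' TokenSpan: (start, end) wordpiece indices.
TokenSpan = namedtuple("TokenSpan", ["start", "end"])

# Say wordpiece 0 is [CLS]; token 0 spans wordpieces 1-3, token 1 is wordpiece 4.
spans = [TokenSpan(1, 4), TokenSpan(4, 5)]

# Same logic as get_wp_starts: record each token's first wordpiece index.
alignment = numpy.zeros(len(spans), dtype="i")
for i, span in enumerate(spans):
    alignment[i] = span.start
print(alignment)  # [1 4]
```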
@thinc.registry.layers("transformers_tokenizer.v1")
def TransformersTokenizer(name: str) -> Model[List[List[str]], TokensPlus]:
    def forward(model, inputs: List[List[str]], is_train: bool):
        tokenizer = model.attrs["tokenizer"]
        wordpieces = tokenizer(
            inputs,
            is_split_into_words=True,
            add_special_tokens=True,
            return_token_type_ids=True,
            return_attention_mask=True,
            return_length=True,
            return_tensors="pt",
            padding="longest",
        )
        return TokensPlus(inputs, wordpieces), lambda d_tokens: []

    return Model(
        "tokenizer", forward, attrs={"tokenizer": AutoTokenizer.from_pretrained(name)}
    )
def convert_transformer_inputs(model, tokens: TokensPlus, is_train):
    kwargs = {
        "input_ids": tokens.input_ids,
        "attention_mask": tokens.attention_mask,
        "token_type_ids": tokens.token_type_ids,
    }
    return ArgsKwargs(args=(), kwargs=kwargs), lambda dX: []
def convert_transformer_outputs(
    model: Model,
    inputs_outputs: Tuple[TokensPlus, Tuple[torch.Tensor]],
    is_train: bool,
) -> Tuple[List[Floats2d], Callable]:
    tplus, trf_outputs = inputs_outputs
    wp_vectors = torch2xp(trf_outputs[0])
    tokvecs = [wp_vectors[i, idx] for i, idx in enumerate(tplus.tok2wp)]

    def backprop(d_tokvecs: List[Floats2d]) -> ArgsKwargs:
        # Restore entries for BOS and EOS markers
        d_wp_vectors = model.ops.alloc3f(*trf_outputs[0].shape, dtype="f")
        for i, idx in enumerate(tplus.tok2wp):
            d_wp_vectors[i, idx] += d_tokvecs[i]
        return ArgsKwargs(
            args=(trf_outputs[0],),
            kwargs={"grad_tensors": xp2torch(d_wp_vectors)},
        )

    return tokvecs, backprop
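As far as I can tell, convert_transformer_outputs does a gather in the forward pass and the matching scatter in the backward pass. A small NumPy sketch with made-up sizes (1 sequence, 6 wordpieces, width 4):

```python
import numpy as np

wp_vectors = np.arange(24, dtype="f").reshape(1, 6, 4)  # mock transformer output
tok2wp = [np.array([1, 3, 4])]  # each token's first wordpiece (0 = [CLS], etc.)

# Forward: gather one wordpiece vector per token.
tokvecs = [wp_vectors[i, idx] for i, idx in enumerate(tok2wp)]

# Backward: scatter token gradients back onto their wordpiece rows;
# unselected rows (special tokens, continuation subwords) stay zero.
d_tokvecs = [np.ones((3, 4), dtype="f")]
d_wp_vectors = np.zeros_like(wp_vectors)
for i, idx in enumerate(tok2wp):
    d_wp_vectors[i, idx] += d_tokvecs[i]
```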
@thinc.registry.layers("transformers_encoder.v1")
def Transformer(name: str = "bert-large-cased") -> Model[TokensPlus, List[Floats2d]]:
    return PyTorchWrapper(
        AutoModel.from_pretrained(name),
        convert_inputs=convert_transformer_inputs,
        convert_outputs=convert_transformer_outputs,
    )
@thinc.registry.layers("TransformersNer.v1")
def TransformersNer(
    name: str, num_ner_classes: int = 16
) -> Model[List[List[str]], List[Floats2d]]:
    return chain(
        TransformersTokenizer(name),
        Transformer(name),
        with_array(Softmax(num_ner_classes)),
        Linear(512, 1024),
    )
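As far as I can tell, the complaint boils down to an incompatible matrix product between the 16-wide Softmax output and the Linear weights. A rough NumPy sketch of the shape conflict (illustrative dimensions, not Thinc's actual dimension-inference code):

```python
import numpy as np

x = np.zeros((10, 16))     # what with_array(Softmax(16)) emits for a batch
W = np.zeros((512, 1024))  # Linear(512, 1024): nO=512, nI=1024

try:
    y = x @ W.T  # (10, 16) @ (1024, 512): inner dimensions 16 != 1024
except ValueError:
    y = None     # the same kind of mismatch Thinc refuses at initialize time
```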
What's the best way to figure out how to pipe the output of the PyTorch-wrapped TransformersTagger layer into Linear() + more layers down the chain? I've been using this model visualization, but there are still a lot of (?, ?) entries even after I run model.initialize() on the first example of the data.
import pydot


def visualize_model(model):
    def get_label(layer):
        layer_name = layer.name
        nO = layer.get_dim("nO") if layer.has_dim("nO") else "?"
        nI = layer.get_dim("nI") if layer.has_dim("nI") else "?"
        # Escape ">" so it doesn't break the record-shaped label
        return f"{layer_name}|({nO}, {nI})".replace(">", "&gt;")

    dot = pydot.Dot()
    dot.set("rankdir", "LR")
    dot.set_node_defaults(shape="record", fontname="arial", fontsize="10")
    dot.set_edge_defaults(arrowsize="0.7")
    nodes = {}
    for i, layer in enumerate(model.layers):
        label = get_label(layer)
        node = pydot.Node(layer.id, label=label)
        dot.add_node(node)
        nodes[layer.id] = node
        if i == 0:
            continue
        from_node = nodes[model.layers[i - 1].id]
        to_node = nodes[layer.id]
        if not dot.get_edge(from_node, to_node):
            dot.add_edge(pydot.Edge(from_node, to_node))
    print(dot)
Output:
digraph G {
    rankdir=LR;
    node [fontname=arial, fontsize=10, shape=record];
    edge [arrowsize="0.7"];
    176 [label="tokenizer|(?, ?)"];
    177 [label="pytorch|(?, ?)"];
    176 -> 177;
    179 [label="with_array(softmax)|(16, 1024)"];
    177 -> 179;
    180 [label="linear|(512, 1024)"];
    179 -> 180;
}
OK, here's how to walk a Thinc model chain and inspect how the data shapes change between layers. For each layer in model.layers you can run it in isolation (see the snippet at the end) and look at the type and shape of its output. What I ended up doing was modifying convert_transformer_outputs to return a Ragged representation of tokvecs instead of a list of variable-length embeddings (since different sentences have different numbers of tokens):
...
tokvecs = [wp_vectors[i, idx] for i, idx in enumerate(tplus.tok2wp)]
tokvecs_ragged = list2ragged().predict(tokvecs).data
...
return tokvecs_ragged, backprop
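Conceptually, list2ragged just concatenates the per-sentence arrays and remembers the lengths. A minimal NumPy sketch (mock data, not Thinc's implementation):

```python
import numpy as np

# Two "sentences" with 3 and 5 tokens, hidden width 4.
tokvecs = [np.ones((3, 4), dtype="f"), np.ones((5, 4), dtype="f")]

# A Ragged is essentially one flat data array plus per-sequence lengths.
data = np.concatenate(tokvecs, axis=0)
lengths = np.array([len(t) for t in tokvecs], dtype="i")
print(data.shape, lengths)  # (8, 4) [3 5]
```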
Ragged is more memory-efficient than Padded, and it feeds straight into a Linear layer, so I no longer need with_array() at all (whose purpose is also to return numpy arrays). With that I get the shapes I want, and I can save the Softmax for the very end of the architecture (I plan to add more layers). It only took a little time in the docs; this was the most illuminating part (see the Chain combinator here).
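To put a number on the memory point: a Padded batch allocates every sequence out to the longest one, while a Ragged batch stores only the real rows. A back-of-the-envelope sketch with made-up lengths:

```python
lengths = [3, 50, 7]  # tokens per sentence in one batch
width = 1024          # hidden size

padded_cells = len(lengths) * max(lengths) * width  # every row padded to 50
ragged_cells = sum(lengths) * width                 # only real tokens stored
print(padded_cells, ragged_cells)  # 153600 61440
```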
tok = model.layers[0]  # Tokenizer
out, dX = tok(train_X[:5], is_train=False)
print(out.input_ids.shape)

trans = model.layers[1]  # Transformer
out, dY = trans(out, is_train=False)
print(out.shape)

lin = model.layers[2]  # Linear
out, dZ = lin(out, is_train=False)
print([el.shape for el in out])