Adding a linear layer example to a Thinc model - making sense of data dimensions through the model architecture

I'm trying to understand the inner workings of models trained with spaCy, which are Thinc models under the hood. Working through this tutorial, I'm modifying the model to see what breaks and what works. Instead of tagging, I'm adapting it to an NER dataset with 16 classes. I want to add a couple of layers after the TransformersTokenizer + Transformer layers outlined in the tutorial, but I keep getting dimension ValueErrors. It also matters to me that the TransformersTagger layer outputs the last hidden layer of the given transformer model, and I'm not sure that's what this code is doing. This is the error I'm getting:

ValueError: Attempt to change dimension 'nI' for model 'linear' from 512 to 16
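
As background, this ValueError comes from Thinc's dimension inference: when a chain is initialized, each layer's input width nI has to agree with the previous layer's output width nO, and once a dimension has been set it cannot be changed. Below is a minimal sketch, separate from the question's code and using plain Linear and Softmax layers whose widths do line up, showing how the inferred dimensions can be inspected after initialize():

from thinc.api import chain, Linear, Softmax
import numpy

# Linear maps 1024 -> 512 and Softmax maps 512 -> 16, so nO/nI agree at every joint.
model = chain(Linear(nO=512, nI=1024), Softmax(nO=16, nI=512))
model.initialize(X=numpy.zeros((2, 1024), dtype="f"))
for layer in model.layers:
    print(layer.name, layer.get_dim("nO"), layer.get_dim("nI"))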

Here is my full code adaptation so far. To be fair, I don't like having a Softmax(num_ner_classes) before the Linear() layer, but I couldn't get with_array() to work with anything else after the Transformer layer:

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

import numpy
import thinc
import torch
from thinc.api import (
    ArgsKwargs, Linear, Model, PyTorchWrapper, Softmax,
    chain, torch2xp, with_array, xp2torch,
)
from thinc.types import Floats2d, Ints1d
from transformers import AutoModel, AutoTokenizer, BatchEncoding
from transformers.tokenization_utils_base import TokenSpan

@dataclass
class TokensPlus:
    batch_size: int
    tok2wp: List[Ints1d]
    input_ids: torch.Tensor
    token_type_ids: torch.Tensor
    attention_mask: torch.Tensor

    def __init__(self, inputs: List[List[str]], wordpieces: BatchEncoding):
        self.input_ids = wordpieces["input_ids"]
        self.attention_mask = wordpieces["attention_mask"]
        self.token_type_ids = wordpieces["token_type_ids"]
        self.batch_size = self.input_ids.shape[0]
        self.tok2wp = []
        for i in range(self.batch_size):
            print(i, inputs[i])
            spans = [wordpieces.word_to_tokens(i, j) for j in range(len(inputs[i]))]
            print(spans)
            self.tok2wp.append(self.get_wp_starts(spans))

    def get_wp_starts(self, spans: List[Optional[TokenSpan]]) -> Ints1d:
        """Calculate an alignment mapping each token index to its first wordpiece."""
        alignment = numpy.zeros((len(spans)), dtype="i")
        for i, span in enumerate(spans):
            if span is None:
                raise ValueError(
                    "Token did not align to any wordpieces. Was the tokenizer "
                    "run with is_split_into_words=True?"
                )
            else:
                alignment[i] = span.start
        return alignment

@thinc.registry.layers("transformers_tokenizer.v1")
def TransformersTokenizer(name: str) -> Model[List[List[str]], TokensPlus]:
    def forward(model, inputs: List[List[str]], is_train: bool):
        tokenizer = model.attrs["tokenizer"]
        wordpieces = tokenizer(
            inputs,
            is_split_into_words=True,
            add_special_tokens=True,
            return_token_type_ids=True,
            return_attention_mask=True,
            return_length=True,
            return_tensors="pt",
            padding="longest"
        )
        return TokensPlus(inputs, wordpieces), lambda d_tokens: []

    return Model("tokenizer", forward, attrs={"tokenizer": AutoTokenizer.from_pretrained(name)})

def convert_transformer_inputs(model, tokens: TokensPlus, is_train):
    kwargs = {
        "input_ids": tokens.input_ids,
        "attention_mask": tokens.attention_mask,
        "token_type_ids": tokens.token_type_ids,
    }
    return ArgsKwargs(args=(), kwargs=kwargs), lambda dX: []

def convert_transformer_outputs(
    model: Model,
    inputs_outputs: Tuple[TokensPlus, Tuple[torch.Tensor]],
    is_train: bool,
) -> Tuple[List[Floats2d], Callable]:
    tplus, trf_outputs = inputs_outputs
    # trf_outputs[0] is the transformer's last hidden state, shape (batch, seq_len, hidden).
    wp_vectors = torch2xp(trf_outputs[0])
    tokvecs = [wp_vectors[i, idx] for i, idx in enumerate(tplus.tok2wp)]

    def backprop(d_tokvecs: List[Floats2d]) -> ArgsKwargs:
        # Restore entries for BOS and EOS markers
        d_wp_vectors = model.ops.alloc3f(*trf_outputs[0].shape, dtype="f")
        for i, idx in enumerate(tplus.tok2wp):
            d_wp_vectors[i, idx] += d_tokvecs[i]
        return ArgsKwargs(
            args=(trf_outputs[0],),
            kwargs={"grad_tensors": xp2torch(d_wp_vectors)},
        )

    return tokvecs, backprop

@thinc.registry.layers("transformers_encoder.v1")
def Transformer(name: str = "bert-large-cased") -> Model[TokensPlus, List[Floats2d]]:
    return PyTorchWrapper(
        AutoModel.from_pretrained(name),
        convert_inputs=convert_transformer_inputs,
        convert_outputs=convert_transformer_outputs,
    )

@thinc.registry.layers("TransformersNer.v1")
def TransformersNer(name: str, num_ner_classes: int = 16) -> Model[List[List[str]], List[Floats2d]]:
    return chain(
        TransformersTokenizer(name),
        Transformer(name),
        with_array(Softmax(num_ner_classes)),
        Linear(512, 1024)
    )

What is the best way to work out how to pipe the output of the PyTorchWrapper-wrapped TransformersTagger layer into Linear() and further layers down the chain? I've been using the following model visualization, but there are still lots of (?, ?) entries, even when I run model.initialize() on the first example of the data.

import pydot

def visualize_model(model):
    def get_label(layer):
        layer_name = layer.name
        nO = layer.get_dim("nO") if layer.has_dim("nO") else "?"
        nI = layer.get_dim("nI") if layer.has_dim("nI") else "?"
        return f"{layer.name}|({nO}, {nI})".replace(">", ">")
    dot = pydot.Dot()
    dot.set("rankdir", "LR")
    dot.set_node_defaults(shape="record", fontname="arial", fontsize="10")
    dot.set_edge_defaults(arrowsize="0.7")
    nodes = {}
    for i, layer in enumerate(model.layers):
        label = get_label(layer)
        node = pydot.Node(layer.id, label=label)
        dot.add_node(node)
        nodes[layer.id] = node
        if i == 0:
            continue
        from_node = nodes[model.layers[i - 1].id]
        to_node = nodes[layer.id]
        if not dot.get_edge(from_node, to_node):
            dot.add_edge(pydot.Edge(from_node, to_node))
    print(dot)

Output:

digraph G {
rankdir=LR;
node [fontname=arial, fontsize=10, shape=record];
edge [arrowsize="0.7"];
176 [label="tokenizer|(?, ?)"];
177 [label="pytorch|(?, ?)"];
176 -> 177;
179 [label="with_array(softmax)|(16, 1024)"];
177 -> 179;
180 [label="linear|(512, 1024)"];
179 -> 180;
}
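
One way to get fewer (?, ?) entries, assuming the chain's dimension clash has been resolved so that initialize() can actually run: pass sample inputs and outputs so Thinc's shape inference can fill in what it can (the PyTorch-wrapped transformer itself will still show (?, ?), since Thinc cannot see inside it). A sketch, where train_X and train_Y are assumed to be the NER training examples:

model = TransformersNer("bert-large-cased", num_ner_classes=16)
model.initialize(X=train_X[:5], Y=train_Y[:5])
# Textual alternative to the pydot graph: list every dimension each layer declares
# and whether shape inference has filled it in yet.
for layer in model.layers:
    dims = {name: (layer.get_dim(name) if layer.has_dim(name) else "?") for name in layer.dim_names}
    print(layer.name, dims)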
Tags: python, nlp, neural-network, spacy, spacy-transformers

1 Answer

OK, here is how to step through a Thinc model chain to check how the data shapes change between layers. For each layer in model.layers you can do the following (shown in the walkthrough further down) and look at the data type and shape each layer outputs. What I ended up doing was modifying convert_transformer_outputs to return a Ragged representation of the tokvecs instead of a list of variable-length embeddings (since different sentences have different numbers of tokens).

    ...
    tokvecs = [wp_vectors[i, idx] for i, idx in enumerate(tplus.tok2wp)]
    tokvecs_ragged = list2ragged().predict(tokvecs).data  # .data is the flat (n_tokens, hidden) array behind the Ragged
    ...
    return tokvecs_ragged, backprop
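
For context, a minimal self-contained sketch (not taken from the answer) of what list2ragged() produces: a Ragged packs a list of variable-length arrays into one flat data array plus per-item lengths, which is why it can feed straight into array layers such as Linear. The hidden size of 1024 is just an assumed example width:

    from thinc.api import list2ragged
    import numpy

    # Two "sentences" of 3 and 7 tokens, each token a 1024-dim vector.
    docs = [numpy.zeros((3, 1024), dtype="f"), numpy.zeros((7, 1024), dtype="f")]
    ragged = list2ragged().predict(docs)
    print(ragged.data.shape)   # (10, 1024): all tokens stacked into one array
    print(ragged.lengths)      # [3 7]: number of tokens per input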

Ragged is more memory efficient than Padded, and it feeds straight into the Linear layer, so I no longer need with_array() at all (whose job is just to hand the wrapped layer a plain array anyway). This gives me the shapes I want, and I can save the Softmax for the end of the architecture (I plan to add more layers); a sketch of the resulting chain is shown after the walkthrough below. It only took a little time in the docs; the Chain combinator section was the most illuminating part.

    tok = model.layers[0]    # Tokenizer
    out, dX = tok(train_X[:5], is_train=False)
    print(out.input_ids.shape)
    trans = model.layers[1]  # Transformer
    out, dY = trans(out, is_train=False)
    print(out.shape)
    lin = model.layers[2] # Linear
    out, dZ = lin(out, is_train=False)
    print([el.shape for el in out])
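
Putting the pieces together, the resulting architecture would look roughly like the sketch below. This is a reading of the answer rather than code it gives verbatim: Transformer() is assumed to use the modified convert_transformer_outputs that returns the flat (n_tokens, hidden) array, and the 1024/512 widths are carried over from the question:

    @thinc.registry.layers("TransformersNer.v2")
    def TransformersNerRagged(name: str, num_ner_classes: int = 16) -> Model[List[List[str]], Floats2d]:
        return chain(
            TransformersTokenizer(name),
            Transformer(name),                 # now outputs a single (n_tokens, 1024) array
            Linear(nO=512, nI=1024),
            Softmax(num_ner_classes, 512),     # Softmax over the 16 NER classes goes last
        )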
