How to improve a German text classification model in spaCy

Question

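I am working on a text classification project and am using spaCy for it. Right now my accuracy is almost 70%, but that is not enough. I have been trying to improve the model for the past two weeks, with no success so far. I am looking for advice on what I should do or try. Any help would be highly appreciated!

So, here is what I have done so far:

1) Preparing the data:

I have an imbalanced dataset of German news with 21 categories (such as POLITICS, ECONOMY, SPORT, CELEBRITIES, etc.). To make the categories equal in size, I duplicated the small classes. As a result, I have 21 files with almost 700,000 lines of text. I then normalize this data with the following code: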

import re

import spacy
from charsplit import Splitter

POS = ['NOUN', 'VERB', 'PROPN', 'ADJ', 'NUM']  # allowed parts of speech

nlp_helper = spacy.load('de_core_news_sm')
splitter = Splitter()

# stop_words (words to drop) and german (vocabulary of known German words)
# are collections defined elsewhere; they are not shown in the question.

def normalizer(texts):
    arr = []  # normalized texts, returned as the result of normalization

    docs = nlp_helper.pipe(texts)  # create Doc objects for multiple lines
    for doc in docs:  # iterate over each Doc object
        text = []  # words of the normalized text
        for token in doc:  # for each word in the text
            word = token.lemma_.lower()  # keep the Token itself so token.pos_ stays available

            if word not in stop_words and token.pos_ in POS:  # drop stop words and unwanted parts of speech
                if len(word) > 8 and token.pos_ == 'NOUN':  # only nouns are split
                    _, word1, word2 = splitter.split_compound(word)[0]  # use only the most probable split
                    word1 = word1.lower()
                    word2 = word2.lower()
                    if word1 in german and word2 in german:
                        text.append(word1)
                        text.append(word2)
                    elif word1[:-1] in german and word2 in german:  # word1[:-1] drops the 's' that joins two words
                        text.append(word1[:-1])
                        text.append(word2)
                    else:
                        text.append(word)
                else:
                    text.append(word)
        # remove punctuation; '-' is escaped so that ')-=' is not read as a character range
        arr.append(re.sub(r'[.,;:?!"()\-=_+*&^@/\']', ' ', ' '.join(text)))
    return arr

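Some explanations of the code above:

POS - the list of allowed parts of speech. If the word I am currently processing has a part of speech that is not in this list, I delete it.

stop_words - just a list of words that I delete.

splitter.split_compound(word)[0] - returns a tuple with the most probable division of the compound word (I use it to split long German words into shorter, more widely used ones). Here is the link to the repository with this functionality.

To sum up: I take the lemma of each word, lowercase it, remove stop words and some parts of speech, split compound words, and remove punctuation. Then I join all the words and return an array of normalized lines.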
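The compound-splitting rule from the normalizer can be pulled out into a small pure function, which makes it easier to check on its own. This is only a sketch under assumptions: `choose_parts` and the tiny `vocab` set are hypothetical names, the candidate split is passed in directly instead of coming from `splitter.split_compound`, and, unlike the original code, the last character of the first part is dropped only when it really is an 's'.

```python
def choose_parts(word, part1, part2, vocab):
    """Return the pieces to keep for `word`, given one candidate split.

    `vocab` is a hypothetical set of known lowercase German words.
    """
    part1, part2 = part1.lower(), part2.lower()
    if part1 in vocab and part2 in vocab:
        return [part1, part2]
    # Drop a trailing linking 's' from the first part and try again.
    if part1.endswith('s') and part1[:-1] in vocab and part2 in vocab:
        return [part1[:-1], part2]
    # No trustworthy split: keep the word whole.
    return [word]


vocab = {'arbeit', 'markt', 'haus', 'tür'}  # toy vocabulary for illustration
print(choose_parts('arbeitsmarkt', 'Arbeits', 'Markt', vocab))    # linking 's' removed
print(choose_parts('haustür', 'Haus', 'Tür', vocab))              # plain split
print(choose_parts('bundeskanzler', 'Bundes', 'Kanzler', vocab))  # parts unknown, kept whole
```

Keeping this decision separate from the spaCy loop also makes it straightforward to count how often a split is rejected, which may hint at gaps in the vocabulary.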

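2) Training the model

I train the model with de_core_news_sm (so that in the future the model can be used not only for classification but also for normalization). Here is the training code: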

from random import shuffle

import spacy

nlp = spacy.load('de_core_news_sm')

# spaCy v2 API: create the text categorizer and append it to the pipeline
textcat = nlp.create_pipe('textcat', config={"exclusive_classes": False, "architecture": 'simple_cnn'})
nlp.add_pipe(textcat, last=True)
for category in categories:
    textcat.add_label(category)

# train only the text categorizer; the other components stay disabled
pipe_exceptions = ["textcat"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()

    for i in range(n_iter):
        shuffle(data)
        batches = spacy.util.minibatch(data)  # default batch size

        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.25)

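Some explanations of the code above:

data - a list of lists, where each inner list contains a line of text and a dictionary with categories (just like in the docs)

categories - the list of categories

n_iter - the number of training iterations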
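To make the shape of data concrete, here is a minimal sketch of how one entry could be built. It assumes the layout used in the spaCy v2 text classification example, where each annotation is a dictionary with a "cats" key, and it assumes one category per text, which the question does not state explicitly. `make_example`, the shortened category list, and the sample texts are hypothetical.

```python
def make_example(text, label, categories):
    """Build one [text, annotations] pair with a one-hot 'cats' dictionary."""
    if label not in categories:
        raise ValueError(f"unknown category: {label}")
    cats = {category: category == label for category in categories}
    return [text, {"cats": cats}]


categories = ["POLITICS", "ECONOMY", "SPORT"]  # shortened, hypothetical list
data = [
    make_example("bundestag wahl partei", "POLITICS", categories),
    make_example("fußball spiel tor", "SPORT", categories),
]

# The training loop unpacks a batch in the same way:
texts, annotations = zip(*data)
print(annotations[0])
```

Raising on an unknown label catches typos in category names early, before they silently turn into an all-False annotation.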

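3) Finally, I just saved the model with the to_disk method.

With the code above I managed to train a model with 70% accuracy. Here is the list of what I have tried so far to improve this score: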

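1) Using another architecture (ensemble) - did not give any improvement
2) Training on non-normalized data - the result was much worse
3) Using a pretrained BERT model - could not do it (here is my unanswered question about it)
4) Training de_core_news_md instead of de_core_news_sm - did not give any improvement (I tried it because, according to the docs, there could be an improvement thanks to the vectors, if I understood correctly. Correct me if I am wrong)
5) Training on data normalized in a slightly different way (without lowercasing and punctuation removal) - did not give any improvement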

So now I am a little stuck on what to do next. I would be very grateful for any hint or advice.

Thanks in advance for your help!

Answer 1

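The first thing I would suggest is increasing the batch size. After that, look at your optimizer (Adam if possible) and the learning rate, for which I don't see the code here. Finally, you can try changing the dropout.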
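As an illustration of the batch size suggestion: spaCy v2 lets you pass a size argument to spacy.util.minibatch, and spacy.util.compounding can supply a growing series of sizes. The stand-alone functions below only mimic that idea so the sketch runs without spaCy. `compounding_sizes` and `minibatches` are hypothetical names, and the values 4, 32 and 1.5 are arbitrary, not tuned recommendations.

```python
from itertools import islice


def compounding_sizes(start, stop, compound):
    """Yield batch sizes that grow geometrically from start, capped at stop."""
    size = start
    while True:
        yield min(size, stop)
        size *= compound


def minibatches(items, sizes):
    """Split items into consecutive batches whose lengths come from `sizes`."""
    items = iter(items)
    for size in sizes:
        batch = list(islice(items, int(size)))
        if not batch:
            return
        yield batch


data = list(range(100))  # stand-in for the real training examples
batches = list(minibatches(data, compounding_sizes(4.0, 32.0, 1.5)))
print([len(batch) for batch in batches])
```

Starting with small batches and growing them is one common compromise: early updates are frequent, while later ones are less noisy. A fixed larger size would be the simpler change to try first.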

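Also, if you are experimenting with neural networks and plan to change a lot, it would be better to switch to PyTorch or TensorFlow. In PyTorch you will have the HuggingFace library, which has BERT built in.

Hope this helps!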
