Pre-training word embeddings in R using GloVe

Question

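I am trying to pre-train word embeddings in R using GloVe. I have a small corpus of only 40,000 tokens. It consists of 30 texts (political statements) with 3 document variables: speaker, party, and years in government. The data has been cleaned by removing stop words and punctuation. How do I train a GloVe model so that I can find the cosine distance between words? Is this even possible with only 30 texts? Much of the code I have seen online only uses pre-trained embeddings, with no additional data.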

This is what I have done so far:

summary(df)

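text: Length 30, Class character, Mode character

years in government: Min. 1.000, 1st Qu. 2.000, Median 3.000, Mean 3.867, 3rd Qu. 5.750, Max. 11.000

party: Length 30, Class character, Mode character

speaker: Length 30, Class character, Mode character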

library(text2vec)

# Iterate over the cleaned texts, one document per speaker id
it <- itoken(df$budgetTextClean,
             tokenizer = word_tokenizer,
             ids = df$speaker,
             progressbar = TRUE)

vocab <- create_vocabulary(it) # use uni-grams

# Prune low-frequency words from the vocabulary

vocab <- prune_vocabulary(vocab, term_count_min = 3)

# What is in the vocabulary?

print(vocab)

vectorizer <- vocab_vectorizer(vocab)

# Create a term co-occurrence matrix (TCM) with a symmetric skip-gram window of 5

tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

# Maximum number of co-occurrences used in the weighting function; here the number of vocabulary terms divided by 100

x_max <- length(vocab$doc_count)/100

# Set up the GloVe model and fit the embedding matrix
glove_model <- GloVe$new(rank = 300, x_max = x_max)
glove_embedding <- glove_model$fit_transform(tcm, n_iter = 20, convergence_tol = 0.01, n_threads = 4)

# Combine the main and context embeddings (by summing them) into one matrix

glove_embedding <- glove_embedding + t(glove_model$components) # the transpose of the context matrix
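
For the cosine-distance step asked about above, here is a minimal sketch in base R. It assumes an embedding matrix with one row per word, which is the shape of `glove_embedding` after the sum. The matrix `emb`, its values, the three example words, and the helper `cosine_sim` are all hypothetical and stand in for the real embeddings.

```r
# Toy embedding matrix: rows are words, columns are dimensions.
# Values and words are hypothetical, for illustration only.
emb <- matrix(
  c(1, 0, 1,
    2, 0, 2,
    0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("budget", "spending", "health"), NULL)
)

# Cosine similarity between all pairs of rows (hypothetical helper).
cosine_sim <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  unit %*% t(unit)                # dot products of unit vectors
}

sims <- cosine_sim(emb)

# Similarity of every word to a target word, highest first.
# The target word itself is included, with similarity 1.
sort(sims["budget", ], decreasing = TRUE)

# Cosine distance is one minus cosine similarity.
dists <- 1 - sims
```

With real embeddings, the same function could be applied to `glove_embedding` directly, provided the target word survived vocabulary pruning. text2vec also ships a `sim2()` function with a cosine method that may be more convenient for large matrices.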
