How do I apply text classification using Graph2Vec embeddings?

Question

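I have a set of text documents, separated into directories according to the topic they belong to. I want to apply Graph2Vec and then use each document's embedding to train a text classifier.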

I am using Karate Club's Graph2Vec implementation: https://karateclub.readthedocs.io/en/latest/_modules/karateclub/graph_embedding/graph2vec.html#Graph2Vec.fit

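After reading their code, it seems the input to the model should be a list of NetworkX graphs, and it outputs the embedding of each graph as a list, in the same order as the graphs in the input list.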

I have implemented a function, document_to_graph(), that converts a text document into a graph. Here is my code:

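import glob
import os

from karateclub import Graph2Vec
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

label_encoder = LabelEncoder()

# One sub-folder per topic: dataset/<topic>/*.txt
# (glob.glob("dataset") would only ever match the top-level folder itself)
document_folders = sorted(glob.glob(os.path.join("dataset", "*")))
graphs = []
y = []

for folder in document_folders:
    for file in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        # document_to_graph() is my own helper (not shown); it returns a NetworkX graph
        graph = document_to_graph(file)
        graphs.append(graph)
        # Use the folder name (the topic) as the class label, not the whole path
        document_class = os.path.basename(folder)
        y.append(document_class)

# Train the Graph2Vec model
graph2vec_model = Graph2Vec()
graph2vec_model.fit(graphs)
graph_embeddings = graph2vec_model.get_embedding()

# One embedding row per graph, in the same order as `graphs`
X = graph_embeddings
y = label_encoder.fit_transform(y)

# Hold out 20% of the documents for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train an SVM classifier on the graph embeddings
clf = SVC()
clf.fit(X_train, y_train)

# Evaluate on the held-out documents
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)

print(f"Accuracy: {accuracy}")
print(report)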

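My problem is that I have thousands of documents, so memory fills up as the list of graphs grows. Even when I test on a smaller dataset, the accuracy is quite low.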

What could be the problem? Is my approach correct? Are there alternative Graph2Vec implementations besides Karate Club?
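One possible reason for the low accuracy, offered as a guess: as far as I can tell from the Karate Club source, Graph2Vec with its default attributed=False uses each node's degree as the starting label for Weisfeiler-Lehman relabelling, so the words themselves never reach the embedding. The sketch below is a simplified stand-in for that relabelling, not the library's code. The names wl_features and degree_labels and the toy documents are made up for illustration. It shows two documents with the same graph shape but different words producing identical features under degree labels, and different features once the words are used as labels.

```python
import hashlib


def degree_labels(adjacency):
    """Initial labels based only on node degree (no word information)."""
    return {node: len(neighbours) for node, neighbours in adjacency.items()}


def wl_features(adjacency, labels, iterations=2):
    """Collect Weisfeiler-Lehman style labels for one graph.

    adjacency: dict mapping node -> list of neighbouring nodes
    labels:    dict mapping node -> initial label (degree or word)
    """
    current = {node: str(label) for node, label in labels.items()}
    collected = sorted(current.values())
    for _ in range(iterations):
        updated = {}
        for node, neighbours in adjacency.items():
            # Combine the node's own label with its sorted neighbour labels
            parts = [current[node]] + sorted(current[n] for n in neighbours)
            updated[node] = hashlib.md5("_".join(parts).encode()).hexdigest()
        current = updated
        collected += sorted(current.values())
    return collected


# Two toy documents whose graphs are both a 3-node path: 0 - 1 - 2
graph_a = {0: [1], 1: [0, 2], 2: [1]}
graph_b = {0: [1], 1: [0, 2], 2: [1]}
words_a = {0: "stock", 1: "market", 2: "crash"}
words_b = {0: "football", 1: "match", 2: "score"}

same_with_degrees = (
    wl_features(graph_a, degree_labels(graph_a))
    == wl_features(graph_b, degree_labels(graph_b))
)
same_with_words = wl_features(graph_a, words_a) == wl_features(graph_b, words_b)

print(same_with_degrees, same_with_words)
```

If this is what is happening, it may be worth storing each word in a node attribute and trying Graph2Vec(attributed=True), which, as far as I can tell, reads a node attribute named "feature".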

Answer 1

Could you tell me what input format graph2vec expects, and how the data is read from the files? ^-^
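On the input format: per the question, fit() takes a list of NetworkX graphs. As far as I can tell from the Karate Club source, node ids should be consecutive integers starting at 0. The question does not show document_to_graph(), so the sketch below is only one hypothetical way to get from raw text to such a graph: a word co-occurrence graph where each distinct word becomes a node and words within a small window are linked. The function name text_to_cooccurrence_graph and the window parameter are assumptions for illustration. It uses only the standard library and returns plain node and edge data.

```python
import re


def text_to_cooccurrence_graph(text, window=2):
    """Build a word co-occurrence graph from raw text.

    Returns (features, edges):
      features maps node id (0..n-1) -> word
      edges is a sorted list of (u, v) pairs with u < v
    """
    tokens = re.findall(r"\w+", text.lower())

    # Assign consecutive integer ids in order of first appearance
    node_of = {}
    for token in tokens:
        node_of.setdefault(token, len(node_of))

    # Link each token to the next `window` tokens, skipping self-loops
    edges = set()
    for i, token in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            u, v = node_of[token], node_of[other]
            if u != v:
                edges.add((min(u, v), max(u, v)))

    features = {node: word for word, node in node_of.items()}
    return features, sorted(edges)


features, edges = text_to_cooccurrence_graph("the cat sat on the mat")
print(features)
print(edges)
```

To turn this into a NetworkX graph, create nx.Graph(), call add_nodes_from(features) and add_edges_from(edges), then attach the words with nx.set_node_attributes(graph, features, "feature"). Limiting the vocabulary (for example, dropping stop words before building the graph) keeps each graph smaller, which may also help with memory.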
