How can I remove duplicate strings from a set of strings using a hashmap or a better method?

Question

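I have a dataset of more than 20 million strings, each between 10 and 400 characters long, and I want to remove the duplicate (or very similar) strings. I found that faiss might solve this problem, but I am not sure whether it is the right approach. Here is my solution: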

import numpy as np
import faiss

np.random.seed(42)
data = np.random.random((1000, 128)).astype('float32')

use_gpu = faiss.get_num_gpus() > 0

index = faiss.GpuIndexFlatL2(faiss.StandardGpuResources(), 128) if use_gpu else faiss.IndexFlatL2(128)

index.add(data)
query_vector = np.random.random((1, 128)).astype('float32')
k = 5  
distances, indices = index.search(query_vector, k)

print("Indices of the closest vectors:", indices)
print("Distances:", distances)

My question is: do I need to encode the strings first and then add them to the index? I feel that would still take a lot of time.

Any suggestions would be helpful.

Answer 1

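Here are some general steps you could consider:

Use an embedding model such as Word2Vec or FastText to convert each string into a numeric vector.
Store the vectors in a matrix that FAISS can use.
Use FAISS to search for very similar or duplicate strings.

Here is a simple example using Word2Vec from the Gensim library: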

from gensim.models import Word2Vec
import numpy as np
import faiss

# Example string data (extend with your own strings)
string_data = ["example sentence one", "example sentence two", "another sentence"]  # ...

# Tokenize once and train a Word2Vec model on the tokens
tokenized = [sentence.split() for sentence in string_data]
word2vec_model = Word2Vec(tokenized, vector_size=128, window=5, min_count=1, workers=4)

def embed(tokens):
    # Average the word vectors so every string maps to one 128-dimensional vector
    return word2vec_model.wv[tokens].mean(axis=0)

# One float32 row per string, the layout FAISS expects
vector_data = np.array([embed(tokens) for tokens in tokenized], dtype='float32')

# Build the FAISS index
use_gpu = faiss.get_num_gpus() > 0
index = faiss.GpuIndexFlatL2(faiss.StandardGpuResources(), 128) if use_gpu else faiss.IndexFlatL2(128)
index.add(vector_data)

# Example similarity search; every query word must be in the model's vocabulary
query_vector = embed("example sentence".split()).reshape(1, -1)
k = min(5, len(string_data))  # asking for more neighbors than indexed strings pads results with -1
distances, indices = index.search(query_vector, k)

print("Indices of the closest strings:", indices)
print("Distances:", distances)
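
The example above only searches for neighbors; it does not yet remove anything. One possible way to turn the search results into a deduplicated set is a greedy pass over the output of index.search(vector_data, k), that is, using the indexed vectors themselves as queries. The sketch below is an illustration, not part of the original answer: the function name select_unique and the threshold value are hypothetical, and the threshold (a cutoff on the squared L2 distance that IndexFlatL2 reports) has to be tuned on your own data.

```python
def select_unique(distances, indices, threshold):
    """Greedy pass: keep a row unless an already-kept row lies within threshold.

    distances, indices: the two arrays returned by index.search(vectors, k)
    when the indexed vectors are also used as the queries.
    threshold: hypothetical cutoff on the squared L2 distance.
    """
    kept = set()
    for row, (dists, neighbors) in enumerate(zip(distances, indices)):
        # A row is a duplicate if one of its neighbors was already kept
        # and is close enough; -1 padding entries never match a kept row.
        is_duplicate = any(
            n != row and n in kept and d <= threshold
            for d, n in zip(dists, neighbors)
        )
        if not is_duplicate:
            kept.add(row)
    return sorted(kept)


# Hand-written search results for three vectors: rows 0 and 1 are near-identical.
example_distances = [[0.0, 0.01, 5.0], [0.0, 0.01, 5.0], [0.0, 5.0, 5.0]]
example_indices = [[0, 1, 2], [1, 0, 2], [2, 0, 1]]
unique_rows = select_unique(example_distances, example_indices, threshold=0.1)
```

With real data the call would look like select_unique(*index.search(vector_data, k), threshold=...). Because only the k nearest neighbors of each row are inspected, a string with more than k near-duplicates may need a larger k or a second pass.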

Be sure to adjust the parameters and the method to your needs and the characteristics of your data.
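
For exact duplicates, which the hashmap in the question title points at, no embedding is needed. A minimal sketch, assuming the strings can be streamed once in order: store a fixed-size hash of each string in a set and keep only the first occurrence. The names dedupe_exact and normalize are hypothetical, and the optional normalizer (lowercasing and collapsing whitespace) is only one assumed notion of "very similar".

```python
import hashlib


def dedupe_exact(strings, normalize=None):
    """Return the strings in order, dropping any whose (normalized) text was seen before."""
    seen = set()
    unique = []
    for s in strings:
        key_text = normalize(s) if normalize else s
        # Store a 16-byte digest instead of the full string to limit memory use.
        digest = hashlib.blake2b(key_text.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique


def collapse(s):
    # Assumed normalization: ignore case and runs of whitespace.
    return " ".join(s.lower().split())


sample = ["a b", "a  B", "c", "a b"]
exact_result = dedupe_exact(sample)
normalized_result = dedupe_exact(sample, normalize=collapse)
```

This runs in a single pass with one set lookup per string, so it can serve as a cheap first stage before any FAISS-based near-duplicate search on what remains.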
