I'm trying to build a book recommendation system with word2vec, following this article: https://medium.com/@ashok.1055/building-book-recommendation-system-16f2cdf615f2
When I call the recommendation function with an Arabic title, it gives me an error:
recommendations2("الخليفة")
KeyError Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
9 frames
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()
/usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._unpack_bool_indexer()
KeyError: 'الخليفة'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-63-d299fc330241> in <module>
----> 1 recommendations2("الخليفة")
<ipython-input-56-c761695971d2> in recommendations2(title)
16 indices = pd.Series(df1.index, index = df1['Title']).drop_duplicates()
17
---> 18 idx = indices[title]
19 sim_scores = list(enumerate(cosine_similaritiess[idx]))
20 sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
/usr/local/lib/python3.8/dist-packages/pandas/core/series.py in __getitem__(self, key)
940
941 elif key_is_scalar:
--> 942 return self._get_value(key)
943
944 if is_hashable(key):
/usr/local/lib/python3.8/dist-packages/pandas/core/series.py in _get_value(self, label, takeable)
1049
1050 # Similar to Index.get_value, but we do not fall back to positional
-> 1051 loc = self.index.get_loc(label)
1052 return self.index._get_values_for_loc(self, loc, label)
1053
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'الخليفة'
These are the functions I use to preprocess the data:
# Clean/Normalize Arabic Text
import re
import string

arabic_punctuations = '''`÷×؛<>_()*&^%][ـ،/:"؟.,'{}~¦+|!”…“–ـ'''
english_punctuations = string.punctuation
punctuations_list = arabic_punctuations + english_punctuations

# same as "remove tashkeel" in clear_str()
arabic_diacritics = re.compile("""
                             ّ    | # Tashdid
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                             ـ     # Tatwil/Kashida
                         """, re.VERBOSE)

def remove_diacritics(text):
    text = re.sub(arabic_diacritics, '', text)
    return text

# --------------------------
def remove_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)

# -----------------------------
def normalize_arabic(text):
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "ء", text)
    text = re.sub("ئ", "ء", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("گ", "ك", text)
    return text
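Note that these functions change the spelling of the exact title I'm searching for. A minimal standalone demo of normalize_arabic (copied from above so it runs on its own):

```python
import re

# Copy of normalize_arabic from above, shown standalone for a quick check
def normalize_arabic(text):
    text = re.sub("[إأآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "ء", text)
    text = re.sub("ئ", "ء", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("گ", "ك", text)
    return text

# The title from the failing call: the final ta marbuta (ة) becomes ha (ه)
print(normalize_arabic("الخليفة"))  # -> الخليفه
```

So if these steps were applied to the stored titles, the normalized spelling differs from the raw title passed to the lookup.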
and these two functions:
# Generate the average word2vec for each book description
def vectors2(x):
    # Creating a list for storing the vectors (description into vectors)
    global word_embeddingss
    word_embeddingss = []
    # Reading each book description
    for line in df1['c']:
        avgword2vecc = None
        countt = 0
        for word in line.split():
            if word in google_model.wv.vocab:
                countt += 1
                if avgword2vecc is None:
                    avgword2vecc = google_model.wv[word]
                else:
                    avgword2vecc = avgword2vecc + google_model.wv[word]
        if avgword2vecc is not None:
            avgword2vecc = avgword2vecc / countt
            word_embeddingss.append(avgword2vecc)
# Recommending the Top 5 similar books
def recommendations2(title):
    # Calling the function vectors
    vectors2(df1)
    # finding cosine similarity for the vectors
    cosine_similaritiess = cosine_similarity(word_embeddingss, word_embeddingss)
    # taking the title and book image link and storing them in a new data frame called books
    books = df1[['Title', 'Cover']]
    # Reverse mapping of the index
    indices = pd.Series(df1.index, index=df1['Title']).drop_duplicates()
    idx = indices[title]
    sim_scores = list(enumerate(cosine_similaritiess[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:6]
    book_indices = [i[0] for i in sim_scores]
    recommend = books.iloc[book_indices]
    for index, row in recommend.iterrows():
        response = requests.get(row['Cover'])
        img = Image.open(BytesIO(response.content))
        plt.figure()
        plt.imshow(img)
        plt.title(row['Title'])
The Arabic books dataset is downloaded from https://www.kaggle.com/code/jjresnick/jamalon-arabic-books-dataset/data and I load it from Google Drive in the code:

df1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/jamalon-big.csv')

It works for English recommendations but not for Arabic. So could you help me fix this error? Or do I have to use another method/model to recommend Arabic books?
In theory, if you have a good set of Arabic word-vectors, the same approach from the article you chose as inspiration should work for Arabic book titles/descriptions.

But, right away, regarding the specific error you're hitting: the KeyError seems to be for a multi-word text, with spaces inside. No set of plain Arabic word-vectors would be able to return a lookup vector for a multi-word string. The KeyError actually appears to come from a key-based lookup on a Pandas data structure, though. But, from the traceback, it isn't clear which line of your code started the chain of calls that led to the error. Did you leave out part of the error message, or has your local interpreter/notebook been reconfigured to show fewer traceback frames? It's always easier to understand errors with all the traceback frames, so if you can edit your question (or ask future questions) with the full traceback, please do so.
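To confirm whether that Pandas lookup is the culprit, check membership in the index before indexing. A minimal sketch; the two-row frame here is only a hypothetical stand-in for your df1, chosen to show what happens when the stored titles were normalized but the lookup key was not:

```python
import pandas as pd

# Hypothetical stand-in for df1; your real frame comes from the CSV
df1 = pd.DataFrame({"Title": ["الخليفه", "كتاب اخر"], "Cover": ["url1", "url2"]})
indices = pd.Series(df1.index, index=df1["Title"]).drop_duplicates()

title = "الخليفة"  # the raw title from the failing call
if title in indices.index:
    idx = indices[title]
else:
    # A mismatch here (e.g. normalized stored titles vs. a raw key)
    # produces exactly the KeyError in your traceback
    print(f"{title!r} not found; stored titles: {list(indices.index)}")
```

If the else-branch fires for your real data, apply the same preprocessing to the lookup key as you applied to the stored titles.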
But there are some other general problems with your approach that you should correct for a more robust, understandable, and debuggable setup:

- word_embeddingss is a global, declared inside one function and then used in another. That makes it harder to analyze what influences other calculations. If you really need some global/top-level data structure, like your collection of all books, declare it at the top level. Generally, you'd want to avoid writing functions that refer to globals directly, and instead pass things in where needed.
- You haven't described the type/contents of the df1 variable in any way. A good question would give some hint, via its setup code or some demo output, of what's in df1.
- x is not a good parameter name; it should be named something more descriptive (and vectors2() never actually uses it). If vectors2() is a one-time initialization of the (global) vectors from the other global df1, it probably should not be called every time you ask for title-based recommendations, but only once, before and outside the recommendation requests.

The approach you're attempting can work, but you should clean up the code naming/organization, and if you still hit similar errors, add improved details (especially fuller error tracebacks showing exactly which of your lines trigger the error) to this question or a follow-up to get more specific help. Good luck!
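To make the "compute the vectors once, outside the recommendation call" point concrete, here's a structural sketch. It uses a plain dict of word-to-vector in place of your gensim model, and the names (average_vector, build_embeddings) and toy data are mine, not from your code:

```python
import numpy as np
import pandas as pd

def average_vector(text, vectors):
    """Average the vectors of the words we know; None if no word is known."""
    known = [vectors[w] for w in text.split() if w in vectors]
    return np.mean(known, axis=0) if known else None

def build_embeddings(df, vectors):
    """One-time pass over the descriptions; no globals involved."""
    return [average_vector(line, vectors) for line in df["c"]]

# Toy stand-in for google_model.wv: any mapping of word -> vector works
toy_vectors = {"كتاب": np.array([1.0, 0.0]), "جميل": np.array([0.0, 1.0])}
df = pd.DataFrame({"Title": ["t1"], "c": ["كتاب جميل"]})

embeddings = build_embeddings(df, toy_vectors)  # called once, at startup
print(embeddings[0])  # -> [0.5 0.5]
```

A recommendation function would then take df and embeddings as parameters instead of recomputing a global list on every call.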