Hey, I'm trying to build an Arabic word-vector model for spaCy, following this GitHub notebook: https://github.com/bakrianoo/aravec/blob/master/aravec-with-spacy.ipynb
My code differs from that notebook in several ways, so here it is:
import gensim
import spacy
import re

# Clean/Normalize Arabic Text
def clean_str(text):
    search = ["أ","إ","آ","ة","_","-","/",".","،"," و "," يا ",'"',"ـ","'","ى","\\",'\n', '\t','"','?','؟','!']
    replace = ["ا","ا","ا","ه"," "," ","","",""," و"," يا","","","","ي","",' ', ' ',' ',' ? ',' ؟ ',' ! ']

    # remove tashkeel (diacritics)
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
    text = re.sub(p_tashkeel, "", text)

    # remove elongation (collapse repeated characters to two)
    p_longation = re.compile(r'(.)\1+')
    subst = r"\1\1"
    text = re.sub(p_longation, subst, text)

    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')

    for i in range(0, len(search)):
        text = text.replace(search[i], replace[i])

    # trim
    text = text.strip()
    return text
# load the AraVec model
model = gensim.models.Word2Vec.load("full_grams_cbow_100_twitter.mdl")
print("We've", len(model.wv.index_to_key), "vocabularies")

# make a directory called "spacyModel"
%mkdir spacyModel

# export the word2vec format to the directory
model.wv.save_word2vec_format("./spacyModel/aravec.txt")

import gzip
content = b"./spacyModel/aravec.txt"
f = gzip.open('./spacyModel/aravec.txt.gz', 'wb')
f.write(content)
f.close()
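As a side check, here is a minimal sketch of how one could verify what actually ends up inside a .gz archive, using a toy two-word vectors file (the filenames `aravec_sample.txt` / `aravec_sample.txt.gz` are made up for illustration, not the real AraVec paths): read the source file's bytes first, compress those, then decompress and compare.

```python
import gzip

# write a tiny word2vec-format file: header "2 3" = 2 words, 3 dimensions
with open("aravec_sample.txt", "w", encoding="utf-8") as f:
    f.write("2 3\nword1 0.1 0.2 0.3\nword2 0.4 0.5 0.6\n")

# read the file's actual bytes (not the path string) and compress them
with open("aravec_sample.txt", "rb") as f:
    data = f.read()
with gzip.open("aravec_sample.txt.gz", "wb") as gz:
    gz.write(data)

# decompress and confirm the archive holds the file's contents
with gzip.open("aravec_sample.txt.gz", "rb") as gz:
    recovered = gz.read()

print(recovered == data)                              # True
print(recovered.decode("utf-8").splitlines()[0])      # 2 3
```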
Then I ran this line:
!python -m spacy init vectors ar ./spacyModel/aravec.gz ./spacyModel/spacy.aravec.model
and this is the output:
[i] Creating blank nlp object for language 'ar'
[+] Successfully converted 1 vectors
[+] Saved nlp object with vectors to output directory. You can now use the path
to it in your config as the 'vectors' setting in [initialize].
C:\Users\Roqiua\Desktop\new\aravec-master\aravec-master\spacyModel\spacy.aravec.model
[2023-03-01 14:19:04,041] [INFO] Reading vectors from spacyModel\aravec.gz
0it [00:00, ?it/s]
1it [00:00, ?it/s]
[2023-03-01 14:19:04,367] [INFO] Loaded vectors from spacyModel\aravec.gz
But when I try to test the model, the output is an empty array:
nlp = spacy.load("./spacyModel/spacy.aravec.model/")

# Define the preprocessing class
class Preprocessor:
    def __init__(self, tokenizer, **cfg):
        self.tokenizer = tokenizer

    def __call__(self, text):
        preprocessed = clean_str(text)
        return self.tokenizer(preprocessed)

# Apply the `Preprocessor` class
nlp.tokenizer = Preprocessor(nlp.tokenizer)

# Test your model
nlp("قطة").vector
Output:
array([], dtype=float32)
Where is the problem?