Arabic word vector model with spaCy


Hey, I'm trying to build an Arabic word vector model for spaCy, following this GitHub notebook: https://github.com/bakrianoo/aravec/blob/master/aravec-with-spacy.ipynb

Several things in my code differ from that notebook, so here is my code:

import gensim
import spacy
import re


# Clean/Normalize Arabic Text
def clean_str(text):
    search = ["أ","إ","آ","ة","_","-","/",".","،"," و "," يا ",'"',"ـ","'","ى","\\",'\n', '\t','"','?','؟','!']
    replace = ["ا","ا","ا","ه"," "," ","","",""," و"," يا","","","","ي","",' ', ' ',' ',' ? ',' ؟ ',' ! ']
    
    #remove tashkeel
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
    text = re.sub(p_tashkeel,"", text)
    
    # remove elongation (collapse any run of a repeated character to two)
    p_longation = re.compile(r'(.)\1+')
    subst = r"\1\1"
    text = re.sub(p_longation, subst, text)
    
    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')
    
    for i in range(0, len(search)):
        text = text.replace(search[i], replace[i])
    
    #trim    
    text = text.strip()

    return text


# Load the AraVec model
model = gensim.models.Word2Vec.load("full_grams_cbow_100_twitter.mdl")
print("We've", len(model.wv.index_to_key), "vocabularies")

# make a directory called "spacyModel"
%mkdir spacyModel

# Export the vectors in word2vec text format to the directory
model.wv.save_word2vec_format("./spacyModel/aravec.txt")

import gzip
content = b"./spacyModel/aravec.txt"
f = gzip.open('./spacyModel/aravec.txt.gz', 'wb')
f.write(content)
f.close()

Then I run this line:

!python -m spacy init vectors ar ./spacyModel/aravec.gz ./spacyModel/spacy.aravec.model

This is the output:

[i] Creating blank nlp object for language 'ar'
[+] Successfully converted 1 vectors
[+] Saved nlp object with vectors to output directory. You can now use the path
to it in your config as the 'vectors' setting in [initialize].
C:\Users\Roqiua\Desktop\new\aravec-master\aravec-master\spacyModel\spacy.aravec.model
[2023-03-01 14:19:04,041] [INFO] Reading vectors from spacyModel\aravec.gz

0it [00:00, ?it/s]
1it [00:00, ?it/s]
[2023-03-01 14:19:04,367] [INFO] Loaded vectors from spacyModel\aravec.gz

But when I try to test this model, the output is an empty array:

nlp = spacy.load("./spacyModel/spacy.aravec.model/")

# Define the preprocessing Class
class Preprocessor:
    def __init__(self, tokenizer, **cfg):
        self.tokenizer = tokenizer

    def __call__(self, text):
        preprocessed = clean_str(text)
        return self.tokenizer(preprocessed)

# Apply the `Preprocessor` Class
nlp.tokenizer = Preprocessor(nlp.tokenizer)

# Test your model
nlp("قطة").vector

Output:

array([], dtype=float32)

What is wrong here?
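
A likely cause, judging only from the code and log shown here: the gzip step writes the bytes of the path string `b"./spacyModel/aravec.txt"` into the archive rather than the contents of that file. The archive then holds a single line of text, which would be consistent with the log reporting "Successfully converted 1 vectors" and with the empty result. A minimal sketch of compressing the actual file contents follows; the helper name `gzip_file` is hypothetical and not part of the AraVec notebook or of spaCy.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src_path, dst_path):
    """Compress the contents of src_path into a gzip file at dst_path."""
    # Open the source in binary mode and stream its bytes into the archive,
    # instead of writing the path string itself.
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


if __name__ == "__main__":
    # Paths taken from the question; the file is only compressed if it exists.
    vectors_txt = Path("./spacyModel/aravec.txt")
    if vectors_txt.exists():
        gzip_file(vectors_txt, "./spacyModel/aravec.txt.gz")
```

Note also that the code writes `aravec.txt.gz` while the `spacy init vectors` command reads `aravec.gz`, so the command would need to point at the file that was actually written, for example `python -m spacy init vectors ar ./spacyModel/aravec.txt.gz ./spacyModel/spacy.aravec.model`. After rerunning, the converted-vectors count in the log should match the vocabulary size printed earlier rather than 1.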

machine-learning nlp spacy arabic