Building a doc2vec model and using gensim to find textually similar reviews


The dataset is the Amazon review dataset, distributed as a gz file.

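import gzip

# A function to read the zipped data at a specific path
#
# How to use:
# PATH = "/path/to/file"
# for line in parse(PATH):
#   do something with line
#
def parse(path):
    # The context manager closes the file once iteration finishes.
    with gzip.open(path, 'r') as g:
        for l in g:
            # eval() executes arbitrary code; only use it on trusted data.
            yield eval(l)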
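
The parse function relies on eval, which executes whatever each line contains. Below is a minimal sketch of a safer reader, assuming each line is either strict JSON or a Python-literal dict. The name parse_safe is hypothetical and is not part of the posted code.

```python
import ast
import gzip
import json


def parse_safe(path):
    """Yield one dict per line of a gzipped file without calling eval()."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                # Strict JSON lines (double-quoted keys and strings).
                yield json.loads(line)
            except json.JSONDecodeError:
                # Assumption: remaining lines are Python-literal dicts
                # (single-quoted); literal_eval does not run code.
                yield ast.literal_eval(line)
```

It can be used as a drop-in replacement for parse in the loop shown in the comment above, since it yields the same kind of dict per review.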

The following code is given:

import os
def read_reviewers_data(fname, min_count=0):
    '''
    Save all reviews into their own product asin files.
    Make sure you have 'product' folder when you run this answer.
    In each file, you can choose your own log structure. In this answer, log
    structure is like
        "reviewText"\t"reviewerID"\t"helpful"
    Args: 
        fname: dataset file path
        min_count: minimum number of reviews of a product
    Returns:
        none
    '''
    if not os.path.isdir('product'):
        os.makedirs('product')
    asin_list = []
    tmp_list = []
    last_asin = ""
    j = 0
    for i in parse(fname):
        if last_asin != i['asin']:
            if len(tmp_list) > min_count:
                f = open("product/" + last_asin+".txt", 'w')
                for one in tmp_list:
                    f.write(one)
                f.close()
            tmp_list = []
            last_asin = i['asin']
        tmp_list.append(i["reviewText"] + '\t' + i["reviewerID"] +
                    '\t' + handle_helpful(i["helpful"]) + "\n")
        j += 1
        if j > 100000:
            break
            
def handle_helpful(helpful):
    '''
    Helper function that calculates the helpfulness score
    Args: 
        helpful: list. The first element is the number of people who found the review helpful. The second element
            is the total number of people who rated the review
    Returns:
        String: number representing helpfulness
    '''
    if helpful[1] != 0:
        helpfulness = 1.0 * helpful[0] / helpful[1]
        return str(helpfulness)
    else:
        return str(0)

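As I understand it, the code above creates a "product" folder containing txt files named after asin codes. Each txt file stores the reviews, the reviewerID of each review, and their helpfulness scores.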
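
To check that understanding, one of the generated files can be read back. The following is only a sketch, assuming the tab-separated layout "reviewText\treviewerID\thelpful" described in the docstring; load_product_file is a hypothetical helper, not part of the given code.

```python
def load_product_file(path):
    """Return a list of (review_text, reviewer_id, helpfulness) tuples."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                # Assumption: a review containing its own tab or newline
                # breaks the layout, so such lines are skipped.
                continue
            text, reviewer_id, helpful = parts
            rows.append((text, reviewer_id, float(helpful)))
    return rows
```

Each tuple then holds the review text, the reviewer ID, and the helpfulness ratio as a float.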

read_reviewers_data("reviews_Electronics_5.json.gz")
class TaggedReviewDocument(object):
    '''
    This class saves all product and review information in its dictionaries and provides an iterator of TaggedDocument
        objects that can be used for a Doc2Vec model
    '''
    def __init__(self, dirname):
        self.dirname = dirname
        self.helpfulness = {}  # key:reviewerID value:helpfulness
        self.product = {}      # key:asin value:reviewerID
        self.asin = []

    def __iter__(self):
        for filename in os.listdir(self.dirname):
            asin_code = filename[:-4] #delete ".txt"
            self.product[asin_code] = []
            self.asin.append(asin_code)
            # Close each file after reading it
            with open(os.path.join(self.dirname, filename)) as f:
                for line in f:
                    # Strip the newline so the helpfulness tag has no trailing "\n"
                    line_content = line.rstrip("\n").split("\t")
                    self.product[asin_code].append(line_content[1])
                    self.helpfulness[line_content[1]] = float(line_content[2])
                    # clean_line is not defined in this post; it is expected to return a list of tokens
                    yield TaggedDocument(clean_line(line_content[0]), [line_content[1], line_content[2]])
documents = TaggedReviewDocument("product")
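
The class calls clean_line, which the posted code never defines. Here is a minimal sketch of what such a helper might look like, assuming it only needs to lowercase the text and split it into word tokens. The namedtuple is a stand-in with the same two fields (words, tags) as gensim's TaggedDocument so the sketch runs without gensim; tagged_from_line is likewise hypothetical.

```python
import re
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which has the fields `words` and `tags`.
TaggedDoc = namedtuple("TaggedDoc", "words tags")

_TOKEN = re.compile(r"[a-z0-9']+")


def clean_line(text):
    """Hypothetical tokenizer: lowercase the text and return word tokens."""
    return _TOKEN.findall(text.lower())


def tagged_from_line(line):
    """Turn one 'text<TAB>reviewerID<TAB>helpful' line into a tagged document."""
    text, reviewer_id, _helpful = line.rstrip("\n").split("\t")
    # Assumption: only the reviewer ID is used as the tag here, so that
    # every document vector corresponds to one reviewer.
    return TaggedDoc(words=clean_line(text), tags=[reviewer_id])
```

Each object yielded by the class pairs a token list with its tags in the same way. Note that the posted class also adds the helpfulness string as a second tag, so reviews with equal helpfulness would share that tag's vector.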

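I am a bit unsure what the "TaggedReviewDocument" class does here. It seems to create a document object that holds dictionaries of keys and values. The first task is to create a doc2Vec model. My code is: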

from gensim.models.doc2vec import TaggedDocument, Doc2Vec

model_v = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

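The second task is to find the top 5 helpful reviews of the same product whose similarity score is above 0.8. I think the similarity score is based on the similarity of the review texts, and the matches are then ranked by helpfulness score. However, I have no idea how to call the most_similar method of the gensim model with the documents above. I also don't know how to extract the review text from the documents.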

def find_similar_reviews(asin,reviewer_id):
    '''
    If one review is similar to the specific review and it is very helpful, save it to a list
    Args: 
        asin: product asin
        reviewer_id: the specific review
    Returns:
        list of reviewer id
        
    '''
    result = []
    #
    
    return result

Any help would be greatly appreciated.
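
For reference, one possible shape of the second task, as a sketch only. It assumes gensim 4.x, where the trained document vectors are available as model_v.dv and most_similar(tag, topn=...) returns (tag, cosine similarity) pairs sorted from most to least similar. The function name and the extra parameters are hypothetical; the stub in the question takes only asin and reviewer_id.

```python
def find_similar_reviews_sketch(asin, reviewer_id, doc_vectors, product,
                                helpfulness, threshold=0.8, topn=5):
    """Return up to `topn` reviewer IDs for `asin`, most helpful first."""
    same_product = set(product.get(asin, []))
    # Request every tag, then filter down to the wanted ones.
    neighbours = doc_vectors.most_similar(reviewer_id, topn=len(doc_vectors))
    candidates = [
        tag for tag, score in neighbours
        if score > threshold and tag in same_product and tag != reviewer_id
    ]
    # Rank the sufficiently similar reviews by helpfulness, highest first.
    candidates.sort(key=lambda tag: helpfulness.get(tag, 0.0), reverse=True)
    return candidates[:topn]


# Assumed call once the model has been trained on `documents`:
# find_similar_reviews_sketch(asin, reviewer_id, model_v.dv,
#                             documents.product, documents.helpfulness)
```

Two caveats apply under the posted tagging scheme. A reviewer who reviewed several products has a single shared vector and a single helpfulness entry, and helpfulness strings are tags too, which is why the sketch keeps only tags listed under the given asin.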

python nlp word2vec similarity