How to group text data based on document similarity?

Votes: 1 · Answers: 2

Consider the following dataframe:

df = pd.DataFrame({'Questions': ['What are you doing?', 'What are you doing tonight?',
                                 'What are you doing now?', 'What is your name?',
                                 'What is your nick name?', 'What is your full name?',
                                 'Shall we meet?', 'How are you doing?']})
                     Questions
0          What are you doing?
1  What are you doing tonight?
2      What are you doing now?
3           What is your name?
4      What is your nick name?
5      What is your full name?
6               Shall we meet?
7           How are you doing?

How do I group the dataframe by similar questions? That is, how can I obtain groups like the following?

for _, i in df.groupby('similarity')['Questions']:
    print(i,'\n')
6    Shall we meet?
Name: Questions, dtype: object

3         What is your name?
4    What is your nick name?
5    What is your full name?
Name: Questions, dtype: object

0            What are you doing?
1    What are you doing tonight?
2        What are you doing now?
7             How are you doing?
Name: Questions, dtype: object

A similar question was asked here, but it was not clear enough, so that question does not answer this one.

python pandas group-by nltk similarity
2 Answers
2 votes

This is a rather heavyweight approach: compute a normalized similarity score between every pair of elements in the series, then group by the resulting list of similarity scores (converted to a string), i.e.:

import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

def convert_tag(tag):
    """Convert a Penn Treebank POS tag to a WordNet POS tag."""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'),
              Synset('friend.n.01')]
    """

    synsetlist = []
    tokens = nltk.word_tokenize(doc)
    pos = nltk.pos_tag(tokens)
    for token, tag in pos:
        try:
            synsetlist.append(wn.synsets(token, convert_tag(tag))[0])
        except IndexError:  # no synset found for this word/tag combination
            continue
    return synsetlist

def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum all of the largest similarity values and normalize the total by dividing it by the number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """

    highscores = []
    for synset1 in s1:
        highest_yet = 0
        for synset2 in s2:
            # path_similarity returns None when no path connects the synsets
            simscore = synset1.path_similarity(synset2)
            if simscore is not None and simscore > highest_yet:
                highest_yet = simscore

        if highest_yet > 0:
            highscores.append(highest_yet)

    return sum(highscores) / len(highscores) if highscores else 0

def document_path_similarity(doc1, doc2):
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)
    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2


def similarity(x, df):
    """Similarity scores of question x against every question in df."""
    return [document_path_similarity(x, q) for q in df['Questions']]

With the methods defined above, we can now do:

df['similarity'] = df['Questions'].apply(lambda x: similarity(x, df)).astype(str)

for _, i in df.groupby('similarity')['Questions']:
    print(i,'\n')

Output:

6    Shall we meet?
Name: Questions, dtype: object

3         What is your name?
4    What is your nick name?
5    What is your full name?
Name: Questions, dtype: object

0            What are you doing?
1    What are you doing tonight?
2        What are you doing now?
7             How are you doing?
Name: Questions, dtype: object
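The `.astype(str)` step is what makes the groupby possible: a column of Python lists is unhashable and cannot serve as a groupby key, while its string form is hashable and compares equal exactly when the underlying lists do. A minimal, self-contained sketch of the trick on toy data (not the WordNet scores):

```python
import pandas as pd

# Toy per-row score vectors standing in for the similarity lists.
df = pd.DataFrame({'scores': [[1.0, 0.5], [1.0, 0.5], [0.2, 1.0]]})

# groupby on the raw list column would raise TypeError (lists are
# unhashable); the stringified form groups identical vectors together.
df['key'] = df['scores'].astype(str)
groups = {key: list(grp.index) for key, grp in df.groupby('key')}
# groups == {'[0.2, 1.0]': [2], '[1.0, 0.5]': [0, 1]}
```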

This is not the best way to solve the problem, and it is quite slow. Any new approaches are highly appreciated.
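As one cheaper lexical alternative (a sketch of my own, not part of either answer): group questions whose lowercase token sets overlap heavily, measured by Jaccard similarity. The 0.5 threshold is an assumption tuned to this toy data:

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def group_questions(questions, threshold=0.5):
    """Greedily place each question into the first group containing a
    sufficiently similar member, else start a new group."""
    groups = []
    for q in questions:
        for g in groups:
            if any(jaccard(q, member) >= threshold for member in g):
                g.append(q)
                break
        else:
            groups.append([q])
    return groups

qs = ['What are you doing?', 'What are you doing tonight?',
      'What are you doing now?', 'What is your name?',
      'What is your nick name?', 'What is your full name?',
      'Shall we meet?', 'How are you doing?']
groups = group_questions(qs)
```

On the eight questions above this reproduces the three desired groups, with a far cheaper per-pair comparison than WordNet path similarity.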


0 votes

You should first sort all of the entries in the list/dataframe column and then run the similarity code over only n-1 rows, i.e. compare each row with the next element. If the two are similar, mark them as 1 or 0, and then parse that list, instead of comparing each row against every other element (n^2 comparisons).
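A sketch of this idea, using `difflib.SequenceMatcher` as a stand-in similarity measure (the answer does not specify one) and a hypothetical 0.8 threshold: after sorting, each question is compared only with its predecessor, and a new group label starts wherever the similarity drops.

```python
from difflib import SequenceMatcher

import pandas as pd

def group_sorted(questions, threshold=0.8):
    """Label sorted questions with group ids using only n-1 comparisons."""
    ordered = sorted(questions)
    labels = [0]
    for prev, cur in zip(ordered, ordered[1:]):
        # Compare each question only with its neighbour in sorted order.
        sim = SequenceMatcher(None, prev, cur).ratio()
        labels.append(labels[-1] if sim >= threshold else labels[-1] + 1)
    return pd.Series(labels, index=ordered)

groups = group_sorted(['What are you doing?', 'What are you doing now?',
                       'What is your name?', 'Shall we meet?'])
```

This only works when similar questions end up adjacent after sorting, which holds for questions sharing a long common prefix but not in general.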

© www.soinside.com 2019 - 2024. All rights reserved.