When counting a word in a loop and aggregating with pandas groupby.agg, how do I ignore other forms of that word?

Question (votes: 0, answers: 1)

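I have some code (shown below) that counts how many times each word occurs per Location. My problem is that it counts every instance of the word, including places where the word is only part of a longer word.

For example, the table below is what I want it to produce, but the code counts every occurrence of "help", including the ones inside "helping" and "helped".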

tidytext2                        | Location | occurrences
she used to help me              | Aus      | 1
help is on the way               | UK       | 1
Helping is a kind gift           | UK       | 0
She helped me when I needed it   | Japan    | 0
Why dont u help me?              | SA       | 1
Help me! Im hungry help          | Rwanda   | 2


words = [i[0] for i in pos_freq.most_common()]

for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(i)

funs = {i: 'sum' for i in words}
groupedpos = positivedf.groupby('Location').agg(funs)
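
The over-counting in the loop above can be reproduced with a minimal sketch. It assumes only the six sample rows from the table; the name `df` is made up for this illustration. Because `Series.str.count` interprets its argument as a regular expression, the bare pattern `help` also matches inside "helped". It is also case-sensitive, so "Helping" and the leading "Help" in the last row are not counted at all.

```python
import pandas as pd

# Sample rows copied from the table in the question (df is a hypothetical name).
df = pd.DataFrame({
    "tidytext2": [
        "she used to help me",
        "help is on the way",
        "Helping is a kind gift",
        "She helped me when I needed it",
        "Why dont u help me?",
        "Help me! Im hungry help",
    ],
    "Location": ["Aus", "UK", "UK", "Japan", "SA", "Rwanda"],
})

# The pattern "help" is matched anywhere in the string, including inside
# longer words such as "helped". Matching is case-sensitive by default.
df["help"] = df["tidytext2"].str.count("help")
print(df[["tidytext2", "help"]])
```

The "helped" row gets a count of 1 where 0 is wanted, and the Rwanda row gets 1 where 2 is wanted.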

I built pos_freq with the code below; the last line shows what pos_freq.most_common() returns.

import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
punct = list(string.punctuation)
# stopwords.words needs the NLTK stopwords corpus, e.g. nltk.download('stopwords')
stopwords_list = stopwords.words('english') + punct


def process_text(text):
    # Tokenize every line, lowercase the tokens, drop stopwords and punctuation
    tokens = []
    for line in text:
        toks = tokenizer.tokenize(line)
        toks = [t.lower() for t in toks if t.lower() not in stopwords_list]
        tokens.extend(toks)
    return tokens


pos_lines = list(positivedf.tidytext2)
pos_tokens = process_text(pos_lines)
pos_freq = nltk.FreqDist(pos_tokens)
pos_freq.most_common()
[('help', 7)]
pandas dataframe group-by nltk pandas-groupby
1 Answer
0 votes

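You need a regular expression for this, one that matches the word only when it is not directly adjacent to other non-whitespace characters: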

for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(r'(?<!\S)' + i + r'(?!\S)')
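
To see what these lookarounds accept and reject, here is a small sketch using only the standard `re` module. The sample strings and the `\b` comparison are assumptions added for illustration, not part of the answer. `(?<!\S)` and `(?!\S)` require whitespace or the start/end of the string on both sides, so a word that touches punctuation ("help!") is not counted, whereas `\b` word boundaries would count it.

```python
import re

word = "help"
# Whitespace-delimited match, as in the answer (re.escape added defensively).
ws_pattern = re.compile(r"(?<!\S)" + re.escape(word) + r"(?!\S)")
# Alternative shown only for comparison: word-boundary match.
wb_pattern = re.compile(r"\b" + re.escape(word) + r"\b")

# Hypothetical sample strings chosen to show the difference.
samples = ["help me", "helping", "helped", "please help!", "(help)"]

ws_counts = [len(ws_pattern.findall(s)) for s in samples]
wb_counts = [len(wb_pattern.findall(s)) for s in samples]

for s, ws, wb in zip(samples, ws_counts, wb_counts):
    print(f"{s!r}: whitespace-delimited={ws}, word-boundary={wb}")
```

Which behaviour is right depends on the data. If the word list comes from a tokenizer that splits punctuation off as separate tokens, the `\b` variant may agree better with the token counts.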

If you want the match to be case-insensitive:

for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(r'(?i)(?<!\S)' + i + r'(?!\S)')
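
Putting the pieces together, the following is a self-contained sketch rather than a definitive implementation. It assumes the six sample rows from the question and hard-codes `words = ["help"]` as a stand-in for the tokens taken from `pos_freq.most_common()`. Two details are additions that the answer does not use: `re.escape`, which guards against tokens containing regex metacharacters, and `flags=re.IGNORECASE`, which replaces the inline `(?i)`.

```python
import re

import pandas as pd

# Sample rows copied from the table in the question.
positivedf = pd.DataFrame({
    "tidytext2": [
        "she used to help me",
        "help is on the way",
        "Helping is a kind gift",
        "She helped me when I needed it",
        "Why dont u help me?",
        "Help me! Im hungry help",
    ],
    "Location": ["Aus", "UK", "UK", "Japan", "SA", "Rwanda"],
})

# Stand-in for: words = [i[0] for i in pos_freq.most_common()]
words = ["help"]

for w in words:
    # Count the word only when delimited by whitespace or the string edges.
    pattern = r"(?<!\S)" + re.escape(w) + r"(?!\S)"
    positivedf[w] = positivedf["tidytext2"].str.count(pattern, flags=re.IGNORECASE)

# Sum the per-row counts for each Location.
groupedpos = positivedf.groupby("Location")[words].sum()
print(groupedpos)
```

On these rows the per-row counts match the occurrences column the question asks for, and the two UK rows sum to 1.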