I have some code (shown below) that counts the number of occurrences of a word for each Location. My problem is that it counts every instance of the word.
For example, this is the output I want, but the code below counts all occurrences of "help", including those inside "helping" and "helped":
| tidytext2                      | Location | occurrences |
|--------------------------------|----------|-------------|
| she used to help me            | Aus      | 1           |
| help is on the way             | UK       | 1           |
| Helping is a kind gift         | UK       | 0           |
| She helped me when I needed it | Japan    | 0           |
| Why dont u help me?            | SA       | 1           |
| Help me! Im hungry help        | Rwanda   | 2           |
words = [i[0] for i in pos_freq.most_common()]
for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(i)

funs = {i: 'sum' for i in words}
groupedpos = positivedf.groupby('Location').agg(funs)
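For reference, here is a minimal, self-contained reproduction of the overcounting (the DataFrame is rebuilt from the example table above): `str.count` does a plain substring/regex match, so the "help" inside "helped" is counted too.

```python
import pandas as pd

positivedf = pd.DataFrame({
    "tidytext2": [
        "she used to help me",
        "help is on the way",
        "Helping is a kind gift",
        "She helped me when I needed it",
        "Why dont u help me?",
        "Help me! Im hungry help",
    ],
    "Location": ["Aus", "UK", "UK", "Japan", "SA", "Rwanda"],
})

# Plain str.count matches substrings, so "helped" also contributes a hit
positivedf["help"] = positivedf.tidytext2.str.count("help")
print(positivedf[["tidytext2", "help"]])
# "She helped me when I needed it" gets 1 even though the exact word "help" is absent
```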
I used the code below to get `pos_freq.most_common()`. It returns:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import string

def process_text(text):
    tokens = []
    for line in text:
        toks = tokenizer.tokenize(line)
        toks = [t.lower() for t in toks if t.lower() not in stopwords_list]
        tokens.extend(toks)
    return tokens

tokenizer = TweetTokenizer()
punct = list(string.punctuation)
stopwords_list = stopwords.words('english') + punct

pos_lines = list(positivedf.tidytext2)
pos_tokens = process_text(pos_lines)
pos_freq = nltk.FreqDist(pos_tokens)
pos_freq.most_common()
[('help', 7)]
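Note that the tokenizer keeps whole tokens, so "helping" and "helped" end up as separate entries in the frequency distribution rather than being folded into "help". Here is a dependency-free sketch of that idea using `collections.Counter` in place of `nltk.FreqDist`, with a crude whitespace tokenizer and a small hand-picked stopword list standing in for `TweetTokenizer` and `stopwords.words('english')`; the counts come from the six example rows, so they need not match the full dataset.

```python
from collections import Counter

lines = [
    "she used to help me",
    "help is on the way",
    "Helping is a kind gift",
    "She helped me when I needed it",
    "Why dont u help me?",
    "Help me! Im hungry help",
]
# small stand-in for stopwords.words('english')
stopwords_list = {"she", "used", "to", "me", "is", "on", "the", "a",
                  "when", "i", "it", "why", "u", "im"}

tokens = []
for line in lines:
    # crude tokenizer stand-in: strip punctuation, lowercase, split on whitespace
    toks = line.replace("!", " ").replace("?", " ").lower().split()
    tokens.extend(t for t in toks if t not in stopwords_list)

freq = Counter(tokens)
# "help", "helping", and "helped" are distinct keys, so most_common()
# reports them separately rather than merging them into one count
```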
You can handle this with a regex that only matches the word when it is not preceded or followed by another non-whitespace character:
for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(r'(?<!\S)' + i + r'(?!\S)')
If you want the match to be case-insensitive:
for i in words:
    positivedf[i] = positivedf.tidytext2.str.count(r'(?i)(?<!\S)' + i + r'(?!\S)')
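Putting it together on the example rows (a sketch assuming pandas; the DataFrame is rebuilt from the table in the question, and `words` is hard-coded to `["help"]` for illustration):

```python
import pandas as pd

positivedf = pd.DataFrame({
    "tidytext2": [
        "she used to help me",
        "help is on the way",
        "Helping is a kind gift",
        "She helped me when I needed it",
        "Why dont u help me?",
        "Help me! Im hungry help",
    ],
    "Location": ["Aus", "UK", "UK", "Japan", "SA", "Rwanda"],
})

words = ["help"]
for i in words:
    # (?<!\S) / (?!\S): the match may not touch another non-whitespace
    # character, so "helping" and "helped" are rejected; (?i) makes the
    # match case-insensitive so "Help" still counts
    positivedf[i] = positivedf.tidytext2.str.count(r"(?i)(?<!\S)" + i + r"(?!\S)")

groupedpos = positivedf.groupby("Location").agg({i: "sum" for i in words})
print(groupedpos)
```

The per-row counts now match the `occurrences` column in the question's table.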