How to detect stop words in each row and count them


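I have a CSV file in which each row contains a sentence. For each row, I want to find out whether it contains any stop words: return 1 if it does and 0 if it does not. When the result is 1, I also want to count the stop words in that row. Below is my code so far. It only gives me the stop words across the whole CSV, not per row.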

import pandas as pd
import csv
import nltk
from nltk.tag import pos_tag
from nltk import sent_tokenize,word_tokenize
from collections import Counter
from nltk.corpus import stopwords
nltk.download('stopwords')

top_N = 10

news=pd.read_csv("split.csv",usecols=['STORY'])

# regex=False replaces the literal "|" regardless of the pandas version's default
newss = news.STORY.str.lower().str.replace('|', ' ', regex=False).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(newss)
word_dist = nltk.FreqDist(words)

stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)


rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)

This is a truncated sample of the CSV file:

id    STORY
0     In the bag
1     What is your name
2     chips, bag

I want to save the output to a new CSV file. The expected output should look like this:

id    STORY                exist     How many
0     In the bag            1           2
1     What is your name     1           4
2     chips bag             0           0
python pandas stop-words
1 Answer
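import nltk
import pandas as pd

nltk.download('stopwords')  # stop word lists
nltk.download('punkt')      # tokenizer data for word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)

df = pd.DataFrame({"story":['In the bag', 'what is your name', 'chips, bag']})
stopwords = nltk.corpus.stopwords.words('english')
# Lowercase each sentence, replace commas with spaces, then split it into tokens
df['clean'] = df['story'].apply(lambda x : nltk.tokenize.word_tokenize(x.lower().replace(',', ' ')))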
df
    story               clean   
0   In the bag          [in, the, bag]
1   what is your name   [what, is, your, name]
2   chips, bag          [chips, bag]

df['clean'] = df.clean.apply(lambda x : [y  for y in x if y in stopwords])
df['exist'] = df.clean.apply(lambda x : 1 if len(x) > 0 else 0)
df['how many'] = df.clean.apply(lambda x : len(x)) 

df

    story               clean              exist    how many
0   In the bag          [in, the]              1    2
1   what is your name   [what, is, your]       1    3
2   chips, bag          []                     0    0

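Note: adjust the text cleaning (here, replacing commas with spaces) to suit your data. You can drop the clean column, or keep it if you need it later. Row 1 yields 3 rather than the 4 shown in the question's expected output, because "name" is not in NLTK's English stop word list.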
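The question also asks for the result to be saved to a new CSV file, which the snippet above does not cover. Below is a minimal sketch of the whole flow under a few assumptions. `STOP_WORDS` is a small hand-picked subset standing in for `nltk.corpus.stopwords.words('english')`, and a simple regular expression stands in for `word_tokenize`, so the example runs without downloading NLTK data. `csv_text` is an in-memory stand-in for `split.csv`, and `count_stop_words` is a hypothetical helper name.

```python
import io
import re

import pandas as pd

# Assumption: a tiny subset standing in for nltk.corpus.stopwords.words('english').
STOP_WORDS = {"in", "the", "what", "is", "your", "a", "an", "of", "and", "to"}

# Assumption: in-memory text standing in for split.csv from the question.
csv_text = 'id,STORY\n0,In the bag\n1,What is your name\n2,"chips, bag"\n'


def count_stop_words(text):
    """Return how many tokens in `text` are stop words."""
    # Simple regex tokenizer used here in place of nltk's word_tokenize.
    tokens = re.findall(r"[a-z']+", str(text).lower())
    return sum(token in STOP_WORDS for token in tokens)


news = pd.read_csv(io.StringIO(csv_text))
news["how many"] = news["STORY"].apply(count_stop_words)
news["exist"] = (news["how many"] > 0).astype(int)
news = news[["id", "STORY", "exist", "how many"]]

# Write to an in-memory buffer; pass a path such as "output.csv" to save to disk.
out = io.StringIO()
news.to_csv(out, index=False)
print(out.getvalue())
```

With the real NLTK list, swapping `STOP_WORDS` for `set(nltk.corpus.stopwords.words('english'))` should be enough. A set is used so that each membership test stays cheap on large files.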
