Remove stop words and keep only nouns in pandas

Question

I am trying to extract the top words by date, as follows:

df.set_index('Publishing_Date').Quotes.str.lower().str.extractall(r'(\w+)')[0].groupby('Publishing_Date').value_counts().groupby('Publishing_Date')

on the following DataFrame:

import pandas as pd 

# initialize 
data = [['20/05', "So many books, so little time." ], ['20/05', "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid." ], ['19/05', 
"Don't be pushed around by the fears in your mind. Be led by the dreams in your heart."], ['19/05', "Be the reason someone smiles. Be the reason someone feels loved and believes in the goodness in people."], ['19/05', "Do what is right, not what is easy nor what is popular."]] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Publishing_Date', 'Quotes'])

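As you can see, there are many stop words ("the", "an", "a", "be", ...) that I would like to remove to get a better selection. My goal is to find the keywords, i.e. patterns, that are common by date, so I am more interested in nouns than in verbs.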

Any ideas on how to remove the stop words and keep only the nouns?

Edit

Expected output (based on the results of Vaibhav Khandelwal's answer below):

Publishing_Date    Quotes    Nouns
20/05              ....      books, time, person, gentleman, lady, novel
19/05              ....      fears, mind, dreams, heart, reason, smiles

I only need to extract the nouns ("reason" should come out as more frequent, so the nouns would be ordered by frequency).

I think nltk.pos_tag should be useful here, keeping the tokens whose tag is NN.

Answer 1

Here is how you can remove stop words from the text:

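import nltk
from nltk.corpus import stopwords

# The stop word corpus has to be downloaded once: nltk.download('stopwords')

def remove_stopwords(text):
    stop_words = stopwords.words('english')
    fresh_text = []

    # Lowercase the text, split on whitespace, and keep the non-stop words
    for i in text.lower().split():
        if i not in stop_words:
            fresh_text.append(i)

    return ' '.join(fresh_text)

df['text'] = df['Quotes'].apply(remove_stopwords)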

Note: if you want to remove additional words, append them explicitly to the stop word list.
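
As a rough illustration of the note above, here is a minimal sketch of a variant that accepts extra stop words. The small BASE_STOP_WORDS set and the extra_stop_words parameter are assumptions made for this example (in the answer's code the base list would come from stopwords.words('english')). The sketch also strips surrounding punctuation, which a plain split() leaves attached to words such as "books,".

```python
import string

# Hypothetical, hand-written stop word set used only for illustration;
# in the answer's code this would be stopwords.words('english').
BASE_STOP_WORDS = {"the", "a", "an", "be", "so", "in", "by", "your"}

def remove_stopwords(text, extra_stop_words=()):
    """Drop stop words from text, ignoring case and surrounding punctuation."""
    stop_words = BASE_STOP_WORDS | set(extra_stop_words)
    kept = []
    for word in text.lower().split():
        # "books," -> "books", so punctuation does not hide a match
        word = word.strip(string.punctuation)
        if word and word not in stop_words:
            kept.append(word)
    return " ".join(kept)

cleaned = remove_stopwords("So many books, so little time.", extra_stop_words=["many"])
print(cleaned)  # books little time
```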

For the other half, you can add another function to extract the nouns:

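def extract_noun(text):
    # word_tokenize and pos_tag need NLTK's tokenizer and tagger data,
    # downloaded once with nltk.download (resource names vary by NLTK version)
    token = nltk.tokenize.word_tokenize(text)
    result = []
    for i in nltk.pos_tag(token):
        # i is a (word, tag) pair; noun tags start with 'NN'
        if i[1].startswith('NN'):
            result.append(i[0])

    return ', '.join(result)

df['NOUN'] = df['text'].apply(extract_noun)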
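
The startswith('NN') check accepts all four Penn Treebank noun tags: NN, NNS, NNP and NNPS. If proper nouns are not wanted, the filter can be narrowed. The sketch below shows the idea on hand-written (token, tag) pairs; the pairs and the keep_nouns helper are assumptions for illustration, not real tagger output.

```python
# Hand-written (token, tag) pairs standing in for nltk.pos_tag output;
# the tags are illustrative assumptions, not real tagger output.
tagged = [
    ("many", "JJ"),
    ("books", "NNS"),
    ("little", "JJ"),
    ("time", "NN"),
    ("London", "NNP"),
]

def keep_nouns(tagged_tokens, include_proper=True):
    """Return tokens tagged as nouns (NN, NNS, and optionally NNP, NNPS)."""
    noun_tags = {"NN", "NNS"}
    if include_proper:
        noun_tags |= {"NNP", "NNPS"}
    return [token for token, tag in tagged_tokens if tag in noun_tags]

print(keep_nouns(tagged))                        # ['books', 'time', 'London']
print(keep_nouns(tagged, include_proper=False))  # ['books', 'time']
```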

The resulting DataFrame then has a NOUN column holding the comma-separated nouns of each quote.
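
To get from a comma-separated noun column to nouns ranked by frequency per date, as the question asks, something along these lines might work. This is only a sketch: nouns_df and its values are made up for illustration and merely mimic the shape of the NOUN column produced by extract_noun.

```python
import pandas as pd

# Hypothetical NOUN column, shaped like the output of extract_noun
# (comma-separated nouns); the values are made up for illustration.
nouns_df = pd.DataFrame({
    "Publishing_Date": ["20/05", "20/05", "19/05", "19/05"],
    "NOUN": ["books, time", "person, novel, time", "reason, heart", "reason, people"],
})

# One row per (date, noun), then count occurrences within each date.
counts = (
    nouns_df.assign(NOUN=nouns_df["NOUN"].str.split(", "))
    .explode("NOUN")
    .groupby("Publishing_Date")["NOUN"]
    .value_counts()
)

# value_counts sorts by frequency within each date, so head(1)
# keeps the most frequent noun of every date.
top_noun = counts.groupby(level="Publishing_Date").head(1)
print(top_noun)
```

Replacing head(1) with head(n) keeps the n most frequent nouns per date instead.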
