Why are stop words not excluded from the word cloud when using Python's wordcloud library?


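I want to exclude 'The', 'They' and 'My' from my word cloud. I am using the Python library "wordcloud" as shown below, and I added these 3 extra stop words to the STOPWORDS set, but the word cloud still includes them. What do I need to change to exclude these 3 words?

[screenshot of my code]

The libraries I imported are: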

import numpy as np
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

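I tried adding elements to the STOPWORDS set as shown below. However, even though the words were added successfully, the word cloud still shows the 3 words I added to the STOPWORDS set.

len(STOPWORDS) outputs: 192

Then I ran: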

STOPWORDS.add('The')
STOPWORDS.add('They')
STOPWORDS.add('My')

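Then I ran:

len(STOPWORDS) outputs: 195

I am running Python 3.7.3.

I know I can modify the text input to remove these 3 words before running wordcloud (rather than trying to modify WordCloud's STOPWORDS set), but I would like to know whether there is a bug in WordCloud, or whether I am not updating STOPWORDS correctly.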

python nlp word-cloud stop-words
Answer (2 votes)

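Wordcloud's default is collocations=True, so frequent phrases of two adjacent words are included in the cloud. Importantly for your problem, stop word removal works differently for collocations: for example, "Thank you" is a valid collocation and may appear in the generated cloud even though "you" is among the default stop words. Only collocations that consist entirely of stop words are removed.

The rationale for this, which sounds reasonable enough, is that if stop words were removed before building the list of collocations, then e.g. "thank you very much" would yield "thank very" as a collocation, which I definitely would not want.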
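The effect described above can be reproduced with a small pure-Python sketch. It does not use wordcloud's internals: the `bigrams` helper and the tiny `stop_words` set are hypothetical and exist only to show why filtering before pairing produces pairs that never occurred in the text.

```python
def bigrams(words):
    """Return adjacent word pairs, in order."""
    return list(zip(words, words[1:]))

stop_words = {"you", "much"}  # illustrative subset, not wordcloud's STOPWORDS
words = "thank you very much".split()

# Filtering first glues together words that were never adjacent.
filtered_first = bigrams([w for w in words if w not in stop_words])

# Pairing first keeps real neighbours; only pairs made up
# entirely of stop words would be dropped.
paired_first = [pair for pair in bigrams(words)
                if not all(w in stop_words for w in pair)]

print(filtered_first)  # [('thank', 'very')]
print(paired_first)    # [('thank', 'you'), ('you', 'very'), ('very', 'much')]
```

In this toy model, the pair-first approach still lets "thank you" through even though "you" is a stop word, which mirrors the behaviour seen in the cloud.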

So, to make your stop words work the way you presumably expect, i.e. no stop words appear in the cloud at all, you can use collocations=False like this:

my_wordcloud = WordCloud(
    stopwords=my_stopwords,
    background_color='white', 
    collocations=False, 
    max_words=10).generate(all_tweets_as_one_string)

UPDATE:

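  • With collocations=False, the stop words are all lowercased so that they can be compared with the lowercased text when removing them, so there is no need to add "The" etc.
  • With collocations=True (the default), the stop words are still lowercased, but when looking for collocations made up entirely of stop words in order to remove them, the source text is not lowercased. So, for example, The in the text is not removed while the is removed. That is why @Balaji Ambresh's code works, and you will see there are no capitals in the cloud. This may be a defect in Wordcloud, I'm not sure. However, adding e.g. The to stopwords does not affect this, because stopwords are always lowercased regardless of whether collocations is True or False.

All of this can be seen in the source code :-)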
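As a rough illustration of the case mismatch described in the bullets above, here is a simplified pure-Python model. It is not wordcloud's actual code: the helper `drop_stop_words` and its `lowercase_text` flag are hypothetical names made up for this sketch.

```python
def drop_stop_words(words, stop_words, lowercase_text):
    """Remove stop words; the stop words are always lowercased before comparing."""
    lowered_stops = {s.lower() for s in stop_words}
    kept = []
    for word in words:
        # Optionally lowercase the text side of the comparison as well.
        key = word.lower() if lowercase_text else word
        if key not in lowered_stops:
            kept.append(word)
    return kept

words = "The bear sat with the cat".split()
stops = {"The", "the", "with"}  # adding "The" changes nothing once lowercased

# Text compared as-is: capitalized "The" slips through.
print(drop_stop_words(words, stops, lowercase_text=False))  # ['The', 'bear', 'sat', 'cat']
# Text lowercased for the comparison: both "The" and "the" are removed.
print(drop_stop_words(words, stops, lowercase_text=True))   # ['bear', 'sat', 'cat']
```

Under this model, the only ways to get rid of "The" are to lowercase the text before comparing or to lowercase the input yourself, which is what the `text.lower()` workaround does.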

For example, with the default collocations=True I get:

[image: resulting word cloud]

And with collocations=False I get:

[image: resulting word cloud]

Code:

from wordcloud import WordCloud
from matplotlib import pyplot as plt


text = "The bear sat with the cat. They were good friends. " + \
        "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat " + \
        "there enjoying the view. You should have seen it. The view was absolutely lovely. " + \
            "It was such a lovely day. The bear was loving it too."

cloud = WordCloud(collocations=False,
        background_color='white',
        max_words=10).generate(text)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Answer (0 votes)

pip install nltk

Don't forget to download the stop words:

python
>>> import nltk
>>> nltk.download('stopwords')

Try this:

from wordcloud import WordCloud
from matplotlib import pyplot as plt

from nltk.corpus import stopwords

# Use a distinct name so the imported nltk `stopwords` module is not shadowed.
english_stopwords = set(stopwords.words('english'))

text = ("The bear sat with the cat. They were good friends. "
        "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat "
        "there enjoying the view. You should have seen it. The view was absolutely lovely. "
        "It was such a lovely day. The bear was loving it too.")

# Lowercasing the text lets capitalized words such as "The" match the lowercase stop words.
cloud = WordCloud(stopwords=english_stopwords,
                  background_color='white',
                  max_words=10).generate(text.lower())
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()