为什么在使用Python的wordcloud库时，停顿词没有被排除在word cloud中？

Question

我想将'The'、'They'和'My'排除在我的word cloud中。我使用了如下的python库 "wordcloud"，并在STOPWORDS列表中添加了这3个额外的停顿字，但wordcloud仍然包含了它们。我需要怎么改，才能把这3个词排除在外？

我导入的库是。

import numpy as np
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

我试着在下面的STOPWORDS集合中添加元素但是，即使单词被成功添加，wordcloud仍然显示我添加到STOPWORDS集合中的3个单词。

len(STOPWORDS)输出： 192

然后我就跑了

STOPWORDS.add('The')
STOPWORDS.add('They')
STOPWORDS.add('My')

然后我就跑了

len(STOPWORDS)输出： 195

我运行的是python 3.7.3版本。

我知道我可以在运行wordcloud之前修改文本输入以删除这3个单词（而不是尝试修改WordCloud的STOPWORDS设置），但我想知道是否WordCloud存在一个错误，或者我没有正确使用STOPWORDS更新？

Answer 1

Wordcloud的默认值是 collocations=True因此，两个相邻词的频繁短语会被包含在云中--而且对于您的问题来说，重要的是，对于搭配，去除停顿词是不同的，因此，例如 "Thank you "是一个有效的搭配，可能会出现在生成的云中，即使 "you "在默认的停顿词中。只包含停顿词的搭配是删除。

这个听起来不无道理的理由是，如果在建立搭配列表之前删除停顿词，那么例如 "thank you very much "就会提供 "thank very "作为搭配，这绝对不是我想要的。

所以，为了让你的停顿词可以或许按照你的预期工作，即云中完全不出现停顿词，你可以使用 collocations=False 像这样。

my_wordcloud = WordCloud(
    stopwords=my_stopwords,
    background_color='white', 
    collocations=False, 
    max_words=10).generate(all_tweets_as_one_string)

UPDATE:

当搭配为False时，停止词都是小写的，以便在删除它们时与小写的文本进行比较--所以不需要添加 "The "等。
如果搭配为True（默认），当停止词是小写的时候，当寻找所有停止词的搭配来删除它们时，源文本不会被小写，所以例如g The 的文字不会被删除，而 the 被删除了--这就是为什么 @Balaji Ambresh 的代码可以工作，你会看到云中没有盖子。这可能是Wordcloud的一个缺陷，不确定。不过在停止词中添加e.g. The 到 stopwords 不会影响这一点，因为 stopwords 始终是小写的，不管是否有拼写 TrueFalse

这些在源代码中都可以看到:-)

例如，在默认的 collocations=True 我明白了。

而随着 collocations=False 我得到了。

代码：

from wordcloud import WordCloud
from matplotlib import pyplot as plt


text = "The bear sat with the cat. They were good friends. " + \
        "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat " + \
        "there enjoying the view. You should have seen it. The view was absolutely lovely. " + \
            "It was such a lovely day. The bear was loving it too."

cloud = WordCloud(collocations=False,
        background_color='white',
        max_words=10).generate(text)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Answer 2

pip install nltk

不要忘记安装停止符

python
>>> import nltk
>>> nltk.download('stopwords')

试试这个。

from wordcloud import WordCloud
from matplotlib import pyplot as plt

from nltk.corpus import stopwords

stopwords = set(stopwords.words('english'))

text = "The bear sat with the cat. They were good friends. " + \
        "My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat " + \
        "there enjoying the view. You should have seen it. The view was absolutely lovely. " + \
            "It was such a lovely day. The bear was loving it too."
cloud = WordCloud(stopwords=stopwords,
        background_color='white',
        max_words=10).generate(text.lower())
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

为什么在使用Python的wordcloud库时，停顿词没有被排除在word cloud中？

问题描述投票：0回答：1

1个回答

最新问题

为什么在使用Python的wordcloud库时，停顿词没有被排除在word cloud中？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1