Is there a way to remove all words from a text that are not in another text?

Question

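I have a file with a lot of reviews. I'm using TfidfVectorizer to create a bag of words, BW. What I want is for BW to contain only words that also appear in another document, D.

Document D is a document of positive words. I'm using this list of positive words to improve my model. In other words, I only want to count the positive words.

Is there a way to do this?

Thanks

I wrote some code that does the job, as follows. train_x is a pandas DataFrame with the reviews.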

pos_words = []
neg_words = []
sentiment_words = []

# creating lists based on the files (one word per line)
with open("positive-words.txt") as pos_file:
    for ln in pos_file:
        pos_words.append(ln.strip())
with open("negative-words.txt") as neg_file:
    for ln in neg_file:
        neg_words.append(ln.strip())

# adding all the positive and negative words together
# (extend, not append, so the result is one flat list of words)
sentiment_words.extend(pos_words)
sentiment_words.extend(neg_words)

# keeping only the words that are in the positive-word list
filtered_res = []
for r in train_x:
    keep = []
    parts = r.split()
    for p in parts:
        if p in pos_words:
            keep.append(p)
    # turning the list of kept words back into text
    filtered_res.append(" ".join(keep))

train_x = filtered_res

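Although this does what I need, I know the code isn't the best. I also tried to find a standard function in Python to do this.

PS: Python has so many features that I always wonder whether it could do the same thing with less code than I wrote.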

Answer 1

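Here is a more optimized version, because:

it doesn't do a linear search for p in pos_words inside the loop
it vectorizes the loop (more pythonic)
instead of keeping a list for each r, it uses a generator version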


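import re

# set membership is O(1) on average, unlike a lookup in a list
pos_words_set = set(pos_words)

def filter_words(text):
    # [A-Za-z0-9]+ matches alphanumeric tokens; use [A-Za-z]+ to avoid numbers
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # generator expression: no intermediate list is kept per review
    return " ".join(t for t in tokens if t in pos_words_set)

# assumes train_x is a pandas Series of review strings
train_x = train_x.apply(filter_words)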
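
Since the question asks for a standard function: another route, sketched here under assumptions, is to skip the manual filtering and pass the word list to TfidfVectorizer through its vocabulary parameter, so that only those terms become features. The sample reviews and the three-word list are made up for illustration; in practice pos_words would be loaded from positive-words.txt.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the contents of positive-words.txt
pos_words = ["good", "great", "excellent"]

# Hypothetical sample reviews
reviews = [
    "The food was good and the service was great",
    "Terrible experience, never again",
    "Excellent excellent value",
]

# With a fixed vocabulary, every other token is ignored and the
# columns follow the order of the list that was passed in.
vectorizer = TfidfVectorizer(vocabulary=pos_words)
bw = vectorizer.fit_transform(reviews)

print(vectorizer.vocabulary_)  # term -> column index
print(bw.toarray())            # one row per review, one column per positive word
```

The vectorizer lowercases the text by default, so the word list should be lowercase too. The list must not contain duplicates, which matters if the positive and negative lists are merged; deduplicating first (for example with sorted(set(...))) avoids that.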
