Remove all stop words defined in one file from the text of another file (Python)

Question

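I have two text files:

stopwords.txt -> contains the stop words, one per line
text.txt -> a large document

I am trying to remove every occurrence of a stop word (any word in stopwords.txt) from text.txt without using NLTK (this is a school assignment).

How would I go about this? Here is my code so far.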

import re

with open('text.txt', 'r') as f, open('stopwords.txt','r') as st:
    f_content = f.read()
    #splitting text.txt by non alphanumeric characters
    processed = re.split('[^a-zA-Z]', f_content)

    st_content = st.read()
    #splitting stopwords.txt by new line
    st_list = re.split('\n', st_content)
    #print(st_list) to check it was working

    #what I'm trying to do is: traverse through the text. If stopword appears, 
    #remove it. otherwise keep it. 
    for word in st_list:
        f_content = f_content.replace(word, "")
        print(f_content)

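But when I run the code, it takes forever to output anything, and when it finally does, it just prints the entire text file. (I am new to Python, so if I am doing something fundamentally wrong, please tell me!)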

Answer 1

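I know Python is really good for this kind of thing (and many others), but if your text.txt is very large, I would try the old, ugly, and powerful command-line tool sed.

Try something like:

sed -f stopwords.sed text.txt > output_file.txt

In the stopwords.sed file, each stop word must be on its own line in the following format:

s|\<xxxxx\>||g

where 'xxxxx' is the stop word itself.

s|\<the\>||g

The line above removes every whole-word occurrence of 'the' (without the single quotes).

It is worth a try.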
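The stopwords.sed file does not have to be written by hand. Here is a minimal sketch that generates it, assuming stopwords.txt holds one plain alphabetic word per line as in the question. The function names build_sed_script and write_sed_script are made up for illustration, and the \< and \> word-boundary markers are, as far as I know, a GNU sed extension, so other sed versions may need a different syntax.

```python
def build_sed_script(stopwords):
    """Return sed script text with one whole-word deletion command per stop word.

    Assumes each stop word is plain letters; a word containing '|' or
    regex metacharacters would need escaping first.
    """
    commands = []
    for word in stopwords:
        word = word.strip()
        if not word:
            continue  # skip blank lines
        # \< and \> anchor the match to the start and end of a word
        commands.append(r"s|\<" + word + r"\>||g")
    return "\n".join(commands) + "\n"


def write_sed_script(stopwords_path, script_path):
    """Read stop words from stopwords_path and write the sed script to script_path."""
    with open(stopwords_path) as src, open(script_path, "w") as dst:
        dst.write(build_sed_script(src))
```

Calling write_sed_script("stopwords.txt", "stopwords.sed") once would produce the script that the sed command above expects.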

Answer 2

Here is what I use when I need to remove English stop words. I also usually use the nltk corpus rather than my own file for the stop words.

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
ps = PorterStemmer()

## Remove stop words
# `text` is assumed to be a list of word tokens at this point
stops = set(stopwords.words("english"))
text = [ps.stem(w) for w in text if w not in stops and len(w) >= 3]
text = list(set(text))  # remove duplicates (word order is not preserved)
text = " ".join(text)

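For your particular case, I would do something like:

stops = list_of_words_from_file

Let me know whether this answers your question; I was not sure whether the problem was reading from the file or the stemming.

Edit: to remove all stop words defined in one file from the text of another file, we can use str.replace()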

for word in st_list:
    f_content = f_content.replace(word, "")  # replace() needs the replacement string
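One caveat with str.replace() is that it matches substrings, so removing 'the' would also turn 'other' into 'or'. Below is a sketch of a whole-word variant using the re module with \b word boundaries. The helper name remove_stopwords is hypothetical, and case-insensitive matching is an assumption on my part rather than something the question asks for.

```python
import re


def remove_stopwords(text, stopwords):
    """Delete whole-word occurrences of each stop word from text."""
    # Escape each word so it is matched literally; drop empty entries
    words = [re.escape(w) for w in stopwords if w]
    if not words:
        return text
    # One compiled alternation scans the text once instead of once per word
    pattern = re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)
    return pattern.sub("", text)
```

The removed words leave their surrounding spaces behind, so a final " ".join(result.split()) can be used if runs of whitespace should be collapsed.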

Answer 3

I think this approach works, but it is very slow, so if anyone has suggestions on how to make it more efficient, I would really appreciate it!

import re
from stemming.porter2 import stem as PT  # third-party "stemming" package


with open('text.txt', 'r') as f, open('stopwords.txt', 'r') as st:

    f_content = f.read()
    # split on non-letters, then lowercase and stem each token
    processed = re.split('[^a-zA-Z]', f_content)
    processed = [x.lower() for x in processed]
    processed = [PT(x) for x in processed]
    #print(processed)

    st_content = st.read()
    # a set gives fast membership tests
    st_list = set(st_content.split())

    # keep non-empty tokens that are not stop words
    clean_text = [x for x in processed if x and x not in st_list]
    print(clean_text)
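One possible way to speed this up, sketched under the assumption that stemming is not actually required by the assignment (the stemming call is likely the expensive step): read the text line by line, pull out words with a precompiled regex so no empty tokens are produced, and test each word against a set. The names WORD_RE, load_stopwords, and filter_tokens are illustrative only.

```python
import re

# Precompiled pattern for runs of ASCII letters
WORD_RE = re.compile(r"[a-zA-Z]+")


def load_stopwords(lines):
    """Build a set of lowercased stop words from an iterable of lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def filter_tokens(lines, stopwords):
    """Yield lowercased words from lines that are not in the stopwords set."""
    for line in lines:
        for token in WORD_RE.findall(line):
            token = token.lower()
            if token not in stopwords:
                yield token
```

Because open file objects iterate over lines, both functions can be given the files directly, for example load_stopwords(st) and then " ".join(filter_tokens(f, stops)), so the large document is processed a line at a time.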
