I am trying to remove stop words from a tab-delimited .txt file using the following code:
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

file = open('textposts_01.txt', encoding='UTF-8')
stop_words = set(stopwords.words('english'))
line = file.read()
words = line.split()
for r in words:
    if not r in stop_words:
        appendFile = open('textposts_02.txt', mode='a', encoding='UTF-8')
        appendFile.write(" "+r)
        appendFile.close()
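The code runs without errors, but when I look at the result, all the rows have been rewritten onto a single line. How can I preserve the columns while removing the stop words?
I found the following solution in a similar post: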
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

file = open('textposts_01.txt', encoding='UTF-8')
stop_words = set(stopwords.words('english'))
line = file.read()
words = line.split()
for r in words:
    if not r in stop_words:
        appendFile = open('textposts_02.txt', mode='a', encoding='UTF-8')
        appendFile.write(" "+r)
        appendFile.write("\n")
        appendFile.close()
But inserting a newline this way just puts a line break after every word, so if I start with a line like this:
0 make a list of every person you know
the result looks like this:
0
make
list
every
person
know
and I need the result to stay on one line, like this:
0 make list every person
I have been searching for a while but have not found a solution.
appendFile.write(" "+r)
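Instead of reading the whole file at once, you can loop over the file line by line, write the words you keep from that line, and add a newline only after each line is finished.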
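A minimal sketch of that line-by-line approach, under a few assumptions: the columns are separated by tab characters, stop words should be matched case-insensitively (the original code compares case-sensitively), and the output file is written once rather than reopened in append mode for every word. The helper names `remove_stop_words` and `filter_file` are made up for illustration and are not part of NLTK.

```python
def remove_stop_words(line, stop_words):
    """Drop stop words from one tab-separated line, keeping its columns."""
    # Split on tabs first so the column boundaries survive the filtering.
    columns = line.rstrip("\n").split("\t")
    cleaned = [
        # Assumption: compare in lower case, since NLTK's English list is lower case.
        " ".join(w for w in column.split() if w.lower() not in stop_words)
        for column in columns
    ]
    return "\t".join(cleaned)


def filter_file(src_path, dst_path, stop_words):
    """Copy src_path to dst_path one line at a time, without stop words."""
    with open(src_path, encoding="UTF-8") as src, \
            open(dst_path, mode="w", encoding="UTF-8") as dst:
        for line in src:
            # One newline per input line, not per word.
            dst.write(remove_stop_words(line, stop_words) + "\n")
```

With NLTK's stop word list already downloaded, calling `filter_file('textposts_01.txt', 'textposts_02.txt', set(stopwords.words('english')))` should produce one output row per input row. Note that mode `'w'` overwrites `textposts_02.txt` on each run, whereas the original `'a'` mode keeps appending to whatever is already there.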