I wrote the code below, and now I want to preprocess the tweets: I converted everything to lowercase and tried to remove stop words, but it doesn't work. I also want to remove '@' and '#', and to remove the user names. Is that possible? Can you help me?
! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/tweets_en.txt'
wget.download(url, 'tweets_en.txt')
tweets = [line.strip() for line in open('tweets_en.txt', encoding='utf8')]
import spacy
from collections import Counter
# your code here
import itertools
nlp = spacy.load('en_core_web_sm')  # the bare 'en' shortcut no longer works in spaCy 3
#Creates a list of lists of tokens
tokens = [[token.text for token in nlp(sentence)] for sentence in tweets[:200]]
print(tokens)
#to lower
token_l=[[w.lower() for w in line] for line in tokens]
token_l[:1]
#remove #
#remove stop word
#remove user
#remove @
from nltk.corpus import stopwords
# filtered_words = [[w for w in line] for line in tokens if w not in stopwords.words('english')]
Always try to organize your code into functions: they are reusable, readable, and easy to loop over.
A simple example in pure Python:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # make sure the stop-word corpus is available

users = ['jeff_atwood', 'joel_spolsky', 'anon']
stop_words = [w.lower() for w in stopwords.words('english')]

def sanitize(input_string):
    """Sanitize one string."""
    string = input_string
    # normalize to lowercase
    string = string.lower()
    # remove '#' and '@'
    for punc in '@#':
        string = string.replace(punc, '')
    # remove stop words and user names
    to_remove = stop_words + users
    return ' '.join(w for w in string.split() if w not in to_remove)

# note: don't call this variable `list` — that shadows the built-in
examples = ['@Jeff_Atwood @Joel_Spolsky Thank you for #stackoverflow', '@anon All hail #stackoverflow']
examples_sanitized = [sanitize(string) for string in examples]
Output:
['thank stackoverflow', 'hail stackoverflow']
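If you would rather delete the whole @mention (the '@' plus the user name) instead of maintaining a hard-coded user list, a regex variant works too. This is a minimal standard-library sketch; the `stop_words` set here is a tiny stand-in for NLTK's `stopwords.words('english')`:

```python
import re

stop_words = {'you', 'for', 'all'}  # stand-in for stopwords.words('english')

def sanitize_re(text):
    """Drop @mentions entirely, strip '#' but keep the tag text,
    lowercase, then remove stop words."""
    text = re.sub(r'@\w+', '', text)  # delete the whole @mention
    text = text.replace('#', '')      # keep the hashtag word, drop the symbol
    return ' '.join(w for w in text.lower().split() if w not in stop_words)

print(sanitize_re('@Jeff_Atwood @Joel_Spolsky Thank you for #stackoverflow'))
# thank stackoverflow
```

The advantage is that new user names are handled automatically, since any `@word` token is removed without listing it in `users` first.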