Error when removing punctuation from a string using string.punctuation

Question

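Quick question:

I'm using string and nltk.stopwords to strip a block of text of all its punctuation and stop words, as part of data preprocessing before feeding it into some natural language processing algorithms.

I've tested each component separately on a couple of raw text chunks, since I'm still getting used to the process, and it seemed fine.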

    import string
    from nltk.corpus import stopwords

    def text_process(text):
        """
        Takes in string of text, and does following operations:
        1. Removes punctuation.
        2. Removes stopwords.
        3. Returns a list of cleaned "tokenized" text.
        """
        nopunc = [char for char in text.lower() if char not in string.punctuation]

        nopunc = ''.join(nopunc)

        return [word for word in nopunc.split() if word not in
                stopwords.words('english')]

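But when I apply this function to the text column of my dataframe (text from a bunch of Pitchfork reviews), I can see that the punctuation isn't actually being removed, although the stop words are.

Unprocessed: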

    pitchfork['content'].head(5)

    0    “Trip-hop” eventually became a ’90s punchline,...
    1    Eight years, five albums, and two EPs in, the ...
    2    Minneapolis’ Uranium Club seem to revel in bei...
    3    Minneapolis’ Uranium Club seem to revel in bei...
    4    Kleenex began with a crash. It transpired one ...
    Name: content, dtype: object

Processed:

    pitchfork['content'].head(5).apply(text_process)


    0    [“triphop”, eventually, became, ’90s, punchlin...
    1    [eight, years, five, albums, two, eps, new, yo...
    2    [minneapolis’, uranium, club, seem, revel, agg...
    3    [minneapolis’, uranium, club, seem, revel, agg...
    4    [kleenex, began, crash, it, transpired, one, n...
    Name: content, dtype: object

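Any thoughts on what is going wrong here? I've looked through the documentation but haven't seen anyone struggling with this problem in exactly the same way, so I'd love some insight into how to tackle it. Thanks so much!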

Answer 1

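The problem here is that the text contains left and right ("curly") quotation marks, both single and double, which are separate Unicode characters from the regular ASCII quotes. string.punctuation contains only ASCII punctuation, so the curly quotes are never matched.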
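A quick check makes the mismatch visible. This is only an illustrative sketch for Python 3; the names `curly_quotes` and `missing` are made up for the example and are not part of the original code.

```python
import string

# The four curly quotes that show up in the review text:
# left/right double quotes and left/right single quotes.
curly_quotes = "\u201c\u201d\u2018\u2019"

# Print each character, its code point, and whether string.punctuation has it.
for ch in ['"', "'"] + list(curly_quotes):
    print(repr(ch), f"U+{ord(ch):04X}", ch in string.punctuation)

# Collect the curly quotes that string.punctuation does not cover.
missing = [ch for ch in curly_quotes if ch not in string.punctuation]
print(missing)
```

The ASCII quotes report `True`, while every curly quote ends up in `missing`, which is why they survive the filter in `text_process`.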

I'd do something like this:

    punctuation = [c for c in string.punctuation] + [u'\u201c', u'\u201d', u'\u2018', u'\u2019']
    nopunc = [char for char in text.decode('utf-8').lower() if char not in punctuation]

This adds the Unicode code points for the non-ASCII quotes to a list called punctuation, decodes the text from UTF-8, and then filters those characters out.

Note: this is Python 2. In Python 3, strings are already Unicode, so the .decode('utf-8') call and the u prefix are unnecessary.
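For Python 3, the same idea might look roughly like the sketch below. It is not the answerer's code: the `stop_words` parameter, the `EXTRA_PUNCTUATION` constant, and the one-word stop-word set are assumptions made so the example runs without NLTK. With NLTK installed, one would pass `set(stopwords.words('english'))` instead.

```python
import string

# ASCII punctuation plus the four curly quotes seen in the review text.
# EXTRA_PUNCTUATION is a hypothetical name; extend it if the data contains
# other non-ASCII marks (dashes, ellipses, and so on).
EXTRA_PUNCTUATION = "\u201c\u201d\u2018\u2019"
PUNCTUATION = set(string.punctuation) | set(EXTRA_PUNCTUATION)


def text_process(text, stop_words=frozenset()):
    """Lowercase the text, drop punctuation (curly quotes included),
    then drop any token found in stop_words."""
    # In Python 3, str is already Unicode, so no .decode('utf-8') is needed.
    nopunc = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    return [word for word in nopunc.split() if word not in stop_words]


# Illustrative input modelled on the first review; the stop-word set here
# is a stand-in for NLTK's English list.
sample = "\u201cTrip-hop\u201d eventually became a \u201990s punchline"
tokens = text_process(sample, stop_words={"a"})
print(tokens)  # ['triphop', 'eventually', 'became', '90s', 'punchline']
```

Building the stop-word collection once as a set and passing it in also avoids calling `stopwords.words('english')` for every token, which the original function does.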
