Why doesn't this string.punctuation code strip the punctuation?

Question (votes: 1, answers: 1)

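I'm confused about why this code doesn't work the way I want. I'm reading a txt file and printing each item (separated by commas) on a new line. Each item is wrapped in quotation marks and also contains other punctuation, which I'm trying to remove. I'm familiar with string.punctuation and got it working on a test string, but it fails on the items I'm iterating over. See below: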

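import string


def read_word_lists(path):
    with open(path, encoding='utf-8') as f:
        lines = f.readlines()
        # Split the first line on commas and handle each item separately
        for line in lines[0].split(','):
            line = str(line)
            line = line.strip().lower()
            # Attempt to strip punctuation (this loops over single characters)
            print(''.join(word.strip(string.punctuation) for word in line))
            # The item as read from the file, for comparison
            print(line)
            # The same expression applied to a hard-coded ASCII test string
            print(''.join(word.strip(string.punctuation) for word in '"why, does this work?! and not above?"'))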


read_word_lists('file.txt')

The output looks like this:

trying to strip punctuation:  “you never”
originial:  “you never”
test:  why does this work and not above
trying to strip punctuation:  “you always
originial:  “you always"
test:  why does this work and not above
trying to strip punctuation:  ” “your problem is”
originial:  ” “your problem is”
test:  why does this work and not above
trying to strip punctuation:  “the trouble with you is”
originial:  “the trouble with you is”
test:  why does this work and not above

Any idea why the "trying to strip punctuation" output isn't working?

Edit

In case it helps, the original file looks like this:

"YOU NEVER”, “YOU ALWAYS", ” “YOUR PROBLEM IS”, “THE TROUBLE WITH YOU IS”

python regex string nlp
1 answer
0
votes

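You are trying to strip Unicode punctuation, but string.punctuation contains only ASCII punctuation. The curly quotes in your file (“ and ”) are not ASCII characters, so they are left in place, while the straight ASCII quote (") is removed.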
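A quick way to check this is to look at the characters directly. The sketch below is only an illustration (the `checks` name and the three sample characters are picked for this example). It tests whether each quote character is in string.punctuation and looks up its Unicode general category:

```python
import string
import unicodedata

# Sample characters: the ASCII double quote and the two curly quotes from the file
samples = ['"', '\u201c', '\u201d']

# Map each character to (is it in string.punctuation?, Unicode general category)
checks = {ch: (ch in string.punctuation, unicodedata.category(ch)) for ch in samples}

for ch, (is_ascii_punct, category) in checks.items():
    print(repr(ch), is_ascii_punct, category)
```

The curly quotes fall in the categories Pi and Pf (initial and final quote punctuation), so Unicode classifies them as punctuation, yet neither appears in string.punctuation.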

Instead of string.punctuation, you can use the code below to build a string containing all Unicode punctuation characters:

import sys
import unicodedata

# Every code point whose Unicode general category starts with 'P' is punctuation
punctuation = "".join(chr(i) for i in range(sys.maxunicode + 1)
                      if unicodedata.category(chr(i)).startswith('P'))
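As a rough sketch of how this could be applied to the sample line from the question: the helper name `strip_punctuation` and the inline `raw` string are assumptions made for illustration, not part of the original code. Whitespace is included in the strip set because one item begins with a curly quote, a space, and another curly quote.

```python
import string
import sys
import unicodedata

# All Unicode punctuation: code points whose general category starts with 'P'
UNICODE_PUNCTUATION = "".join(
    chr(i) for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)


def strip_punctuation(item):
    """Hypothetical helper: trim punctuation and whitespace from both ends."""
    return item.strip(UNICODE_PUNCTUATION + string.whitespace).lower()


# Sample line copied from the question; in practice this would come from the file
raw = '"YOU NEVER”, “YOU ALWAYS", ” “YOUR PROBLEM IS”, “THE TROUBLE WITH YOU IS”'

cleaned = [strip_punctuation(item) for item in raw.split(',')]
for item in cleaned:
    print(item)
```

Note that this only trims the ends of each item. The expression in the question loops over single characters, so it removes ASCII punctuation anywhere in the string, which is why the hard-coded test string came out clean.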

Good luck!
