Removing special characters and digits from Arabic text using Python

Question

I only want to keep the Arabic characters, with no digits. I got this regular expression from GitHub.

    import os
    import re

    generalPath = "C:/Users/Desktop/Code/dataset/"
    outputPath = "C:/Users/Desktop/Code/output/"
    files = os.listdir(generalPath)

    for onefile in files:
        # relative or absolute file path, e.g.:
        localPath = generalPath + onefile
        localOutputPath = outputPath + onefile
        print(localPath)
        print(localOutputPath)
        with open(localPath, 'rb') as infile, open(localOutputPath, 'w') as outfile:
            data = infile.read().decode('utf-8')
            new_data = t = re.sub(r'[^0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD]+', ' ', data)
            outfile.write(new_data)

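With this code I get the following error:

    Traceback (most recent call last):
      File ".\cleanText.py", line 23, in <module>
        outfile.write(new_data)
      File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>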

My Arabic text is diacritized, and I want to keep it that way.

Answer 1

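It looks like your program is falling back to the CP1252 encoding instead of UTF-8. Specify the encoding explicitly when opening the files, as shown below. The traceback comes from outfile.write, so the output file needs it as well. Also, since these are text files, you can read with 'r' instead of 'rb'; infile.read() then already returns a str, and the .decode('utf-8') call should be dropped.

    with open(localPath, 'r', encoding='utf8') as infile, open(localOutputPath, 'w', encoding='utf8') as outfile: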
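To show the encoding fix in context, here is a minimal sketch of the whole loop. The function name `clean_folder` and its `clean` parameter are hypothetical and not part of the question's code. It assumes every input file is valid UTF-8.

```python
from pathlib import Path


def clean_folder(src_dir, dst_dir, clean):
    """Apply clean(text) to every file in src_dir and write the result to dst_dir.

    clean_folder and clean are illustrative names, not from the original post.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in src_dir.iterdir():
        if not src.is_file():
            continue
        # encoding= is passed on both the read and the write side, so neither
        # falls back to the platform default (cp1252 in the traceback above).
        text = src.read_text(encoding='utf-8')
        (dst_dir / src.name).write_text(clean(text), encoding='utf-8')
```

It could be called as `clean_folder(generalPath, outputPath, my_cleaning_function)`, where the last argument is any function that takes and returns a str.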

As for the regular expression, if you only want to remove digits, you can use

    data = re.sub(r'[0-9]+', '', data)
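One caveat worth illustrating: `[0-9]` covers only the ASCII digits. If the text might also contain Arabic-Indic digits (U+0660 to U+0669), which is an assumption about the data and not something the question states, `\d` is a possible alternative, since in Python 3 it matches any Unicode decimal digit when applied to a str. The helper name `strip_digits` is made up for this sketch.

```python
import re


def strip_digits(text):
    # On a str pattern, \d matches every Unicode decimal digit (category Nd),
    # so both ASCII digits and Arabic-Indic digits are removed.
    return re.sub(r'\d+', '', text)


sample = "a1b\u0662c"  # contains ASCII '1' and ARABIC-INDIC DIGIT TWO
ascii_only = re.sub(r'[0-9]+', '', sample)  # leaves the Arabic-Indic digit in place
all_digits = strip_digits(sample)           # removes both digits
```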

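You don't need to list the entire Arabic alphabet as the characters to keep. However, it looks like you have strings such as "(1/6)". To remove all parentheses and slashes as well, use:

    data = re.sub(r'[0-9\(\)/]+', '', data)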
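If you would still prefer the whitelist approach from the question, note that its negated character class contains `0-9`, so ASCII digits count as characters to keep. The sketch below reuses the question's Arabic ranges without `0-9` and strips digits in a separate pass, because the range U+0600 to U+06FF itself includes the Arabic-Indic digits. The names `NON_ARABIC`, `DIGITS` and `keep_arabic_only` are hypothetical, and replacing removed runs with a single space is a choice carried over from the question's code.

```python
import re

# Arabic ranges copied from the question's pattern, with the ASCII digits 0-9 left out.
NON_ARABIC = re.compile(
    r'[^\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f'
    r'\ufd50-\ufd8f\ufe70-\ufefc\ufdf0-\ufdfd]+'
)
# \d also matches Arabic-Indic digits, which the first range above would keep.
DIGITS = re.compile(r'\d+')


def keep_arabic_only(text):
    """Replace digits and every run of non-Arabic characters with one space."""
    text = DIGITS.sub(' ', text)
    text = NON_ARABIC.sub(' ', text)
    return text.strip()
```

Diacritics such as fatha or sukun fall inside U+0600 to U+06FF, so diacritized text passes through unchanged.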
