Python re.sub adds extra spaces to Unicode characters

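I'm trying to search for and replace some Unicode characters in a text file, using regular expressions driven by a dictionary. For some reason, extra spaces are being added in the process.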

My code:

# coding=utf8

import re

syr_unicodes_dict = {
    '\u0712\u073F': '\u0712\u0741', # LETTER BETH + RWAHA -> QUSHSHAYA
    '\u0713\u073F': '\u0713\u0741', # LETTER GAMAL + RWAHA -> QUSHSHAYA
    '\u0715\u073F': '\u0715\u0741', # LETTER DALATH + RWAHA -> QUSHSHAYA
    '\u071F\u073F': '\u071F\u0741', # LETTER KAPH + RWAHA -> QUSHSHAYA
    '\u0726\u073F': '\u0726\u0741', # LETTER PE + RWAHA -> QUSHSHAYA
    '\u072C\u073F': '\u072C\u0741', # LETTER TAW + RWAHA -> QUSHSHAYA
    '\u0712\u073C': '\u0712\u0742', # LETTER BETH + HBASA-ESASA DOTTED -> RUKKAKHA
    '\u0713\u073C': '\u0713\u0742', # LETTER GAMAL + HBASA-ESASA DOTTED -> RUKKAKHA
    '\u0715\u073C': '\u0715\u0742', # LETTER DALATH + HBASA-ESASA DOTTED -> RUKKAKHA
    '\u071F\u073C': '\u071F\u0742', # LETTER KAPH + HBASA-ESASA DOTTED -> RUKKAKHA
    '\u0726\u073C': '\u0726\u0742', # LETTER PE + HBASA-ESASA DOTTED -> RUKKAKHA
    '\u072C\u073C': '\u072C\u0742' # LETTER TAW + HBASA-ESASA DOTTED -> RUKKAKHA
}

print('length of syr_unicodes_dict is ' + str(len(syr_unicodes_dict)))

text_file = open('./matthew.txt', 'r', encoding = 'utf-8')
revised_text_file = open('./matthew_fixed.txt', 'w')

with text_file, revised_text_file:
    for line in text_file:
        for old_value, new_value in (syr_unicodes_dict.items()):
            new_line = re.sub(r''+old_value+'', r''+new_value+' ', line, 1)
            line = new_line
        revised_text_file.write(new_line)

My sample text:

ܟܬܼܳܒܼܳܐ ܕܺܝܠܺܝܕܼܽܘܬܼܶܗ ܕܝܶܫܽܘܥ ܡܫܺܝܚܳܐ܃ ܒܪܶܗ ܕܕܼܰܘܺܝܕܼ܂ ܒܪܶܗ ܕܰܐܒܼܪܳܗܳܡ܀܀

What I get:

ܟܬ݂ ܳܒ݂ ܳܐ ܕ݁ ܝܠܺܝܕ݂ ܽܘܬܼܶܗ ܕܝܶܫܽܘܥ ܡܫܺܝܚܳܐ܃ ܒܪܶܗ ܕܕܼܰܘܺܝܕܼ܂ ܒܪܶܗ ܕܰܐܒܼܪܳܗܳܡ܀܀

What I should get:

ܟܬ݂ܳܒ݂ܳܐ ܕܺܝܠܺܝܕ݂ܽܘܬ݂ܶܗ ܕܝܶܫܽܘܥ ܡܫܺܝܚܳܐ܃ ܒܪܶܗ ܕܕ݂ܰܘܺܝܕ݂܂ ܒܪܶܗ ܕܰܐܒ݂ܪܳܗܳܡ܀܀
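
A likely cause is visible in the `re.sub` call in the question. The replacement is built as `r''+new_value+' '`, so every substitution appends a literal space. The fourth positional argument, `1`, is `count`, which limits each pair to its first match per line and would explain why later occurrences in the sample output appear unchanged. The output file is also opened without `encoding='utf-8'`, so it falls back to the platform default encoding. The sketch below is one possible way to address all three, assuming the same twelve mappings as the question. The names `LETTERS`, `MARKS`, `fix_line` and `fix_file` are invented here for illustration and are not part of the original code.

```python
import re

# The six letters and two mark substitutions from the question's dictionary.
LETTERS = '\u0712\u0713\u0715\u071F\u0726\u072C'  # BETH, GAMAL, DALATH, KAPH, PE, TAW
MARKS = {
    '\u073F': '\u0741',  # RWAHA -> QUSHSHAYA
    '\u073C': '\u0742',  # HBASA-ESASA DOTTED -> RUKKAKHA
}

# Same twelve old -> new pairs as syr_unicodes_dict, built programmatically.
syr_unicodes_dict = {
    letter + old: letter + new
    for letter in LETTERS
    for old, new in MARKS.items()
}

# One alternation that matches any of the old two-character sequences.
pattern = re.compile('|'.join(re.escape(key) for key in syr_unicodes_dict))


def fix_line(line):
    """Replace every old sequence in line, adding no extra characters."""
    return pattern.sub(lambda m: syr_unicodes_dict[m.group(0)], line)


def fix_file(src, dst):
    """Copy src to dst with the replacements applied; both files are UTF-8."""
    with open(src, encoding='utf-8') as fin, \
            open(dst, 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(fix_line(line))
```

With the file names from the question, this would be called as `fix_file('./matthew.txt', './matthew_fixed.txt')`. Compiling a single pattern means each line is scanned once instead of once per dictionary entry, and the replacement is looked up from the matched text, so nothing beyond the mapped value is written.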

python regex unicode