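I am trying to use a dictionary with regular expressions to search for and replace some Unicode characters in a text file. I don't know why, but some extra spaces are added in the process.
My code: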
# coding=utf8
import re
syr_unicodes_dict = {
    '\u0712\u073F': '\u0712\u0741', # LETTER BETH + RWAHA -> QUSHSHAYA
    '\u0713\u073F': '\u0713\u0741', # LETTER GAMAL + RWAHA -> QUSHSHAYA
    '\u0715\u073F': '\u0715\u0741', # LETTER DALATH + RWAHA -> QUSHSHAYA
    '\u071F\u073F': '\u071F\u0741', # LETTER KAPH + RWAHA -> QUSHSHAYA
    '\u0726\u073F': '\u0726\u0741', # LETTER PE + RWAHA -> QUSHSHAYA
    '\u072C\u073F': '\u072C\u0741', # LETTER TAW + RWAHA -> QUSHSHAYA
    '\u0712\u073C': '\u0712\u0742', # LETTER BETH + HBASA-ESASA DOTTED -> RUKKAKHA
    '\u0713\u073C': '\u0713\u0742', # LETTER GAMAL + HBASA-ESASA DOTTED -> RUKKAKHA
    '\u0715\u073C': '\u0715\u0742', # LETTER DALATH + HBASA-ESASA DOTTED -> RUKKAKHA
    '\u071F\u073C': '\u071F\u0742', # LETTER KAPH + HBASA-ESASA DOTTED -> RUKKAKHA
    '\u0726\u073C': '\u0726\u0742', # LETTER PE + HBASA-ESASA DOTTED -> RUKKAKHA
    '\u072C\u073C': '\u072C\u0742' # LETTER TAW + HBASA-ESASA DOTTED -> RUKKAKHA
}
print('length of syr_unicodes_dict is ' + str(len(syr_unicodes_dict)))
text_file = open('./matthew.txt', 'r', encoding = 'utf-8')
revised_text_file = open('./matthew_fixed.txt', 'w')
with text_file, revised_text_file:
    for line in text_file:
        for old_value, new_value in (syr_unicodes_dict.items()):
            new_line = re.sub(r''+old_value+'', r''+new_value+' ', line, 1)
            line = new_line
        revised_text_file.write(new_line)
A sample of my text:
ܟܬܼܳܒܼܳܐ ܕܺܝܠܺܝܕܼܽܘܬܼܶܗ ܕܝܶܫܽܘܥ ܡܫܺܝܚܳܐ܃ ܒܪܶܗ ܕܕܼܰܘܺܝܕܼ܂ ܒܪܶܗ ܕܰܐܒܼܪܳܗܳܡ܀܀
What I get:
ܟܬ݂ ܳܒ݂ ܳܐ ܕ݁ ܝܠܺܝܕ݂ ܽܘܬܼܶܗ ܕܝܶܫܽܘܥ ܡܫܺܝܚܳܐ܃ ܒܪܶܗ ܕܕܼܰܘܺܝܕܼ܂ ܒܪܶܗ ܕܰܐܒܼܪܳܗܳܡ܀܀
What I should get:
ܟܬ݂ܳܒ݂ܳܐ ܕܺܝܠܺܝܕ݂ܽܘܬ݂ܶܗ ܕܝܶܫܽܘܥ ܡܫܺܝܚܳܐ܃ ܒܪܶܗ ܕܕ݂ܰܘܺܝܕ݂܂ ܒܪܶܗ ܕܰܐܒ݂ܪܳܗܳܡ܀܀
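For reference, here is a minimal sketch of one way the substitution could be written so that it avoids both symptoms. It assumes the extra spaces come from the `' '` appended to the replacement string in `re.sub`, and that the unreplaced marks later in the line come from the count argument of `1`, which limits each pattern to one replacement per line. The names `SYR_FIXES`, `PATTERN`, and `fix_line` are hypothetical, and the mapping is shortened to two entries from the dictionary above for illustration.

```python
import re

# Hypothetical, shortened mapping for illustration; the full
# syr_unicodes_dict from the code above would be used in practice.
SYR_FIXES = {
    '\u0712\u073F': '\u0712\u0741',  # LETTER BETH + RWAHA -> QUSHSHAYA
    '\u072C\u073C': '\u072C\u0742',  # LETTER TAW + HBASA-ESASA DOTTED -> RUKKAKHA
}

# One compiled alternation over all keys, so each line is scanned once.
PATTERN = re.compile('|'.join(re.escape(key) for key in SYR_FIXES))


def fix_line(line):
    """Replace every matching letter + mark pair, adding no extra characters."""
    # No count argument: all occurrences in the line are replaced.
    return PATTERN.sub(lambda match: SYR_FIXES[match.group(0)], line)
```

Each line read from the input file could then be passed through `fix_line` before being written out. Opening the output file with `encoding='utf-8'` as well may also be worth considering, since the platform default encoding is not always UTF-8.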