Python 3 and combining diacritics

Question

I'm running into a Unicode problem in Python 3 and can't work out why it happens.

symbol= "ῇ̣"
print(len(symbol))
# prints: 2

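This letter comes from the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ, which contains combining diacritics. I want to run a statistical analysis in Python 3 and store the results in a database, and I also store each character's position (index) in the text. The database application correctly counts the symbol variable in the example as one character, whereas Python counts it as two, which throws off the whole index.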

The project requires me to keep the diacritics, so I can't simply ignore them or do a .replace("combining diacritical mark", "") on the string.

Since Python 3 uses Unicode as the default for strings, I'm a bit stumped.

I tried the base(), strip() and strip_length() methods from greek-accentuation (https://pypi.org/project/greek-accentuation/), but that didn't help either.

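The project requirements are:

Detect the alphabet each character belongs to (OK)
Store the string position, needed for highlighting in the database (NOT OK)
Handle multiple languages/alphabets mixed in one string (OK)
Iterate over CSV input (OK)
Ignore a predefined set of strings (OK)
Ignore sets of strings that match certain criteria (OK)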

Here is a simplified version of the project code:

# -*- coding: utf-8 -*-
import csv
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
with open("tbltext.csv", "r", encoding="utf8") as txt:
    data = csv.reader(txt)
    for row in data:
        text = row[1]
        ### Here I have some string manipulation (lowering everything, replacing the predefined set of strings by equal-length '-',...)
        ###then I use the ad-module to detect the language by looping over my characters, this is where it goes wrong.
        for letter in text:
            lang = ad.detect_alphabet(letter)

If I run the for loop over the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ as an example, the result is:

>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> for letter in word:
...     print(letter)
...
ἐ
̣
ν
̣
τ
̣
ῇ
̣
[
α
ὐ
τ
]
ῇ

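How can I get Python to treat a letter together with its diacritical mark as one letter, instead of printing the letter and the diacritic separately?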

Answer 1

The string has length 2, and that is correct: it consists of two code points:

>>> import unicodedata
>>> list(hex(ord(c)) for c in symbol)
['0x1fc7', '0x323']
>>> list(unicodedata.name(c) for c in symbol)
['GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI', 'COMBINING DOT BELOW']

So you should not use len to count characters.

You can count only the characters that are not combining marks, like this:

>>> import unicodedata
>>> len(''.join(ch for ch in symbol if unicodedata.combining(ch) == 0))
1

From: How do I get the "visible" length of a combining Unicode string in Python? (I ported it to Python 3.)
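Since you also need positions, the same idea can be extended to group each base character with the combining marks that follow it. This is a minimal sketch, not a full grapheme-cluster implementation: iter_clusters is a hypothetical helper name, and the example word is written with escape sequences on the assumption that its letters are precomposed, as the code points printed above suggest.

```python
import unicodedata

def iter_clusters(text):
    """Yield (index, cluster) pairs; index counts clusters, not code points.

    A cluster is one non-combining character plus any combining marks
    that directly follow it (hypothetical helper, simplified rule).
    """
    cluster = ""
    index = 0
    for ch in text:
        if cluster and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the current base character.
            cluster += ch
        else:
            # New base character: emit the previous cluster first.
            if cluster:
                yield index, cluster
                index += 1
            cluster = ch
    if cluster:
        yield index, cluster

# The example word, assumed to use precomposed letters plus U+0323.
word = "\u1f10\u0323\u03bd\u0323\u03c4\u0323\u1fc7\u0323[\u03b1\u1f50\u03c4]\u1fc7"
clusters = list(iter_clusters(word))
for position, cluster in clusters:
    print(position, cluster)
```

Each cluster can then be passed to the alphabet detection as a unit, or only its first character if the detector expects a single code point. For scripts where this simple rule is not enough, the third-party regex module can match extended grapheme clusters with \X.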

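But this is not the best solution either; it depends on what you count as a character. I think it is sufficient in your case, but fonts can merge characters into ligatures. In some languages these are visually new (and very different) characters, unlike ligatures in Western languages.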

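As a final comment: I think you should normalize the strings. With the code above it makes no difference in this case, but in other cases you could get different results, especially if someone used compatibility characters (e.g. mu for units, or Eszett, instead of true Greek characters).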
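To illustrate the normalization point, here is a small sketch using unicodedata.normalize from the standard library. The helper name visible_length and the sample strings are my own choices for illustration. It shows that the same visible letter can be stored as different code point sequences, and that the micro sign is a compatibility character that only NFKC/NFKD map to the Greek letter mu.

```python
import unicodedata

def visible_length(text):
    """Count the code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

composed = "\u1fc7"                # precomposed eta with perispomeni and ypogegrammeni
decomposed = "\u03b7\u0342\u0345"  # eta + combining perispomeni + combining ypogegrammeni

# The two strings render the same but compare unequal until normalized.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
print(visible_length(composed), visible_length(decomposed))

micro = "\u00b5"  # MICRO SIGN, a compatibility character
mu = "\u03bc"     # GREEK SMALL LETTER MU
print(unicodedata.normalize("NFC", micro) == mu)   # canonical forms keep the micro sign
print(unicodedata.normalize("NFKC", micro) == mu)  # compatibility forms map it to mu
```

Whichever form you pick, apply it consistently before computing positions, so that the indices stored in the database refer to the same sequence every time.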
