如何控制包含东亚字符的Unicode字符串的填充

Question

我有三个UTF-8 st：

hello, world
hello, 世界
hello, 世rld

我只想要前10个ascii-char-width，以便将括号放在一栏中：

[hello, wor]
[hello, 世 ]
[hello, 世r]

在控制台中：

width('世界')==width('worl')
width('世 ')==width('wor')  #a white space behind '世'

一个中文字符为三个字节，但是在控制台中显示时只有2个ascii字符宽度：

>>> bytes("hello, 世界", encoding='utf-8')
b'hello, \xe4\xb8\x96\xe7\x95\x8c'

当混入UTF-8字符时，python的format()没有帮助

>>> for s in ['[{0:<{1}.{1}}]'.format(s, 10) for s in ['hello, world', 'hello, 世界', 'hello, 世rld']]:
...    print(s)
...
[hello, wor]
[hello, 世界 ]
[hello, 世rl]

不太漂亮：

 -----------Songs-----------
|    1: 蝴蝶                  |
|    2: 心之城                 |
|    3: 支持你的爱人              |
|    4: 根生的种子               |
|    5: 鸽子歌(CUCURRUCUCU PALO|
|    6: 林地之间                |
|    7: 蓝光                  |
|    8: 在你眼里                |
|    9: 肖邦离别曲               |
|   10: 西行( 魔戒王者再临主题曲)(INTO |
| X 11: 深陷爱河                |
| X 12: 钟爱大地(THE MO RUN AIR |
| X 13: 时光流逝                |
| X 14: 卡农                  |
| X 15: 舒伯特小夜曲(SERENADE)    |
| X 16: 甜蜜的摇篮曲(Sweet Lullaby|
 ---------------------------

所以，我想知道是否有标准的方法来执行UTF-8填充人员？

Answer 1

# coding: utf8

# full width versions (SPACE is non-contiguous with ! through ~)
SPACE = '\N{IDEOGRAPHIC SPACE}'
EXCLA = '\N{FULLWIDTH EXCLAMATION MARK}'
TILDE = '\N{FULLWIDTH TILDE}'

# strings of ASCII and full-width characters (same order)
west = ''.join(chr(i) for i in range(ord(' '),ord('~')))
east = SPACE + ''.join(chr(i) for i in range(ord(EXCLA),ord(TILDE)))

# build the translation table
full = str.maketrans(west,east)

data = '''\
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行（魔戒王者再临主题曲）(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)
'''

# Replace the ASCII characters with full width, and create a song list.
data = data.translate(full).rstrip().split('\n')

# translate each printable line.
print(' ----------Songs-----------'.translate(full))
for i,song in enumerate(data):
    line = '|{:4}: {:20.20}|'.format(i+1,song)
    print(line.translate(full))
print(' --------------------------'.translate(full))

输出

　－－－－－－－－－－Ｓｏｎｇｓ－－－－－－－－－－－
｜　　　１：　蝴蝶（Ａ　ｓｏｎｇ）　　　　　　　　　　｜
｜　　　２：　心之城（Ａｎｏｔｈｅｒ　ｓｏｎｇ）　　　｜
｜　　　３：　支持你的爱人（Ｙｅｔ　ａｎｏｔｈｅｒ　ｓ｜
｜　　　４：　根生的种子　　　　　　　　　　　　　　　｜
｜　　　５：　鸽子歌（Ｃｕｃｕｒｒｕｃｕｃｕ　ｐａｌｏ｜
｜　　　６：　林地之间　　　　　　　　　　　　　　　　｜
｜　　　７：　蓝光　　　　　　　　　　　　　　　　　　｜
｜　　　８：　在你眼里　　　　　　　　　　　　　　　　｜
｜　　　９：　肖邦离别曲　　　　　　　　　　　　　　　｜
｜　　１０：　西行（魔戒王者再临主题曲）（Ｉｎｔｏ　ｓ｜
｜　　１１：　深陷爱河　　　　　　　　　　　　　　　　｜
｜　　１２：　钟爱大地　　　　　　　　　　　　　　　　｜
｜　　１３：　时光流逝　　　　　　　　　　　　　　　　｜
｜　　１４：　卡农　　　　　　　　　　　　　　　　　　｜
｜　　１５：　舒伯特小夜曲（ＳＥＲＥＮＡＤＥ）　　　　｜
｜　　１６：　甜蜜的摇篮曲（Ｓｗｅｅｔ　Ｌｕｌｌａｂｙ｜
　－－－－－－－－－－－－－－－－－－－－－－－－－－

不是太漂亮，但是排列。

Answer 2

也许我不明白您的问题，但是看起来您得到的输出是

完全您想要的，

除外汉字的字体更宽。

所以UTF-8是一个鲱鱼，因为我们不是在谈论bytes，而是在谈论
characters
。您使用的是Python 3，因此所有字符串都是Unicode。基本字节表示形式（其中每个汉字由三个字节表示）无关紧要。您想将每个字符串剪切或填充为正好10个字符，并且工作正常：>>> len('hello, wor') 10 >>> len('hello, 世界 ') 10 >>> len('hello, 世rl') 10
唯一的问题是您正在使用看起来是等宽字体的字体来查看它，但实际上是
不是
。大多数等宽字体都有此问题。该字体中所有普通拉丁字符的宽度完全相同，但是中文字符稍宽。因此，三个字符"世界 "比三个字符"wor"占用更多的水平空间。除了a）获得真正等距的字体，或b）精确计算出每个字符在字体中的宽度，并添加一些空格，这些空格大约可以将您带到字体之外，您无能为力。相同的水平位置（永远不会准确）。

Answer 3

>>> import unicodedata
>>> print unicodedata.east_asian_width(u'中')

返回的值表示category of the code point。具体来说，

W-东亚宽幅

Na-东亚狭窄
H-东亚半角（宽）
A-东亚Am昧
N-不是东亚

This answer类似问题提供了一种快速的解决方案。但是请注意，显示结果取决于所使用的确切等宽字体。在Windows控制台正常的情况下，ipython和pydev使用的默认字体无法正常工作。

Answer 4

data = '''\
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行（魔戒王者再临主题曲）(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)'''

width = 80

def get_aligned_string(string,width):
    string = "{:{width}}".format(string,width=width)
    bts = bytes(string,'utf-8')
    string = str(bts[0:width],encoding='utf-8',errors='backslashreplace')
    new_width = len(string) + int((width - len(string))/2)
    if new_width!=0:
        string = '{:{width}}'.format(str(string),width=new_width)
    return string

for i,line in enumerate(data.split('\n')):
    song = get_aligned_string(line,width)
    line = '|{:4}: {:}|'.format(i+1,song)
    print(line)

输出

如何控制包含东亚字符的Unicode字符串的填充

问题描述投票：7回答：5

5个回答

不是太漂亮，但是排列。

最新问题

如何控制包含东亚字符的Unicode字符串的填充

问题描述 投票：7回答：5

5个回答

不是太漂亮，但是排列。

最新问题

问题描述投票：7回答：5