Efficient way to cut a (UTF-8) string in Python to a given maximum byte length

Question

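To store strings in a given Oracle table (whose field lengths are defined in bytes), I need to cut them beforehand in Python 3 to a maximum length in bytes, even though the strings can contain multi-byte UTF-8 characters.

My solution is to build the result string by appending the characters of the original string one at a time and checking when the result exceeds the length limit: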

def cut_str_to_bytes(s, max_bytes):
    """
    Ensure that a string has not more than max_bytes bytes
    :param s: The string (utf-8 encoded)
    :param max_bytes: Maximal number of bytes
    :return: The cut string
    """
    def len_as_bytes(s):
        return len(s.encode(errors='replace'))

    if len_as_bytes(s) <= max_bytes:
        return s

    res = ""
    for c in s:
        old = res
        res += c
        if len_as_bytes(res) > max_bytes:
            res = old
            break
    return res

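This is obviously quite slow. What would be an efficient way to do it?

PS: I have seen "Truncate a string to a specific number of bytes in Python", but its solution using sys.getsizeof() does not give the number of bytes taken by the string's characters. It gives the size of the whole string object (Python needs some bytes to manage the string object), so it does not really help.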
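As a small illustration of the difference described above, the following sketch compares the encoded length of a string with what sys.getsizeof() reports. The sample string is an arbitrary choice, and the exact object size depends on the Python implementation and version, so no specific value is assumed for it.

```python
import sys

s = "héllo"  # 5 characters; 'é' takes 2 bytes in UTF-8

# Number of bytes the text occupies once encoded as UTF-8
encoded_len = len(s.encode("utf-8"))

# Size of the whole Python string object, including bookkeeping overhead
object_size = sys.getsizeof(s)

print(encoded_len)   # 6
print(object_size)   # implementation-dependent, larger than the encoded length
```

Only the encoded length is relevant for a database column whose limit is given in bytes.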

Answer 1

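It is valid to cut a UTF-8 string anywhere except in the middle of a multi-byte character. So if you want the longest UTF-8 string that fits within a maximum byte length, you first take the maximum number of bytes and then shorten the result for as long as it ends with an unfinished character.

Compared to your solution, which is at least O(n) because it proceeds character by character, this approach only removes at most 3 bytes from the end (since a UTF-8 character is never longer than 4 bytes).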
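To make the point about cut positions concrete, here is a minimal sketch that tries every prefix of a short encoded string and reports whether it still decodes. The helper name is_valid_utf8 and the sample text are illustrative assumptions, not part of the original answer.

```python
data = "a€".encode("utf-8")  # b'a\xe2\x82\xac': one 1-byte and one 3-byte character

def is_valid_utf8(b):
    """Return True if the bytes object decodes as UTF-8 without errors."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Prefixes of length 2 and 3 end in the middle of '€' and are rejected
for n in range(len(data) + 1):
    print(n, data[:n], is_valid_utf8(data[:n]))
```

Prefixes that end on a character boundary (lengths 0, 1 and 4 here) decode fine, and an invalid prefix becomes valid after dropping at most 3 trailing bytes.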

RFC 3629 specifies these as the valid UTF-8 byte sequences:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
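The bit patterns in the RFC 3629 table can be turned into a small lookup on the first byte of a sequence. This is only a sketch: utf8_sequence_length is a hypothetical helper name, and returning 0 for continuation or invalid lead bytes is an assumption made here for simplicity.

```python
def utf8_sequence_length(lead_byte):
    """Length in bytes of the UTF-8 sequence that starts with lead_byte.

    Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
    """
    if lead_byte & 0b10000000 == 0b00000000:   # 0xxxxxxx
        return 1
    if lead_byte & 0b11100000 == 0b11000000:   # 110xxxxx
        return 2
    if lead_byte & 0b11110000 == 0b11100000:   # 1110xxxx
        return 3
    if lead_byte & 0b11111000 == 0b11110000:   # 11110xxx
        return 4
    return 0

for ch in ("a", "é", "€", "😀"):
    first = ch.encode("utf-8")[0]
    print(ch, bin(first), utf8_sequence_length(first))
```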

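So, the simplest approach with a valid UTF-8 stream:

If the last byte is 0xxxxxxx, the string ends with a complete single-byte character and nothing needs to be removed.

Otherwise, look for the position of the 11xxxxxx byte within the last 4 bytes and use the table above to check whether the last character is complete.

So this should work: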

def cut_str_to_bytes(s, max_bytes):
    # cut it twice to avoid encoding potentially GBs of `s` just to get e.g. 10 bytes?
    b = s[:max_bytes].encode('utf-8')[:max_bytes]

    # an empty result has no last byte to inspect
    if b and b[-1] & 0b10000000:
        # negative index of the last lead byte (11xxxxxx) within the last 4 bytes
        last_11xxxxxx_index = [i for i in range(-1, -5, -1)
                               if b[i] & 0b11000000 == 0b11000000][0]
        last_11xxxxxx = b[last_11xxxxxx_index]
        if not last_11xxxxxx & 0b00100000:
            last_char_length = 2
        elif not last_11xxxxxx & 0b00010000:
            last_char_length = 3
        else:
            last_char_length = 4

        if last_char_length > -last_11xxxxxx_index:
            # remove the incomplete character
            b = b[:last_11xxxxxx_index]

    return b.decode('utf-8')

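Alternatively, you could try to decode the last bytes instead of doing the low-level work, but I am not sure the code would be any simpler that way...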
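One possible reading of the decode-based alternative, offered as a sketch and not as the answerer's own code, is to let the decoder drop the incomplete tail via the errors='ignore' handler. The function name cut_str_to_bytes_decode is hypothetical, and max_bytes is assumed to be non-negative.

```python
def cut_str_to_bytes_decode(s, max_bytes):
    """Truncate s so that its UTF-8 encoding fits in max_bytes bytes.

    Assumes max_bytes >= 0. The bytes come from a valid encode(), so the only
    part that can fail to decode is a character cut off at the very end.
    """
    # Slice the string first so that a huge input is not encoded in full
    b = s[:max_bytes].encode("utf-8")[:max_bytes]
    # errors="ignore" silently drops the trailing incomplete sequence, if any
    return b.decode("utf-8", errors="ignore")

print(cut_str_to_bytes_decode("a€", 2))     # a
print(cut_str_to_bytes_decode("a€", 4))     # a€
print(cut_str_to_bytes_decode("héllo", 2))  # h
```

This relies on the decoder instead of manual bit tests. The trade-off is that errors='ignore' would also hide any other invalid bytes, which is harmless here only because the input was just produced by encode().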
