How do I write a check in Python to see whether a file is valid UTF-8?

Question  Votes: 0  Answers: 4

As the title says, I want to check whether a given file object (opened as a binary stream) contains valid UTF-8.

Anyone?

Thanks

utf-8 python-2.x
4 Answers
32
votes
def try_utf8(data):
    """Return a unicode object on success, or None on failure."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None

data = f.read()  # f is the file object opened in binary mode
udata = try_utf8(data)
if udata is None:
    # Not UTF-8.  Do something else
    pass
else:
    # Handle unicode data
    pass
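
The approach above reads the whole file into memory before decoding. For large files, a chunked variant may be preferable. The following is a minimal sketch, not part of the original answer: the function name `is_valid_utf8_stream` and the `chunk_size` parameter are hypothetical, and it assumes `stream` is any object whose `read(n)` returns bytes. It relies on the standard library's incremental decoder, which buffers a multi-byte sequence that happens to be split across two chunks.

```python
import codecs
import io


def is_valid_utf8_stream(stream, chunk_size=64 * 1024):
    """Return True if the binary stream decodes as strict UTF-8.

    Hypothetical helper: reads the stream in chunks instead of all at once.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # Incomplete trailing sequences are buffered until the next chunk.
            decoder.decode(chunk)
        # final=True turns a truncated sequence at end of input into an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8_stream(io.BytesIO(u"h\u00e9llo".encode("utf-8"))))  # True
print(is_valid_utf8_stream(io.BytesIO(b"\xff\xfe")))                     # False
```

Note that the stream is consumed by the check, so a seekable file would need `stream.seek(0)` before being read again.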

14
votes

You can do something like this:

import codecs
try:
    # errors='strict' makes any invalid byte sequence raise UnicodeDecodeError
    with codecs.open(filename, encoding='utf-8', errors='strict') as f:
        for line in f:
            pass
    print("Valid utf-8")
except UnicodeDecodeError:
    print("invalid utf-8")

0
votes

If anyone needs a script to find all non-UTF-8 files in the current directory:

import os

def try_utf8(data):
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None


for root, _, files in os.walk('.'):
    if root.startswith('./.git'):
        continue
    for file in files:
        if file.endswith('.pyc'):
            continue
        path = os.path.join(root, file)
        with open(path, 'rb') as f:
            data = f.read()
            data = try_utf8(data)
            if data is None:
                print(path)

0
votes

In Python 3, you can do this:

with open(filename, 'rb') as f:
    try:
        f.read().decode('UTF-8')
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False

print(is_utf8)
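
If it also matters where the file stops being valid, the caught exception carries that information. The sketch below is an illustration rather than part of the original answer: the helper name `first_invalid_utf8_offset` is hypothetical, and it uses the `start` attribute of `UnicodeDecodeError`, which holds the byte index where decoding failed.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper; `data` is a bytes object such as f.read().
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return exc.start
    return None


print(first_invalid_utf8_offset(b"abc"))        # None
print(first_invalid_utf8_offset(b"ab\xffcd"))   # 2
```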