As the title says, I want to check whether a given file object (opened as a binary stream) is a valid UTF-8 file.
Anyone?
Thanks
def try_utf8(data):
    """Return a Unicode object on success, or None on failure."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None

# f is the file object, opened in binary mode
data = f.read()
udata = try_utf8(data)
if udata is None:
    # Not UTF-8. Do something else
    pass
else:
    # Handle unicode data
    pass
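To see how the helper behaves, here is a minimal sketch that feeds it in-memory binary streams instead of a real file. It assumes Python 3; `io.BytesIO` stands in for the file object `f`, and the sample byte strings are made up for illustration.

```python
import io


def try_utf8(data):
    """Return the decoded str on success, or None if data is not valid UTF-8."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None


# io.BytesIO stands in for a file opened with open(path, 'rb').
valid_stream = io.BytesIO('h\u00e9llo'.encode('utf-8'))
invalid_stream = io.BytesIO(b'\xff\xfe not utf-8')

valid_result = try_utf8(valid_stream.read())
invalid_result = try_utf8(invalid_stream.read())

print(valid_result)    # the decoded text
print(invalid_result)  # None, because the byte 0xFF never occurs in UTF-8
```

Note that the caller should test `is None` rather than truthiness, since an empty file decodes successfully to an empty string.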
You can do something like this:
import codecs

try:
    # errors='strict' makes decoding raise on the first invalid byte sequence
    with codecs.open(filename, encoding='utf-8', errors='strict') as f:
        for line in f:
            pass
    print("Valid utf-8")
except UnicodeDecodeError:
    print("invalid utf-8")
If anyone needs a script to find all non-UTF-8 files in the current directory:
import os

def try_utf8(data):
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None

for root, _, files in os.walk('.'):
    # Skip the Git metadata directory
    if root.startswith('./.git'):
        continue
    for file in files:
        # Skip compiled Python bytecode
        if file.endswith('.pyc'):
            continue
        path = os.path.join(root, file)
        with open(path, 'rb') as f:
            data = f.read()
        data = try_utf8(data)
        if data is None:
            print(path)
In Python 3, you can do this:
with open(filename, 'rb') as f:
    try:
        f.read().decode('UTF-8')
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False

print(is_utf8)
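Reading the entire file into memory before decoding can be costly for large files. A chunked check with the standard library's incremental decoder may be preferable in that case. The following is only a sketch under that assumption; the function name `is_valid_utf8` and the `chunk_size` default are illustrative choices, not something taken from the answers here.

```python
import codecs
import io


def is_valid_utf8(fileobj, chunk_size=64 * 1024):
    """Return True if the binary stream fileobj contains only valid UTF-8."""
    decoder = codecs.getincrementaldecoder('utf-8')(errors='strict')
    try:
        while True:
            chunk = fileobj.read(chunk_size)
            if not chunk:
                break
            # The decoder buffers incomplete multi-byte sequences between calls.
            decoder.decode(chunk)
        # final=True raises if the stream ends in the middle of a character.
        decoder.decode(b'', final=True)
    except UnicodeDecodeError:
        return False
    return True


# Demo with in-memory streams; a chunk size of 1 forces multi-byte
# characters to be split across reads.
print(is_valid_utf8(io.BytesIO('na\u00efve caf\u00e9'.encode('utf-8')), chunk_size=1))
print(is_valid_utf8(io.BytesIO(b'abc\xe2\x82'), chunk_size=1))
```

Because it takes an already opened binary file object, this variant matches the situation described in the question, and it works equally with `open(filename, 'rb')`.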