Searching a file for non-Unicode characters

Question (votes: 0, answers: 1)

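I have a block of text (an excerpt from a db) and I want to find the non-Unicode characters in it. When one part of my Python 3.6 code converts the value to str, I get the following error: ValueError: character U+ffffffc2 is not in range [U+0000; U+10ffff]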

If I can find the non-Unicode characters, I can decide how to handle them. I definitely don't want to replace them with something else.

I found out how to find non-ASCII characters in a file with grep --color='auto' -P -n '[^\x00-\x7F]' file_name.txt, but I'm not sure whether this also gives me the non-Unicode characters.
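The grep pattern targets everything outside the ASCII range, so it would also flag valid non-ASCII text such as é. Below is a minimal Python sketch of the narrower check, assuming the excerpt can be read as raw bytes and is supposed to be UTF-8. It reports the offset of every byte span that a strict UTF-8 decoder rejects and modifies nothing. The function name find_invalid_utf8 is hypothetical, and the sample bytes are the ones shown in the hex dump in the answer.

```python
from typing import List, Tuple


def find_invalid_utf8(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (offset, offending_bytes) for each span strict UTF-8 rejects."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises at the first malformed span.
            data[pos:].decode("utf-8")
            break  # everything from pos onward is valid
        except UnicodeDecodeError as err:
            # err.start/err.end are relative to the slice, so shift by pos.
            start, end = pos + err.start, pos + err.end
            bad.append((start, data[start:end]))
            pos = end  # resume scanning right after the bad span
    return bad


# Sample bytes taken from the hex dump in the answer (assumed input).
sample = b"abc\xfe\x83\xbf\xbf\xbf\xbf\x82xyz"
for offset, chunk in find_invalid_utf8(sample):
    print(f"offset {offset}: {chunk!r}")
```

For a real file, the bytes could come from open("file_name.txt", "rb").read(). Re-decoding the remaining slice after every error is simple but slow on large inputs with many bad bytes; it is a sketch, not a tuned scanner.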

unicode character-encoding

1 Answer

0 votes

See http://p3rl.org/Encode#coderef-for-CHECK

# contains U+ffffffc2 in an extended UTF-8-style encoding (not valid standard UTF-8)
› hex nonunicodefile
0000  61 62 63 fe 83 bf bf bf  bf 82 78 79 7a           abc..... ..xyz

› perl -MEncode -lne'
    # replace junk with empty string
    my $line = decode "UTF-8", $_, sub { "" };
    print encode "UTF-8", $line;
' < nonunicodefile
abcxyz
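Since the question is about Python 3.6, here is a rough Python counterpart to the Perl callback, offered as a sketch rather than a drop-in fix. It uses codecs.register_error to install a decode error handler that records each malformed span before dropping it, so the junk can be inspected instead of silently lost. The handler name report_and_drop and the found list are hypothetical names chosen for illustration.

```python
import codecs

found = []  # (offset, offending_bytes) pairs collected by the handler


def report_and_drop(err):
    """Decode error handler: log the malformed span, then skip it."""
    if not isinstance(err, UnicodeDecodeError):
        raise err  # this sketch only handles decoding errors
    found.append((err.start, err.object[err.start:err.end]))
    # Return the replacement text and the position to resume decoding at.
    return ("", err.end)


codecs.register_error("report_and_drop", report_and_drop)

# Same bytes as the hex dump of nonunicodefile.
raw = b"abc\xfe\x83\xbf\xbf\xbf\xbf\x82xyz"
text = raw.decode("utf-8", errors="report_and_drop")
print(text)
for offset, chunk in found:
    print(f"dropped {chunk!r} at offset {offset}")
```

Returning a placeholder string instead of "" from the handler would keep a visible marker in the decoded text, which may suit the goal of deciding case by case what to do with each bad sequence.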

