使用Encode :: encode with“utf8”

问题描述 投票:5回答:1

你可能知道,在Perl中,“utf8”意味着Perl对UTF-8的理解更为宽松,它允许在技术上不是UTF-8中有效代码点的字符。相比之下,“UTF-8”(或“utf-8”)是Perl对UTF-8更严格的理解,它不允许无效的代码点。

我有一些与此区别相关的使用问题:

  1. 默认情况下,Encode :: encode将使用替换字符替换无效字符。即使您将更宽松的“utf8”作为编码传递,这是真的吗?
  2. 当您使用“UTF-8”读取和写入open文件时会发生什么?角色替换是否发生在糟糕的角色上或是否发生了其他事情?
  3. 使用open和'>:utf8'这样的层和像'>:encoding(utf8)'这样的层有什么区别?这两种方法都可以同时用于'utf8'和'UTF-8'吗?
perl
1个回答
10
投票
                   ╔════════════════════════════════════════════╤══════════════════════╗
                   ║                                            │                      ║
                   ║                  On Read                   │       On Write       ║
                   ║                                            │                      ║
        Perl       ╟─────────────────────┬──────────────────────┼──────────────────────╢
        5.26       ║                     │                      │                      ║
                   ║ Invalid encoding    │ Outside of Unicode,  │ Outside of Unicode,  ║
                   ║ other than sequence │ Unicode nonchar, or  │ Unicode nonchar, or  ║
                   ║ length              │ Unicode surrogate    │ Unicode surrogate    ║
                   ║                     │                      │                      ║
╔══════════════════╬═════════════════════╪══════════════════════╪══════════════════════╣
║                  ║                     │                      │                      ║
║ :encoding(UTF-8) ║ Warns and Replaces  │ Warns and Replaces   │ Warns and Replaces   ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :encoding(utf8)  ║ Warns and Replaces  │ Accepts              │ Warns and Accepts    ║
║                  ║                     │                      │                      ║
╟──────────────────╫─────────────────────┼──────────────────────┼──────────────────────╢
║                  ║                     │                      │                      ║
║ :utf8            ║ Corrupt scalar      │ Accepts              │ Warns and Accepts    ║
║                  ║                     │                      │                      ║
╚══════════════════╩═════════════════════╧══════════════════════╧══════════════════════╝

Click here if you have trouble viewing the above table

请注意,:encoding(UTF-8)实际上使用utf8解码,然后检查范围是否允许字符(因为它识别"\x{20_000}"甚至"\x{1000_0000_0000_0000}")。这减少了错误消息的数量,因此它很好。

(编码名称不区分大小写。)


用于生成上表的测试:

在读

  • :encoding(UTF-8) printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" | perl -MB -nle' use open ":std", ":encoding(UTF-8)"; my $sv = B::svref_2object(\$_); printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_); ' utf8 "\xFFFF" does not map to Unicode. utf8 "\xD800" does not map to Unicode. utf8 "\x200000" does not map to Unicode. utf8 "\x80" does not map to Unicode. E9 (internal: C3.A9, UTF8=1) 5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1) 5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1) 5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1) 5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
  • :encoding(utf8) $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" | perl -MB -nle' use open ":std", ":encoding(utf8)"; my $sv = B::svref_2object(\$_); printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_); ' utf8 "\x80" does not map to Unicode. E9 (internal: C3.A9, UTF8=1) FFFF (internal: EF.BF.BF, UTF8=1) D800 (internal: ED.A0.80, UTF8=1) 200000 (internal: F8.88.80.80.80, UTF8=1) 5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
  • :utf8 $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" | perl -MB -nle' use open ":std", ":encoding(utf8)"; my $sv = B::svref_2object(\$_); printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_); ' E9 (internal: C3.A9, UTF8=1) FFFF (internal: EF.BF.BF, UTF8=1) D800 (internal: ED.A0.80, UTF8=1) 200000 (internal: F8.88.80.80.80, UTF8=1) Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5. 0 (internal: 80, UTF8=1)

在写

  • :encoding(UTF-8) $ perl -e' use open ":std", ":encoding(UTF-8)"; print "\x{E9}\n"; print "\x{FFFF}\n"; print "\x{D800}\n"; print "\x{20_0000}\n"; ' >a Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4. Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5. Code point 0x200000 is not Unicode, may not be portable in print at -e line 6. "\x{ffff}" does not map to utf8. "\x{d800}" does not map to utf8. "\x{200000}" does not map to utf8. $ od -t c a 0000000 303 251 \n \ x { F F F F } \n \ x { D 0000020 8 0 0 } \n \ x { 2 0 0 0 0 0 } \n 0000040 $ cat a é \x{FFFF} \x{D800} \x{200000}
  • :encoding(utf8) $ perl -e' use open ":std", ":encoding(utf8)"; print "\x{E9}\n"; print "\x{FFFF}\n"; print "\x{D800}\n"; print "\x{20_0000}\n"; ' >a Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4. Code point 0x200000 is not Unicode, may not be portable in print at -e line 5. $ od -t c a 0000000 303 251 \n 355 240 200 \n 370 210 200 200 200 \n 0000015 $ cat a é ▒ ▒
  • :utf8:encoding(utf8)相同的结果。

使用Perl 5.26测试。


默认情况下,Encode :: encode将使用替换字符替换无效字符。即使您将更宽松的“utf8”作为编码传递,这是真的吗?

Perl字符串是32位或64位字符的字符串,具体取决于构建。 utf8可以编码任何72位整数。因此,它能够编码可以被要求编码的所有字符。

© www.soinside.com 2019 - 2024. All rights reserved.