Incorrect Unicode normalization result with libicu in C++


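I am implementing a C++ class with a static method that converts a UTF-8 string from Normalization Form D (NFD) to Normalization Form C (NFC).

The current implementation of the method is: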

std::string Graphemes::normalizeToNFC(const std::string& input) {
  icu::UnicodeString source = icu::UnicodeString::fromUTF8(input);
  icu::UnicodeString target;

  UErrorCode status = U_ZERO_ERROR;
  icu::Normalizer::normalize(source, UNORM_NFC, 0, target, status);

  if (U_FAILURE(status)) {
    CONSOLELOG_ERROR("Normalization failed: %s", u_errorName(status));
    return {};
  }

  std::string result;
  target.toUTF8String(result);
  return result;
}

My unit test for this method is:

TEST(ConversionTest, Utf8ToLatin2) {
  // "Soluções e Ações" at Normal Form D
  char input_UTF8_NFD[] = "\x53\x6F\x6C\x75\xE7\xF5\x65\x20\x65\x20\x41\xE7\xF5\x65\x73\x00";
  // "Soluções e Ações" at Normal Form C
  char expected_UTF8_NFC[] = "\x53\x6F\x6C\x75\xC3\xA7\xC3\xB5\x65\x73\x20\x65\x20\x41\xC3\xA7\xC3\xB5\x65\x73\x0A\x00";

  std::string NFCStr = Graphemes::normalizeToNFC(input_UTF8_NFD);
  std::string expected = expected_UTF8_NFC;

  ASSERT_EQ(expected, NFCStr);
}

I ran the test, but it failed with the following mismatch:

Expected equality of these values:
  expected
    Which is: "Solu\xC3\xA7\xC3\xB5" "es e A\xC3\xA7\xC3\xB5" "es\n"
    As Text: "Soluções e Ações
"
  rc1
    Which is: "Solu\xEF\xBF\xBD\xEF\xBF\xBD" "e e A\xEF\xBF\xBD\xEF\xBF\xBD" "es"
    As Text: "Solu��e e A��es"
With diff:
@@ -1,2 +1,1 @@
-Solu\xC3\xA7\xC3\xB5" "es e A\xC3\xA7\xC3\xB5" "es
-"
    As Text: "Soluções e Ações

+Solu\xEF\xBF\xBD\xEF\xBF\xBD" "e e A\xEF\xBF\xBD\xEF\xBF\xBD" "es"
    As Text: "Solu��e e A��es

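It looks as if libicu is not converting between the normalization forms correctly in this case. Can you suggest what is wrong with this implementation?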

c++ unicode normalization
1 Answer

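Following the hints in the comments, I found that the problem was in my test data, not in libicu. The original input was not UTF-8 in NFD at all: \xE7 and \xF5 are the Latin-1 bytes for "ç" and "õ", and the input was also missing the "s" of "Soluções". In NFD each of those letters is a base letter followed by a combining mark (c + U+0327, o + U+0303). The expected string also ended with a stray \x0A. With both strings corrected, the test looks like this: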

TEST(NormalizationTest, NFD_To_NFC) {
  // NFD and NFC sequences for "Soluções e Ações" :D
  char input_UTF8_NFD[] = "\x53\x6F\x6C\x75\x63\xCC\xA7\x6F\xCC\x83\x65\x73\x20\x65\x20\x41\x63\xCC\xA7\x6F\xCC\x83\x65\x73\x00";
  char expected_UTF8_NFC[] = "\x53\x6F\x6C\x75\xC3\xA7\xC3\xB5\x65\x73\x20\x65\x20\x41\xC3\xA7\xC3\xB5\x65\x73\x00";

  std::string NFCStr = Graphemes::normalizeToNFC(input_UTF8_NFD);
  std::string expected = expected_UTF8_NFC;

  ASSERT_EQ(expected, NFCStr);
}

It now works as expected.
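For anyone who sees the same symptom: the \xEF\xBF\xBD bytes in the failing output are the UTF-8 encoding of U+FFFD, the replacement character that icu::UnicodeString::fromUTF8 substitutes for ill-formed input. A cheap way to catch this early is to validate the bytes before normalizing. The following is only a minimal sketch using the standard library; isValidUtf8 is a hypothetical helper name of my own, not an ICU function, and it is not meant as a replacement for ICU's own handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ICU): returns true if `s` is well-formed UTF-8.
// It rejects stray continuation bytes, truncated sequences, overlong encodings,
// UTF-16 surrogates and code points above U+10FFFF.
bool isValidUtf8(const std::string& s) {
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    if (lead < 0x80) {  // ASCII, single byte
      ++i;
      continue;
    }

    std::size_t len = 0;         // total length of the sequence in bytes
    unsigned long cp = 0;        // decoded code point
    unsigned long minValue = 0;  // smallest code point allowed for this length
    if (lead >= 0xC2 && lead <= 0xDF) {
      len = 2; cp = lead & 0x1F; minValue = 0x80;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
      len = 3; cp = lead & 0x0F; minValue = 0x800;
    } else if (lead >= 0xF0 && lead <= 0xF4) {
      len = 4; cp = lead & 0x07; minValue = 0x10000;
    } else {
      return false;  // continuation byte in lead position, or 0xC0, 0xC1, 0xF5..0xFF
    }

    if (i + len > n) return false;  // sequence is cut off at the end of the string

    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return false;  // not a continuation byte (10xxxxxx)
      cp = (cp << 6) | (c & 0x3F);
    }

    if (cp < minValue) return false;                  // overlong encoding
    if (cp > 0x10FFFF) return false;                  // beyond the Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate code point

    i += len;
  }
  return true;
}
```

Called on the input from the question ("\x53\x6F\x6C\x75\xE7\xF5..."), this returns false, because \xE7 announces a three-byte sequence and \xF5 is not a continuation byte. Called on the corrected NFD input from the test above, it returns true. A check like this at the top of normalizeToNFC, or an assertion in the test, would have pointed at the test data instead of at libicu.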
